[
https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dinesh S. Atreya reassigned HADOOP-12620:
-----------------------------------------
Assignee: Dinesh S. Atreya
> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
> Key: HADOOP-12620
> URL: https://issues.apache.org/jira/browse/HADOOP-12620
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Dinesh S. Atreya
> Assignee: Dinesh S. Atreya
>
> h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
> The main motivation for this JIRA is to address a comprehensive set of
> use-cases with only minimal enhancements to Hadoop, transitioning Hadoop
> toward an advanced/cloud data architecture.
> HDFS traditionally had a write-once-read-many access model for files until
> the “[Append to files in HDFS |
> https://issues.apache.org/jira/browse/HADOOP-1700]” capability was
> introduced. The next minimal enhancement to core Hadoop is the capability to
> do “updates-in-place” in HDFS:
> • Support seeks for writes (in addition to reads).
> • After a seek, if the new byte length is the same as the old byte length,
> an in-place update is allowed.
> • A delete is an update that writes an appropriate delete marker.
> • If the byte length is different, the old entry is marked as deleted and
> the new one is appended, as before.
> • It is the client’s discretion to perform an update, an append, or both;
> the API changes in the different Hadoop components should expose these
> capabilities (a sketch is given after this list).
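> A minimal sketch of what such a client-facing API could look like. Everything
> here is hypothetical: the {{UpdatableOutputStream}} name, its methods, and
> the delete-marker convention are illustrative and are not an existing Hadoop
> API.
> {code:java}
> import java.io.IOException;
>
> /**
>  * Hypothetical seek-for-write API; none of these methods exist in
>  * Hadoop today. They sketch the semantics described above.
>  */
> public interface UpdatableOutputStream {
>
>   /** Position the write pointer, analogous to seek-for-read. */
>   void seek(long pos) throws IOException;
>
>   /**
>    * Overwrite exactly data.length bytes at the current position.
>    * Allowed only when the new length equals the old length, so the
>    * offsets of all later records are unchanged.
>    */
>   void updateInPlace(byte[] data) throws IOException;
>
>   /**
>    * Mark the record at the current position as deleted. A delete is
>    * itself a same-length in-place update of a delete-marker field,
>    * so no hole is created.
>    */
>   void markDeleted() throws IOException;
>
>   /**
>    * Length-changing update: mark the old record as deleted, then
>    * append the new bytes at the end (existing append semantics).
>    */
>   void replaceByAppend(byte[] newData) throws IOException;
> }
> {code}
> Records that may grow can reserve buffer space up front so that later updates
> still fit within the original byte length, which is why buffer space is
> counted as part of the length below.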
> Please note that this JIRA is limited to one specific type of update:
> in-place updates that do not change the byte length (e.g., reserved buffer
> space is included in the length). Updates that change the byte length are not
> supported in place and are treated as appends/inserts. Similarly, deletes
> that would create holes are not supported. The reason is simple:
> fragmentation and holes cause performance penalties, complicate the
> implementation, and could require extensive changes to Hadoop, so they are
> out of scope.
> These minimal changes will lay the basis for transforming core Hadoop into an
> interactive, real-time platform and for introducing significant native
> capabilities. They lay a foundation for all of the following processing
> styles to be supported natively and dynamically:
> • Real time
> • Mini-batch
> • Stream based data processing
> • Batch, which is the default today.
> Hadoop engines can dynamically choose which processing style to use based on
> the type and volume of the data sets, enhancing or replacing prevailing
> approaches.
> With this, Hadoop engines can evolve to utilize modern CPU, memory, and I/O
> resources with increasing efficiency. The Hadoop task engines can use
> vectorized/pipelined processing and make greater use of memory throughout the
> Hadoop platform.
> These will enable enhanced performance optimizations to be implemented in
> HDFS and made available to all the Hadoop components, enabling fast
> processing of Big Data and improving on all three of its characteristics:
> volume, velocity, and variety.
> There are many influences for this umbrella JIRA:
> • Preserve and Accelerate Hadoop
> • Efficient Data Management of a variety of Data Formats natively in Hadoop
> • Enterprise Expansion
> • Internet and Media
> • Databases offer native support for a variety of Data Formats and features
> such as JSON, XML, Indexes, and Temporal data – Hadoop should do the same.
> It is quite probable that many sub-JIRAs will be created to address portions
> of this work. This JIRA captures a variety of use-cases in one place. Some
> initial data management/platform use-cases are given hereunder.
> h2. WEB
> With the AHA (Advanced Hadoop Architecture) enhancements, a variety of Web
> standards can be natively supported, such as updatable JSON
> [http://json.org/], XML, RDF, and other documents.
> While Hadoop’s origins can be traced to the Web, some of the [Web standards |
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html] are not completely
> supported natively in Hadoop, such as HTTP PUT and PATCH (PUT and POST are
> only partially supported, in terms of creation). With the proposed
> enhancements, all of POST, PUT, and PATCH (a newer addition to the Web
> standards) can be completely and natively supported (in addition to GET)
> through Hadoop.
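> For illustration, WebHDFS today already maps file operations onto HTTP
> methods (GET for OPEN, PUT for CREATE, POST for APPEND, DELETE for DELETE);
> PATCH has no mapping today, and with the proposed in-place update it could
> map naturally to a positional overwrite. The PATCH mapping is this proposal’s
> suggestion, not an existing WebHDFS operation.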
> Hypertext Transfer Protocol -- HTTP/1.1 ([Original RFC |
> http://tools.ietf.org/html/rfc2616], [Current RFC |
> http://tools.ietf.org/html/rfc7231] )
> Current RFCs:
> • [ Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing |
> http://tools.ietf.org/html/rfc7230]
> • [ Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content |
> http://tools.ietf.org/html/rfc7231 ]
> • [ Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests |
> http://tools.ietf.org/html/rfc7232 ]
> • [ Hypertext Transfer Protocol (HTTP/1.1): Range Requests |
> http://tools.ietf.org/html/rfc7233 ]
> • [ Hypertext Transfer Protocol (HTTP/1.1): Caching |
> http://tools.ietf.org/html/rfc7234 ]
> • [ Hypertext Transfer Protocol (HTTP/1.1): Authentication |
> http://tools.ietf.org/html/rfc7235 ]
>
> h3. HTTP PATCH RFC
> RFC ([PATCH Method for HTTP |
> http://tools.ietf.org/html/rfc5789#section-9.1]) provides direct support for
> updates.
> Roy Fielding himself said that [PATCH was something he created for the
> initial HTTP/1.1 proposal because partial PUT is never RESTful |
> https://twitter.com/fielding/status/275471320685367296 ]. With HTTP PATCH
> you are not transferring a complete representation, but REST does not require
> representations to be complete anyway.
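> For reference, a PATCH request has the following general shape (an
> illustrative example in the style of RFC 5789; the patch document format is
> identified by its media type):
> {noformat}
> PATCH /user/data/file.txt HTTP/1.1
> Host: example.org
> Content-Type: application/example-patch
> If-Match: "e0023aa4e"
>
> [patch document describing the change]
> {noformat}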
> The PATCH method is not idempotent. With the proposed enhancement, we can now
> formalize the behavior and provide feedback to the Web standard RFC:
> • If the update can be carried out in place, it is idempotent: writing the
> same bytes at the same offset twice leaves the data in the same state.
> • If the update produces new data (the first entry is marked as deleted,
> along with a corresponding insert/append), it is not idempotent, since each
> application appends another copy.
> h3. JSON
> Some RFCs for JSON are given hereunder.
> • [JavaScript Object Notation (JSON) Patch |
> http://tools.ietf.org/html/rfc6902 ]
> • [JSON Merge Patch | https://tools.ietf.org/html/rfc7386 ]
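> As an illustration, a minimal sketch of applying an RFC 6902 patch using the
> JSON-P 1.1 ({{javax.json}}) API. JSON-P is used here only as a convenient,
> existing implementation of the RFC; it is not part of this proposal, and the
> document contents are made up for the example.
> {code:java}
> import java.io.StringReader;
> import javax.json.Json;
> import javax.json.JsonObject;
> import javax.json.JsonPatch;
>
> public class JsonPatchExample {
>   public static void main(String[] args) {
>     // A small JSON document to be updated.
>     JsonObject doc = Json.createReader(
>         new StringReader("{\"city\":\"Smallville\",\"zip\":\"12345\"}"))
>         .readObject();
>     // RFC 6902 patch: replace one member, leave the rest untouched.
>     JsonPatch patch = Json.createPatchBuilder()
>         .replace("/city", "Bigtown")
>         .build();
>     JsonObject patched = patch.apply(doc);
>     System.out.println(patched); // {"city":"Bigtown","zip":"12345"}
>   }
> }
> {code}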
> h3. RDF
> RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
> RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/
> The simplest triple statement is a sequence of (subject, predicate, object)
> terms, separated by whitespace and terminated by '.' after each triple.
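> For example, a single N-Triples statement (in the style of the examples in
> the N-Triples specification):
> {noformat}
> <http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show" .
> {noformat}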
> h2. Mobile Apps Data and Resources
> With the proposed enhancements, app data and resources can, in addition to
> the Web, also be managed using Hadoop. Examples of such usage include app
> data and resources for Apple and other app stores.
> About Apps Resources:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
>
> On-Demand Resources Essentials:
> https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
>
> Resource Programming Guide:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
>
> h2. Natural Support for ETL and Analytics
> With native support for updates and deletes in addition to appends/inserts,
> Hadoop will have proper and natural support for ETL and Analytics.
> h2. Key-Value Store
> With the proposed enhancements, it will become straightforward to implement a
> key-value store natively in Hadoop.
> h2. MVCC (Multi Version Concurrency Control)
> A modified example of how MVCC can be implemented with the proposed
> enhancements, adapted from PostgreSQL’s MVCC documentation, is given
> hereunder.
> https://wiki.postgresql.org/wiki/MVCC
> http://momjian.us/main/writings/pgsql/mvcc.pdf
> || Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
> | 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. In order to keep the update size fixed, MAX_VAL is pre-seeded for our purposes. |
> | 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
> | 2 | Update (old delete) | 64 | 78 | Mark the old data as DELETE. |
> | 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |
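> A minimal sketch of the visibility rule that this table implies, assuming
> each reader carries a snapshot counter c. The class and field names are
> illustrative only; they do not come from Hadoop or PostgreSQL code.
> {code:java}
> /**
>  * Sketch of MVCC visibility using the create/expiry counters above.
>  * A row version is visible to a reader with snapshot counter c if it
>  * was created at or before c and had not yet expired at c.
>  */
> public final class RowVersion {
>   static final long MAX_VAL = Long.MAX_VALUE; // pre-seeded "not expired"
>
>   final long createCounter; // counter when this version was written
>   long expiryCounter;       // counter when it was deleted/superseded
>
>   RowVersion(long createCounter) {
>     this.createCounter = createCounter;
>     this.expiryCounter = MAX_VAL; // pre-seeded so expiring is a same-length update
>   }
>
>   /** Delete/supersede: a same-length in-place overwrite of the expiry field. */
>   void expire(long counter) {
>     this.expiryCounter = counter;
>   }
>
>   /** Visibility check for a reader with snapshot counter c. */
>   boolean visibleAt(long c) {
>     return createCounter <= c && c < expiryCounter;
>   }
> }
> {code}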
> h2. Graph Stores
> Enable native storage and processing for a variety of graph stores.
> h3. Graph Store 1 (Spark GraphX)
> 1. EdgeTable(pid, src, dst, data): stores the adjacency
> structure and edge data. Each edge is represented as a
> tuple consisting of the source vertex id, destination vertex id,
> and user-defined data as well as a virtual partition identifier
> (pid). Note that the edge table contains only the vertex ids
> and not the vertex data. The edge table is partitioned by the
> pid
> 2. VertexDataTable(id, data): stores the vertex data,
> in the form of a vertex (id, data) pairs. The vertex data table
> is indexed and partitioned by the vertex id.
> 3. VertexMap(id, pid): provides a mapping from the id
> of a vertex to the ids of the virtual partitions that contain
> adjacent edges.
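> A minimal sketch of how the three tables cooperate to answer a neighbor
> query. Plain in-memory maps are used purely for illustration; GraphX itself
> implements these as distributed, partitioned collections.
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> /** Illustrative in-memory version of the three GraphX tables. */
> public class GraphTables {
>   record Edge(int pid, long src, long dst, String data) {}
>
>   // EdgeTable, grouped by virtual partition id (pid).
>   final Map<Integer, List<Edge>> edgeTable = new HashMap<>();
>   // VertexDataTable: vertex id -> vertex data.
>   final Map<Long, String> vertexDataTable = new HashMap<>();
>   // VertexMap: vertex id -> pids of the partitions holding its edges.
>   final Map<Long, List<Integer>> vertexMap = new HashMap<>();
>
>   /** Out-neighbors of v: consult VertexMap, then scan only those partitions. */
>   List<Long> outNeighbors(long v) {
>     List<Long> result = new ArrayList<>();
>     for (int pid : vertexMap.getOrDefault(v, List.of())) {
>       for (Edge e : edgeTable.getOrDefault(pid, List.of())) {
>         if (e.src() == v) {
>           result.add(e.dst());
>         }
>       }
>     }
>     return result;
>   }
> }
> {code}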
> h3. Graph Store 2 (Facebook Social Graph - TAO)
> Object: (id) → (otype, (key → value)*)
> Assoc.: (id1, atype, id2) → (time, (key → value)*)
> TAO: Facebook’s Distributed Data Store for the Social Graph
> https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
>
> https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf
>
> TAO: The power of the graph
> https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920
>
> h2. Temporal Data
> https://en.wikipedia.org/wiki/Temporal_database
> https://en.wikipedia.org/wiki/Valid_time
> In a temporal database, rows may be updated to reflect corrections and
> changes over time. For example, upon learning that John Doe also lived in
> Beachy, the data changes from
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
> to
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
> Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
> Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
> h2. Media
> Media production typically involves many changes and updates prior to
> release. The enhancements will lay a basis for the full lifecycle to be
> managed in the Hadoop ecosystem.
> h2. Indexes
> With these changes, a variety of updatable indexes can be supported natively
> in Hadoop. Search software such as Solr, Elasticsearch, etc. can then in turn
> leverage Hadoop’s enhanced native capabilities.
> h2. Google References
> While Google’s research in this area is instructive (some extracts are listed
> hereunder), the evolution of Hadoop can follow its own path. The proposed
> in-place-update support in core Hadoop will enable, and make easier, a
> variety of enhancements to each of the Hadoop components, with the many
> influences indicated in this JIRA.
> We propose a basis for a system that incrementally processes updates to large
> data sets, reducing the overhead of always having to run large batches.
> Hadoop engines can dynamically choose which processing style to use based on
> the type and volume of the data sets, enhancing or replacing prevailing
> approaches.
> || Year || Title || Links ||
> | 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
> | 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
> | 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
> | 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
> | 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
> | 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
> | 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
> | 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
> | 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
> | 2011 | Tenzing: A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
> | 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
> | 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
> | 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |
> h2. Application Domains
> The enhancements will lay a path for comprehensive support of all application
> domains in Hadoop. A small collection is given hereunder.
> • Data Warehousing and enhanced ETL processing
> • Supply Chain Planning
> • Web Sites
> • Mobile App Stores
> • Financials
> • Media
> • Machine Learning
> • Social Media
> • Enterprise Applications such as ERP, CRM
> Corresponding umbrella JIRAs can be found for each of the following Hadoop
> platform components.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)