[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh S. Atreya reassigned HADOOP-12620:
-----------------------------------------
Assignee: Dinesh S. Atreya

Advanced Hadoop Architecture (AHA) - Common
-------------------------------------------

                Key: HADOOP-12620
                URL: https://issues.apache.org/jira/browse/HADOOP-12620
            Project: Hadoop Common
         Issue Type: New Feature
           Reporter: Dinesh S. Atreya
           Assignee: Dinesh S. Atreya

h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

One main motivation for this JIRA is to address a comprehensive set of use cases with only minimal enhancements to Hadoop, transitioning Hadoop to an Advanced/Cloud Data Architecture.

HDFS has traditionally had a write-once-read-many access model for files, until the "[Append to files in HDFS|https://issues.apache.org/jira/browse/HADOOP-1700]" capability was introduced. The next minimal enhancement to core Hadoop is the capability to do "updates-in-place" in HDFS:
• Support seeks for writes (in addition to reads).
• After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
• A delete is an update with an appropriate delete marker.
• If the byte length differs, the old entry is marked as deleted and the new one is appended as before.
• It is the client's discretion to perform updates, appends, or both; the API changes in the different Hadoop components should provide these capabilities.

Please note that this JIRA is limited to a specific type of update: in-place updates that do not change the byte length (e.g., buffer space is included in the length). Updates that change the byte length are not supported in-place and are treated as appends/inserts. Similarly, deletes that create holes are not supported. The reason is simple: fragmentation and holes cause performance penalties, complicate the process, and would require many changes to Hadoop; they are out of scope.
These minimal changes will lay the basis for transforming core Hadoop into an interactive and real-time platform and for introducing significant native capabilities. The enhancements lay a foundation for all of the following processing styles to be supported natively and dynamically:
• Real time
• Mini-batch
• Stream-based data processing
• Batch, which is the default today.

Hadoop engines can then dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches. With this, Hadoop engines can evolve to use modern CPU, memory, and I/O resources with increasing efficiency; the Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the platform.

These changes enable enhanced performance optimizations to be implemented in HDFS and made available to all Hadoop components. This enables fast processing of Big Data and improves handling of all its characteristics: volume, velocity, and variety.

There are many influences for this umbrella JIRA:
• Preserve and accelerate Hadoop
• Efficient data management of a variety of data formats natively in Hadoop
• Enterprise expansion
• Internet and media
• Databases offer native support for a variety of data formats such as JSON, XML, indexes, temporal data, etc.; Hadoop should do the same.

It is quite probable that many sub-JIRAs will be created to address portions of this work. This JIRA captures a variety of use cases in one place. Some initial data management / platform use cases are given hereunder.

h2. WEB

With the AHA (Advanced Hadoop Architecture) enhancements, a variety of Web standards can be natively supported, such as updatable JSON [http://json.org/], XML, RDF, and other documents.
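As a hedged illustration of how an updatable JSON document could respect the byte-length constraint above, a string field can be rewritten with space padding so the serialized size stays constant. The padding convention and helper name are illustrative assumptions, not part of the proposal; they assume the stored form was produced by a deterministic serializer.

```python
import json

def replace_value_same_length(doc_bytes, key, new_value):
    """Rewrite one string field of a serialized JSON object, space-padding
    the new value so the overall byte length is unchanged (in-place safe)."""
    doc = json.loads(doc_bytes)
    old = doc[key]
    if len(new_value) > len(old):
        raise ValueError("new value longer than old: requires delete+append")
    doc[key] = new_value + " " * (len(old) - len(new_value))  # pad to old length
    out = json.dumps(doc).encode()
    assert len(out) == len(doc_bytes)  # byte length preserved
    return out
```

A document updated this way could be written back with the proposed seek-for-write, whereas a longer value would fall into the delete-marker-plus-append path.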
While Hadoop's origins can be traced to the Web, some of the [Web standards|http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html] are not completely supported natively in Hadoop, such as HTTP PUT and PATCH (PUT and POST are only partially supported, in terms of creation). With the proposed enhancement, all of POST, PUT, and PATCH (a newer addition to the Web standards) can be completely and natively supported (in addition to GET) through Hadoop.

Hypertext Transfer Protocol -- HTTP/1.1 ([Original RFC|http://tools.ietf.org/html/rfc2616], [Current RFC|http://tools.ietf.org/html/rfc7231])

Current RFCs:
• [Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing|http://tools.ietf.org/html/rfc7230]
• [Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|http://tools.ietf.org/html/rfc7231]
• [Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests|http://tools.ietf.org/html/rfc7232]
• [Hypertext Transfer Protocol (HTTP/1.1): Range Requests|http://tools.ietf.org/html/rfc7233]
• [Hypertext Transfer Protocol (HTTP/1.1): Caching|http://tools.ietf.org/html/rfc7234]
• [Hypertext Transfer Protocol (HTTP/1.1): Authentication|http://tools.ietf.org/html/rfc7235]

h3. HTTP PATCH RFC

The RFC ([PATCH Method for HTTP|http://tools.ietf.org/html/rfc5789#section-9.1]) provides direct support for updates. Roy Fielding himself said that [PATCH was something he created for the initial HTTP/1.1 proposal because partial PUT is never RESTful|https://twitter.com/fielding/status/275471320685367296]. With HTTP PATCH you are not transferring a complete representation, but REST does not require representations to be complete anyway.

The PATCH method is not idempotent. With the proposed enhancement, we can formalize its behavior and provide feedback to the Web standard RFC:
• If the update can be carried out in-place, it is idempotent.
• If the update causes new data (the first entry marked as deleted along with a corresponding insert/append), then it is not idempotent.

h3. JSON

Some RFCs for JSON are given hereunder:
• [JavaScript Object Notation (JSON) Patch|http://tools.ietf.org/html/rfc6902]
• [JSON Merge Patch|https://tools.ietf.org/html/rfc7386]

h3. RDF

RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
RDF Triples: http://www.w3.org/TR/2014/REC-n-triples-20140225/

The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.

h2. Mobile Apps Data and Resources

With the proposed enhancements, app data and resources, in addition to the Web, can also be managed using Hadoop. Examples of such usage include app data and resources for Apple and other app stores.

About App Resources:
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html

On-Demand Resources Essentials:
https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/

Resource Programming Guide:
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf

h2. Natural Support for ETL and Analytics

With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and analytics.

h2. Key-Value Store

With the proposed enhancements, it becomes straightforward to implement a key-value store natively in Hadoop.

h2. MVCC (Multi-Version Concurrency Control)

A modified example of how MVCC can be implemented with the proposed enhancements, adapted from PostgreSQL's MVCC, is given hereunder.
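A version-visibility check for a create/expiry counter scheme of the kind shown in the MVCC example can be sketched as follows. The rule (create ≤ reader < expiry) and the MAX_VAL pre-seeding follow the example; the 64-bit counter width is an assumption for illustration.

```python
MAX_VAL = 2**63 - 1  # pre-seeded expiry so the record size never changes

def is_visible(create_counter, expiry_counter, reader_counter):
    """A version is visible to a reader iff it was created at or before the
    reader's counter and had not yet expired at that counter."""
    return create_counter <= reader_counter < expiry_counter
```

Because expiring a version only overwrites the pre-seeded MAX_VAL with a real counter of the same byte length, marking a delete is itself an in-place update under the proposed rules.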
https://wiki.postgresql.org/wiki/MVCC
http://momjian.us/main/writings/pgsql/mvcc.pdf

|| Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
| 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. To keep the update size fixed, MAX_VAL is pre-seeded for our purposes. |
| 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
| 2 | Update (old delete) | 64 | 78 | Mark the old data as DELETE. |
| 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |

h2. Graph Stores

Enable native storage and processing for a variety of graph stores.

h3. Graph Store 1 (Spark GraphX)

1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by pid.
2. VertexDataTable(id, data): stores the vertex data in the form of (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain its adjacent edges.

h3. Graph Store 2 (Facebook Social Graph - TAO)

Object: (id) → (otype, (key → value)*)
Assoc.: (id1, atype, id2) → (time, (key → value)*)

TAO: Facebook's Distributed Data Store for the Social Graph
https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf

TAO: The power of the graph
https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-thegraph/10151525983993920

h2. Temporal Data

https://en.wikipedia.org/wiki/Temporal_database
https://en.wikipedia.org/wiki/Valid_time

In temporal data, records may be updated to reflect changes over time. For example, the data changes from:
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to:
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)

h2. Media

Media production typically involves many changes and updates prior to release. The enhancements will lay a basis for the full media lifecycle to be managed in the Hadoop ecosystem.

h2. Indexes

With these changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr, Elasticsearch, etc. can then in turn leverage Hadoop's enhanced native capabilities.

h2. Google References

While Google's research in this area is interesting (some extracts are listed hereunder), the evolution of Hadoop is interesting in its own right. The proposed in-place-update support in core Hadoop will enable, and make it easier to build, a variety of enhancements in each of the Hadoop components, with the range of influences indicated in this JIRA. We propose a basis for incrementally processing updates to large data sets, reducing the overhead of always having to run large batches. Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches.
|| Year || Title || Links ||
| 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
| 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
| 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
| 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
| 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
| 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
| 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
| 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
| 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
| 2011 | Tenzing: A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
| 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
| 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
| 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |

h2. Application Domains

The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small selection is given hereunder:
• Data warehousing and enhanced ETL processing
• Supply chain planning
• Web sites
• Mobile app stores
• Financials
• Media
• Machine learning
• Social media
• Enterprise applications such as ERP, CRM

Corresponding umbrella JIRAs can be found for each of the Hadoop platform components.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)