[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh S. Atreya reassigned HADOOP-12620:
-----------------------------------------
Assignee: Dinesh S. Atreya

Advanced Hadoop Architecture (AHA) - Common
-------------------------------------------

                Key: HADOOP-12620
                URL: https://issues.apache.org/jira/browse/HADOOP-12620
            Project: Hadoop Common
         Issue Type: New Feature
           Reporter: Dinesh S. Atreya
           Assignee: Dinesh S. Atreya

h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

One main motivation for this JIRA is to address a comprehensive set of use cases with only minimal enhancements to Hadoop, transitioning Hadoop to an Advanced/Cloud Data Architecture.

HDFS has traditionally had a write-once-read-many access model for files, until the "[Append to files in HDFS|https://issues.apache.org/jira/browse/HADOOP-1700]" capability was introduced. The next minimal enhancement to core Hadoop is the capability to do "updates-in-place" in HDFS:
• Support seeks for writes (in addition to reads).
• After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
• A delete is an update with an appropriate delete marker.
• If the byte length differs, the old entry is marked as deleted and the new one is appended as before.
• It is the client's discretion to perform updates, appends, or both; the API changes in the different Hadoop components should provide these capabilities.

Please note that this JIRA is limited to a specific type of update: in-place updates that do not change the byte length (e.g., buffer space is included in the length). Updates that change the byte length are not supported in-place and are treated as appends/inserts. Similarly, deletes that create holes are not supported. The reason is simple: fragmentation and holes cause performance penalties, complicate the process, and would require many changes to Hadoop; they are out of scope.
These minimal changes will lay the basis for transforming core Hadoop into an interactive and real-time platform and for introducing significant native capabilities. The enhancements lay a foundation for all of the following processing styles to be supported natively and dynamically:
• Real time
• Mini-batch
• Stream-based data processing
• Batch, which is the default today.

Hadoop engines can then dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches. With this, Hadoop engines can evolve to use modern CPU, memory, and I/O resources with increasing efficiency; the Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the platform.

These changes enable enhanced performance optimizations to be implemented in HDFS and made available to all Hadoop components. This enables fast processing of Big Data and improves handling of all its characteristics: volume, velocity, and variety.

There are many influences for this umbrella JIRA:
• Preserve and accelerate Hadoop
• Efficient data management of a variety of data formats natively in Hadoop
• Enterprise expansion
• Internet and media
• Databases offer native support for a variety of data formats such as JSON, XML, indexes, temporal data, etc.; Hadoop should do the same.

It is quite probable that many sub-JIRAs will be created to address portions of this work. This JIRA captures a variety of use cases in one place. Some initial data management / platform use cases are given hereunder.

h2. WEB

With the AHA (Advanced Hadoop Architecture) enhancements, a variety of Web standards can be natively supported, such as updatable JSON [http://json.org/], XML, RDF, and other documents.
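As a hedged illustration of how an updatable JSON document could respect the byte-length constraint above, a string field can be rewritten with space padding so the serialized size stays constant. The padding convention and helper name are illustrative assumptions, not part of the proposal; they assume the stored form was produced by a deterministic serializer.

```python
import json

def replace_value_same_length(doc_bytes, key, new_value):
    """Rewrite one string field of a serialized JSON object, space-padding
    the new value so the overall byte length is unchanged (in-place safe)."""
    doc = json.loads(doc_bytes)
    old = doc[key]
    if len(new_value) > len(old):
        raise ValueError("new value longer than old: requires delete+append")
    doc[key] = new_value + " " * (len(old) - len(new_value))  # pad to old length
    out = json.dumps(doc).encode()
    assert len(out) == len(doc_bytes)  # byte length preserved
    return out
```

A document updated this way could be written back with the proposed seek-for-write, whereas a longer value would fall into the delete-marker-plus-append path.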
While Hadoop's origins can be traced to the Web, some of the [Web standards|http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html] are not completely supported natively in Hadoop, such as HTTP PUT and PATCH (PUT and POST are only partially supported, in terms of creation). With the proposed enhancement, all of POST, PUT, and PATCH (a newer addition to the Web standards) can be completely and natively supported (in addition to GET) through Hadoop.

Hypertext Transfer Protocol -- HTTP/1.1 ([Original RFC|http://tools.ietf.org/html/rfc2616], [Current RFC|http://tools.ietf.org/html/rfc7231])

Current RFCs:
• [Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing|http://tools.ietf.org/html/rfc7230]
• [Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|http://tools.ietf.org/html/rfc7231]
• [Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests|http://tools.ietf.org/html/rfc7232]
• [Hypertext Transfer Protocol (HTTP/1.1): Range Requests|http://tools.ietf.org/html/rfc7233]
• [Hypertext Transfer Protocol (HTTP/1.1): Caching|http://tools.ietf.org/html/rfc7234]
• [Hypertext Transfer Protocol (HTTP/1.1): Authentication|http://tools.ietf.org/html/rfc7235]

h3. HTTP PATCH RFC

The RFC ([PATCH Method for HTTP|http://tools.ietf.org/html/rfc5789#section-9.1]) provides direct support for updates. Roy Fielding himself said that [PATCH was something he created for the initial HTTP/1.1 proposal because partial PUT is never RESTful|https://twitter.com/fielding/status/275471320685367296]. With HTTP PATCH you are not transferring a complete representation, but REST does not require representations to be complete anyway.

The PATCH method is not idempotent. With the proposed enhancement, we can formalize its behavior and provide feedback to the Web standard RFC:
• If the update can be carried out in-place, it is idempotent.
• If the update causes new data (the first entry marked as deleted along with a corresponding insert/append), then it is not idempotent.

h3. JSON

Some RFCs for JSON are given hereunder:
• [JavaScript Object Notation (JSON) Patch|http://tools.ietf.org/html/rfc6902]
• [JSON Merge Patch|https://tools.ietf.org/html/rfc7386]

h3. RDF

RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
RDF Triples: http://www.w3.org/TR/2014/REC-n-triples-20140225/

The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.

h2. Mobile Apps Data and Resources

With the proposed enhancements, app data and resources, in addition to the Web, can also be managed using Hadoop. Examples of such usage include app data and resources for Apple and other app stores.

About App Resources:
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html

On-Demand Resources Essentials:
https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/

Resource Programming Guide:
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf

h2. Natural Support for ETL and Analytics

With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and analytics.

h2. Key-Value Store

With the proposed enhancements, it becomes straightforward to implement a key-value store natively in Hadoop.

h2. MVCC (Multi-Version Concurrency Control)

A modified example of how MVCC can be implemented with the proposed enhancements, adapted from PostgreSQL's MVCC, is given hereunder.
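A version-visibility check for a create/expiry counter scheme of the kind shown in the MVCC example can be sketched as follows. The rule (create ≤ reader < expiry) and the MAX_VAL pre-seeding follow the example; the 64-bit counter width is an assumption for illustration.

```python
MAX_VAL = 2**63 - 1  # pre-seeded expiry so the record size never changes

def is_visible(create_counter, expiry_counter, reader_counter):
    """A version is visible to a reader iff it was created at or before the
    reader's counter and had not yet expired at that counter."""
    return create_counter <= reader_counter < expiry_counter
```

Because expiring a version only overwrites the pre-seeded MAX_VAL with a real counter of the same byte length, marking a delete is itself an in-place update under the proposed rules.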
https://wiki.postgresql.org/wiki/MVCC
http://momjian.us/main/writings/pgsql/mvcc.pdf

|| Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
| 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. To keep the update size fixed, MAX_VAL is pre-seeded for our purposes. |
| 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
| 2 | Update (old delete) | 64 | 78 | Mark the old data as DELETE. |
| 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |

h2. Graph Stores

Enable native storage and processing for a variety of graph stores.

h3. Graph Store 1 (Spark GraphX)

1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by pid.
2. VertexDataTable(id, data): stores the vertex data in the form of (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain its adjacent edges.

h3. Graph Store 2 (Facebook Social Graph - TAO)

Object: (id) → (otype, (key → value)*)
Assoc.: (id1, atype, id2) → (time, (key → value)*)

TAO: Facebook's Distributed Data Store for the Social Graph
https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf

TAO: The power of the graph
https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-thegraph/10151525983993920

h2. Temporal Data

https://en.wikipedia.org/wiki/Temporal_database
https://en.wikipedia.org/wiki/Valid_time

In temporal data, records may be updated to reflect changes over time. For example, the data changes from:
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to:
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)

h2. Media

Media production typically involves many changes and updates prior to release. The enhancements will lay a basis for the full media lifecycle to be managed in the Hadoop ecosystem.

h2. Indexes

With these changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr, Elasticsearch, etc. can then in turn leverage Hadoop's enhanced native capabilities.

h2. Google References

While Google's research in this area is interesting (some extracts are listed hereunder), the evolution of Hadoop is interesting in its own right. The proposed in-place-update support in core Hadoop will enable, and make it easier to build, a variety of enhancements in each of the Hadoop components, with the range of influences indicated in this JIRA. We propose a basis for incrementally processing updates to large data sets, reducing the overhead of always having to run large batches. Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches.
|| Year || Title || Links ||
| 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
| 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
| 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
| 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
| 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
| 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
| 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
| 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
| 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
| 2011 | Tenzing: A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
| 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
| 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
| 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |

h2. Application Domains

The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small selection is given hereunder:
• Data warehousing and enhanced ETL processing
• Supply chain planning
• Web sites
• Mobile app stores
• Financials
• Media
• Machine learning
• Social media
• Enterprise applications such as ERP, CRM

Corresponding umbrella JIRAs can be found for each of the Hadoop platform components.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)