[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dinesh S. Atreya updated HADOOP-12620:
--------------------------------------
    Description: 
h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

One main motivation for this JIRA is to address a comprehensive set of use 
cases with just minimal enhancements to Hadoop, transitioning Hadoop to an 
Advanced/Cloud Data Architecture. 

HDFS has traditionally had a write-once-read-many access model for files, until 
the “[Append to files in HDFS | https://issues.apache.org/jira/browse/HADOOP-1700]” 
capability was introduced. The next minimal enhancements to core Hadoop add the 
capability to do “updates-in-place” in HDFS:
•       Support seeks for writes (in addition to reads).
•       After a seek, if the new byte length is the same as the old byte length, 
an in-place update is allowed.
•       A delete is an update with an appropriate Delete marker.
•       If the byte length is different, the old entry is marked as deleted and 
the new one is appended as before. 
•       It is the client’s discretion to perform update, append, or both; the 
API changes in the different Hadoop components should provide these 
capabilities (see the sketch after the next paragraph).

Please note that this JIRA is limited essentially to a specific type of update: 
in-place updates that do not change the byte length (e.g., buffer spaces are 
included in the length). Updates that change the byte length are not supported 
in-place and are treated as Appends/Inserts. Similarly, Deletes that create 
holes are not supported. The reason is simple: fragmentation and holes incur 
performance penalties, complicate the process, and could involve a lot of 
changes to Hadoop; they are out of scope.
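
A minimal sketch of how a client might drive such an update, assuming a 
hypothetical stream interface; the methods seek(), writeInPlace(), 
markDeleted() and appendEntry() shown here are illustrative and do not exist 
in HDFS today:

{code:java}
import java.io.IOException;

// Illustrative only: HDFS output streams do not support these operations
// today. The sketch assumes an in-place update is accepted only when the
// replacement bytes have exactly the same length as the bytes they replace.
interface InPlaceUpdatable {
  /** Position the write cursor at an absolute byte offset. */
  void seek(long offset) throws IOException;

  /** Overwrite exactly newData.length bytes at the current position. */
  void writeInPlace(byte[] newData) throws IOException;

  /** Stamp a Delete marker over the region [offset, offset + length). */
  void markDeleted(long offset, long length) throws IOException;

  /** Append a new entry at the end of the file, as today. */
  void appendEntry(byte[] newData) throws IOException;
}

// Client-side decision logic corresponding to the bullets above: same byte
// length -> update in place; different byte length -> mark the old entry
// as deleted and append the new entry.
class UpdateClient {
  static void update(InPlaceUpdatable out, long offset,
                     int oldLength, byte[] newData) throws IOException {
    if (newData.length == oldLength) {
      out.seek(offset);
      out.writeInPlace(newData);
    } else {
      out.markDeleted(offset, oldLength);
      out.appendEntry(newData);
    }
  }
}
{code}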

These minimal changes will lay the basis for transforming core Hadoop into an 
interactive and real-time platform and will introduce significant native 
capabilities to Hadoop. The enhancements lay a foundation for all of the 
following processing styles to be supported natively and dynamically:
•       Real time 
•       Mini-batch  
•       Stream based data processing
•       Batch – which is the default now.
Hadoop engines can dynamically choose which processing style to use based on 
the type and volume of the data sets, and enhance/replace prevailing approaches.

With this, Hadoop engines can evolve to utilize modern CPU, memory, and I/O 
resources with increasing efficiency. The Hadoop task engines can use 
vectorized/pipelined processing and make greater use of memory throughout the 
Hadoop platform. 

These will enable enhanced performance optimizations to be implemented in HDFS 
and made available to all the Hadoop components. This will enable fast 
processing of Big Data and enhance all three characteristics of big data: 
volume, velocity, and variety.

There are many influences for this umbrella JIRA:

•       Preserve and accelerate Hadoop
•       Efficient data management of a variety of data formats natively in Hadoop
•       Enterprise expansion 
•       Internet and media 
•       Databases offer native support for a variety of data formats such as 
JSON, XML, indexes, temporal data, etc. – Hadoop should do the same.

It is quite probable that many sub-JIRAs will be created to address portions of 
this. This JIRA captures a variety of use cases in one place. Some initial Data 
Management/Platform use cases are given hereunder.

h2. WEB
With the AHA (Advanced Hadoop Architecture) enhancements, a variety of Web 
standards can be natively supported, such as updateable JSON 
[http://json.org/], XML, RDF, and other documents.

While Hadoop’s origins can be traced to the Web, some of the [Web standards | 
http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html] are not completely 
supported natively in Hadoop, such as HTTP PUT and PATCH (PUT and POST are only 
partially supported, in terms of creation). With the proposed enhancement, all 
of POST, PUT, and PATCH (a newer addition to the Web standards) can be 
completely and natively supported through Hadoop, in addition to GET. 

Hypertext Transfer Protocol -- HTTP/1.1 ([Original RFC | 
http://tools.ietf.org/html/rfc2616], [Current RFC | 
http://tools.ietf.org/html/rfc7231])
Current RFCs:
•       [ Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing | 
http://tools.ietf.org/html/rfc7230]
•       [ Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content | 
http://tools.ietf.org/html/rfc7231 ]
•       [ Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests  | 
http://tools.ietf.org/html/rfc7232 ]
•       [ Hypertext Transfer Protocol (HTTP/1.1): Range Requests | 
http://tools.ietf.org/html/rfc7233 ]
•       [ Hypertext Transfer Protocol (HTTP/1.1): Caching | 
http://tools.ietf.org/html/rfc7234 ]
•       [ Hypertext Transfer Protocol (HTTP/1.1): Authentication | 
http://tools.ietf.org/html/rfc7235 ]

h3. HTTP PATCH RFC

The RFC “[PATCH Method for HTTP | http://tools.ietf.org/html/rfc5789#section-9.1]” 
provides direct support for updates. 

Roy Fielding himself said that [PATCH was something he created for the initial 
HTTP/1.1 proposal because partial PUT is never RESTful | 
https://twitter.com/fielding/status/275471320685367296]. With HTTP PATCH you 
are not transferring a complete representation, but REST does not require 
representations to be complete anyway. 

The PATCH method is not idempotent. With the proposed enhancement, we can now 
formalize the behavior and provide feedback to the Web standard RFC:
•       If the update can be carried out in-place, it is idempotent.
•       If the update produces new data (the first entry is marked as deleted, 
with a corresponding insert/append), it is not idempotent.
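
A small sketch of that classification rule, under the same assumption that 
only equal-length overwrites are carried out in place (illustrative only):

{code:java}
// Hedged sketch: replaying an equal-length in-place overwrite rewrites the
// same bytes, so the outcome is unchanged (idempotent). Replaying a
// delete-marker-plus-append appends another copy (not idempotent).
final class PatchSemantics {
  static boolean isIdempotent(int oldByteLength, int newByteLength) {
    return oldByteLength == newByteLength; // in-place update only
  }
}
{code}
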
h3. JSON

Some RFCs for JSON are given hereunder.
•       [JavaScript Object Notation (JSON) Patch | 
http://tools.ietf.org/html/rfc6902 ]
•       [JSON Merge Patch | https://tools.ietf.org/html/rfc7386 ]


h3. RDF
RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ 
RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/ 
The simplest triple statement is a sequence of (subject, predicate, object) 
terms, separated by whitespace and terminated by '.' after each triple.
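
For illustration, a single N-Triples statement might look like this (the URIs 
are invented for the example):

{noformat}
<http://example.org/JohnDoe> <http://example.org/livesIn> <http://example.org/Smallville> .
{noformat}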

h2. Mobile Apps Data and Resources

With the proposed enhancements, in addition to the Web, app data and resources 
can also be managed using Hadoop. Examples of such usage include app data and 
resources for the Apple App Store and other app stores.

About Apps Resources: 
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html

On-Demand Resources Essentials: 
https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/

Resource Programming Guide: 
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf

h2. Natural Support for ETL and Analytics
With native support for updates and deletes in addition to appends/inserts, 
Hadoop will have proper and natural support for ETL and Analytics.

h2. Key-Value Store
With the proposed enhancements, it will become very convenient to implement a 
Key-Value Store natively in Hadoop, as sketched below.
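
If records are laid out with fixed-width keys and values, every put of an 
existing key is an equal-length overwrite (updatable in place) and every 
delete is a one-byte marker flip, which is exactly the constraint this JIRA 
supports. A hedged sketch of such a layout, with illustrative field widths:

{code:java}
// Hedged sketch of a fixed-width record layout for a native key-value
// store. The field widths are illustrative assumptions; what matters is
// that overwriting a value never changes the record's byte length.
final class FixedWidthRecord {
  static final int KEY_BYTES = 16;    // fixed-length key (padded)
  static final int VALUE_BYTES = 112; // fixed-length value (padded)
  static final byte LIVE = 0;
  static final byte DELETED = 1;      // the Delete marker

  // Record layout: [1-byte marker][16-byte key][112-byte value]
  static final int RECORD_BYTES = 1 + KEY_BYTES + VALUE_BYTES;

  /** Byte offset of record number n within the store file. */
  static long offsetOf(long recordNumber) {
    return recordNumber * RECORD_BYTES;
  }
}
{code}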

h2. MVCC (Multi Version Concurrency Control)

A modified example of how MVCC can be implemented with the proposed 
enhancements, adapted from PostgreSQL’s MVCC, is given hereunder:
https://wiki.postgresql.org/wiki/MVCC 
http://momjian.us/main/writings/pgsql/mvcc.pdf 


|| Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
| 1 | Insert | 40 | MAX_VAL | Conventionally the expiry value is null; to keep the update size constant, MAX_VAL is pre-seeded for our purposes. |
| 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
| 2 | Update (old: delete) | 64 | 78 | Mark the old data as DELETE. |
| 2 | Update (new: insert) | 78 | MAX_VAL | Insert the new data. |
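
A hedged sketch of the version-counter scheme in the table above (the class, 
field, and method names are illustrative):

{code:java}
// Hedged sketch of the version-counter scheme in the table above. The
// expiry counter is pre-seeded with MAX_VAL instead of null so that a
// delete is an equal-length, in-place overwrite of one long field; an
// update is a delete of the old row plus an append of the new row.
final class VersionedRow {
  static final long MAX_VAL = Long.MAX_VALUE;

  final long dataId;
  final long createCounter;      // transaction counter at insert time
  long expiryCounter = MAX_VAL;  // pre-seeded; overwritten in place on delete

  VersionedRow(long dataId, long createCounter) {
    this.dataId = dataId;
    this.createCounter = createCounter;
  }

  /** Is this version visible to a reader running at the given counter? */
  boolean visibleAt(long counter) {
    return createCounter <= counter && counter < expiryCounter;
  }

  /** Delete = in-place overwrite of the expiry counter. */
  void markDeleted(long currentCounter) {
    this.expiryCounter = currentCounter;
  }
}
{code}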


h2. Graph Stores
Enable native storage and processing for a variety of graph stores. 

h3. Graph Store 1 (Spark GraphX)
1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge 
data. Each edge is represented as a tuple consisting of the source vertex id, 
destination vertex id, and user-defined data, as well as a virtual partition 
identifier (pid). Note that the edge table contains only the vertex ids and 
not the vertex data. The edge table is partitioned by the pid.
2. VertexDataTable(id, data): stores the vertex data in the form of (id, data) 
pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids 
of the virtual partitions that contain adjacent edges.
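
Read as plain data structures, the three tables might look like this (a sketch 
in Java; the field types are assumptions made for illustration):

{code:java}
// Hedged sketch of the three GraphX-style tables above as plain Java
// classes; the field types are assumptions, not the GraphX implementation.
final class EdgeRecord {      // EdgeTable(pid, src, dst, data)
  int pid;                    // virtual partition id (partition key)
  long srcId;                 // source vertex id
  long dstId;                 // destination vertex id
  byte[] data;                // user-defined edge data
}

final class VertexRecord {    // VertexDataTable(id, data)
  long id;                    // vertex id (index and partition key)
  byte[] data;                // vertex data
}

final class VertexMapEntry {  // VertexMap(id, pid)
  long id;                    // vertex id
  int pid;                    // one virtual partition holding adjacent edges
}
{code}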

h3. Graph Store 2 (Facebook Social Graph - TAO)

Object: (id) → (otype, (key → value)*)
Assoc.: (id1, atype, id2) → (time, (key → value)*)
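
The same two mappings as assumed Java structures (the byte[] value 
representation is an assumption made for illustration):

{code:java}
import java.util.Map;

// Hedged sketch of the two TAO mappings above; field types are assumptions.
final class TaoObject {       // (id) -> (otype, (key -> value)*)
  long id;
  String otype;
  Map<String, byte[]> fields;
}

final class TaoAssociation {  // (id1, atype, id2) -> (time, (key -> value)*)
  long id1;
  String atype;
  long id2;
  long time;                  // association timestamp
  Map<String, byte[]> fields;
}
{code}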

TAO: Facebook’s Distributed Data Store for the Social Graph 
https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson 
https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf
 
TAO: The power of the graph
https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-thegraph/10151525983993920
 


h2. Temporal Data 
https://en.wikipedia.org/wiki/Temporal_database 
https://en.wikipedia.org/wiki/Valid_time 
In temporal data, records may be updated to reflect changes over time.
For example, the data may change from
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)

h2. Media
Media production typically involves many changes and updates prior to release. 
The enhancements will lay a basis for the full lifecycle to be managed in the 
Hadoop ecosystem. 

h2. Indexes
With these changes, a variety of updatable indexes can be supported natively in 
Hadoop. Search software such as Solr, Elasticsearch, etc. can then in turn 
leverage Hadoop’s enhanced native capabilities. 


h2. Google References

While Google’s research in this area is instructive (some references are 
listed hereunder), Hadoop’s evolution is interesting in its own right. The 
proposed in-place-update support in core Hadoop will enable, and make easier, 
a variety of enhancements in each of the Hadoop components, and has the 
variety of influences indicated in this JIRA.

We propose a basis for a system that incrementally processes updates to large 
data sets, reducing the overhead of always having to run large batches. Hadoop 
engines can dynamically choose the processing style based on the type and 
volume of the data sets, and enhance/replace prevailing approaches.


|| Year || Title || Links ||
| 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
| 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
| 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
| 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
| 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
| 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
| 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
| 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
| 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
| 2011 | Tenzing A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
| 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
| 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
| 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |

h2. Application Domains

The enhancements will lay a path for comprehensive support of all application 
domains in Hadoop. A small collection is given hereunder.

Data Warehousing and Enhanced ETL processing  
Supply Chain Planning
Web Sites 
Mobile App Stores
Financials 
Media 
Machine Learning
Social Media
Enterprise Applications such as ERP, CRM 


Corresponding umbrella JIRAs can be found for each of the following Hadoop 
platform components. 





> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya
>

