Enis Soztutar created HBASE-8369:
------------------------------------

             Summary: MapReduce over snapshot files
                 Key: HBASE-8369
                 URL: https://issues.apache.org/jira/browse/HBASE-8369
             Project: HBase
          Issue Type: New Feature
          Components: mapreduce, snapshots
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 0.98.0, 0.95.2


The idea is to add an InputFormat that can run a MapReduce job directly over 
snapshot files, bypassing the HBase server layer. The InputFormat is similar in 
usage to TableInputFormat: it takes a Scan object from the user, but instead of 
running against an online table, it runs against a table snapshot. We create one 
split per region in the snapshot and open an HRegion inside the RecordReader. A 
RegionScanner is used internally to do the scan without any HRegionServer 
bits. 
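
To illustrate the "one split per region" planning step, here is a minimal 
sketch in plain Java. It is not the proposed implementation and uses no HBase 
classes: SnapshotSplitSketch, RegionRange, overlaps and planSplits are 
hypothetical names invented for this example. It also assumes, beyond what is 
stated above, that regions falling entirely outside the Scan's row range could 
be skipped. Opening the HRegion and RegionScanner inside the RecordReader is 
not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: plan one split per snapshot region that overlaps a scan range. */
public class SnapshotSplitSketch {

    /** Stand-in for a region's key range. An empty array means "unbounded" on that side. */
    public static final class RegionRange {
        public final byte[] startKey; // inclusive
        public final byte[] endKey;   // exclusive

        public RegionRange(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Unsigned lexicographic byte comparison, the ordering HBase uses for row keys. */
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if the region's [startKey, endKey) intersects the scan's [scanStart, scanStop). */
    public static boolean overlaps(RegionRange r, byte[] scanStart, byte[] scanStop) {
        boolean regionEndsAfterScanStart =
                r.endKey.length == 0 || compare(r.endKey, scanStart) > 0;
        boolean regionStartsBeforeScanStop =
                scanStop.length == 0 || compare(r.startKey, scanStop) < 0;
        return regionEndsAfterScanStart && regionStartsBeforeScanStop;
    }

    /** Returns one entry per region to read; each entry would back one input split. */
    public static List<RegionRange> planSplits(
            List<RegionRange> regions, byte[] scanStart, byte[] scanStop) {
        List<RegionRange> splits = new ArrayList<>();
        for (RegionRange r : regions) {
            if (overlaps(r, scanStart, scanStop)) {
                splits.add(r); // assumption: regions outside the scan range are pruned
            }
        }
        return splits;
    }
}
```

For a snapshot with three regions ["", "g"), ["g", "p") and ["p", ""), a scan 
over ["h", "k") would yield a single split for the middle region, while an 
unbounded scan would yield three splits, one per region.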

Users have been asking and searching for ways to run MR jobs by reading 
directly from HFiles, so this enables new use cases where reading stale data 
is acceptable:
 - Take snapshots periodically, and run MR jobs only on the snapshots.
 - Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster 
without an HBase cluster.
 - (Future use case) Combine snapshot data with online HBase data: scan from 
yesterday's snapshot, but read today's data from the online HBase cluster. 


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
