[
https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848481#comment-13848481
]
Enis Soztutar commented on HBASE-8369:
--------------------------------------
bq. Maybe Enis Soztutar can mention the logic on why for some of these kinds of
things?
This is the list of high-level changes in the final version (v11) of the patch
that differ from Bryan's version (trunk-v3):
- ClientScanner / AbstractClientScanner / TableRecordReaderImpl changes: the
ClientSideRegionScanner keeps track of ScanMetrics and exports them via MR
job counters or Scan.
- CellUtil changes: these are at a different place in Bryan's patch.
- PB of MR data
- HDFSBlocksDistribution: in v3, we provide the 3 servers with the highest
locality to the input split. In v11, we use all servers whose locality is at
least 80% of that of the top-locality server. This ensures better locality.
- ClientSideRegionScanner / TableSnapshotScanner: not present in v3.
ClientSideRegionScanner is an internal class that does the scanning. Both
TableSnapshotScanner and TableSnapshotInputFormat use it. TableSnapshotScanner
is a client API for scanning snapshots without MR.
- TableMapReduceUtil changes (other than the new method): needed in case
security is enabled. We should not talk to the HBase cluster at all.
- HRegion changes: the v3 patch sends the parent dir for the region snapshot
by assuming that the table dir is the parent dir of the region dir. We do not
want to make that assumption in trunk.
- RestoreSnapshotHelper / ModifyRegionUtils: code organization
- Other than these: general tests, integration tests, and performance
evaluation tools.
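
To make the HDFSBlocksDistribution item above concrete, here is a minimal
sketch of the two host-selection rules, assuming each host has a numeric
locality weight (for example, bytes of the region's files stored on that
host). The class LocalitySketch and the methods topN and withinCutoff are
hypothetical names for illustration only; this is not the code from either
patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalitySketch {

    // Sort hosts by weight, highest first (hypothetical helper).
    static List<Map.Entry<String, Long>> sortedByWeight(Map<String, Long> weights) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(weights.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return entries;
    }

    // v3-style rule as described above: always take the n best hosts.
    static List<String> topN(Map<String, Long> weights, int n) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Long> e : sortedByWeight(weights)) {
            if (hosts.size() == n) break;
            hosts.add(e.getKey());
        }
        return hosts;
    }

    // v11-style rule as described above: take every host whose weight is at
    // least cutoff (e.g. 0.8) times the weight of the best host.
    static List<String> withinCutoff(Map<String, Long> weights, double cutoff) {
        List<String> hosts = new ArrayList<>();
        List<Map.Entry<String, Long>> entries = sortedByWeight(weights);
        if (entries.isEmpty()) return hosts;
        long top = entries.get(0).getValue();
        for (Map.Entry<String, Long> e : entries) {
            if (e.getValue() < top * cutoff) break; // sorted, so the rest are lower
            hosts.add(e.getKey());
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Made-up weights, purely illustrative.
        Map<String, Long> weights = new LinkedHashMap<>();
        weights.put("host-a", 1000L);
        weights.put("host-b", 900L);
        weights.put("host-c", 300L);
        weights.put("host-d", 850L);
        System.out.println(topN(weights, 3));
        System.out.println(withinCutoff(weights, 0.8));
    }
}
```

With a fixed top-3 rule, a host with poor locality can be handed to the split
when fewer than three hosts hold most of the data, and a fourth host with good
locality is dropped; the relative cutoff avoids both cases.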
For 0.94, we can do a less intrusive patch that folds some of the changes
above into the new classes (like the RestoreSnapshotHelper changes) and drops
others, such as the HRegion changes.
> MapReduce over snapshot files
> -----------------------------
>
> Key: HBASE-8369
> URL: https://issues.apache.org/jira/browse/HBASE-8369
> Project: HBase
> Issue Type: New Feature
> Components: mapreduce, snapshots
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.98.0
>
> Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch,
> HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch,
> HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch,
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch,
> hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch,
> hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat that runs the MapReduce job directly over
> snapshot files, bypassing the HBase server layer. The IF is similar in
> usage to TableInputFormat, taking a Scan object from the user, but instead of
> running from an online table, it runs from a table snapshot. We do one split
> per region in the snapshot, and open an HRegion inside the RecordReader. A
> RegionScanner is used internally for doing the scan without any HRegionServer
> bits.
> Users have been asking and searching for ways to run MR jobs by reading
> directly from hfiles, so this allows new use cases if reading from stale data
> is ok:
> - Take snapshots periodically, and run MR jobs only on snapshots.
> - Export snapshots to remote hdfs cluster, run the MR jobs at that cluster
> without HBase cluster.
> - (Future use case) Combine snapshot data with online hbase data: Scan from
> yesterday's snapshot, but read today's data from online hbase cluster.