[ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012085#comment-14012085 ]
deepankar commented on HBASE-8369:
----------------------------------
In the patch that was merged into trunk (hbase-8369_v9.patch, which is also
what is at the head on GitHub), I think there is a very minor bug (or I am
misunderstanding something). In the function below
{code:java}
public static void setInput(Job job, String snapshotName, Path restoreDir)
throws IOException {
{code}
the snapshot is restored as follows:
{code:java}
restoreDir = new Path(restoreDir, UUID.randomUUID().toString());
// TODO: restore from record readers to parallelize.
RestoreSnapshotHelper.restoreSnapshotForScanner(conf, fs, rootDir, restoreDir,
snapshotName);
conf.set(TABLE_DIR_KEY, restoreDir.toString());
{code}
I think restoreDir here is the root directory of the restored snapshot, not
the table directory, so conf.set should not store restoreDir under
TABLE_DIR_KEY. Instead, it should do the following:
{code:java}
// Configuration.set takes a String value, so the Path is converted explicitly.
conf.set(TABLE_DIR_KEY, FSUtils.getTableDir(rootDir, tableDesc.getTableName()).toString());
{code}
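As a rough illustration of the distinction (a minimal sketch, assuming the 0.96+ layout in which a table directory sits at <root>/data/<namespace>/<table>; tableDirUnder is a hypothetical helper built on plain java.nio paths, not an HBase API, and the directory and table names are made up):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TableDirSketch {
    // Hypothetical helper: mirrors the ASSUMED layout <root>/data/<namespace>/<table>.
    public static Path tableDirUnder(Path root, String namespace, String table) {
        return root.resolve("data").resolve(namespace).resolve(table);
    }

    public static void main(String[] args) {
        // Stand-in for new Path(restoreDir, UUID.randomUUID().toString()).
        Path restoreRoot = Paths.get("/tmp/restore", "some-uuid");
        Path tableDir = tableDirUnder(restoreRoot, "default", "mytable");

        // The restore root and the table directory are different paths,
        // so storing the root under a "table dir" key would be misleading.
        System.out.println("restore root: " + restoreRoot);
        System.out.println("table dir:    " + tableDir);
    }
}
```

If that layout assumption holds, a reader that treats the stored value as a table directory would be looking three levels too high.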
Can somebody correct me if I am wrong?
> MapReduce over snapshot files
> -----------------------------
>
> Key: HBASE-8369
> URL: https://issues.apache.org/jira/browse/HBASE-8369
> Project: HBase
> Issue Type: New Feature
> Components: mapreduce, snapshots
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 0.98.0
>
> Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch,
> HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch,
> HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch,
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch,
> hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch,
> hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job over
> snapshot files directly, bypassing the HBase server layer. The InputFormat is
> similar in usage to TableInputFormat, taking a Scan object from the user, but
> instead of running against an online table, it runs against a table snapshot.
> We create one split per region in the snapshot and open an HRegion inside the
> RecordReader. A RegionScanner is used internally to do the scan without any
> HRegionServer bits.
> Users have been asking and searching for ways to run MR jobs by reading
> directly from HFiles, so this allows new use cases if reading stale data
> is OK:
> - Take snapshots periodically, and run MR jobs only on snapshots.
> - Export snapshots to a remote HDFS cluster, and run the MR jobs on that
> cluster without an HBase cluster.
> - (Future use case) Combine snapshot data with online HBase data: scan from
> yesterday's snapshot, but read today's data from the online HBase cluster.