[jira] [Commented] (HBASE-8369) MapReduce over snapshot files

deepankar (JIRA) Thu, 29 May 2014 12:48:40 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012769#comment-14012769
 ]


deepankar commented on HBASE-8369:
----------------------------------

Sorry for the confusion. In my comment I meant that the statement should be:
{code:java}
// Configuration.set takes a String value, so convert the Path explicitly
conf.set(TABLE_DIR_KEY,
    FSUtils.getTableDir(restoreDir, tableDesc.getTableName()).toString());
{code}
The corresponding statement in 0.94 is:
{code:java}
Path tableDir = new Path(restoreDir, htd.getNameAsString());
conf.set(TABLE_DIR_KEY, tableDir.toString());
{code}

My concern is that instead of setting TABLE_DIR_KEY to the tableDir under 
restoreDir, you are setting it directly to restoreDir.
The tableDir contains the table's regions and symlinks for the HFiles, 
whereas the restoreDir contains _/data/<namespace>/tableName/_.
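
To make the difference between the two directories concrete, here is a minimal sketch of the layout described above. It uses plain java.nio paths instead of the Hadoop Path class, and the helper {{tableDirUnder}}, the namespace "default" and the table name "t1" are made up for illustration; it is not the HBase code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TableDirSketch {
    // Hypothetical helper mirroring the layout described above:
    // <restoreDir>/data/<namespace>/<tableName>
    static Path tableDirUnder(Path restoreDir, String namespace, String tableName) {
        return restoreDir.resolve("data").resolve(namespace).resolve(tableName);
    }

    public static void main(String[] args) {
        Path restoreDir = Paths.get("/tmp/restore");   // assumed restore location
        Path tableDir = tableDirUnder(restoreDir, "default", "t1");

        // The two values differ: the region directories live under tableDir,
        // not directly under restoreDir.
        System.out.println("restoreDir = " + restoreDir);
        System.out.println("tableDir   = " + tableDir);
    }
}
```

Anything that expects region directories directly below the configured path needs the second value, not the first.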

As far as I understand, setting this parameter incorrectly may not cause any 
real problems, but the weight calculation for HDFS blocks
will go wrong, leading to non-local tasks.

The reason is the following lines in the record reader:
{code:java}
      Path tmpRootDir = new Path(conf.get(TABLE_DIR_KEY)); // This is the user specified root
      // directory where snapshot was restored
{code}
So when creating the record reader, we use this value correctly, i.e. as the 
root directory.

But when it is used to calculate the weights in getInputSplits:

{code:java}
    Path tableDir = new Path(conf.get(TABLE_DIR_KEY));

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (SnapshotRegionManifest regionManifest : regionManifests) {
      // load region descriptor
      HRegionInfo hri = HRegionInfo.convert(regionManifest.getRegionInfo());

      if (CellUtil.overlappingKeys(scan.getStartRow(), scan.getStopRow(),
        hri.getStartKey(), hri.getEndKey())) {
        // compute HDFS locations from snapshot files (which will get the locations for
        // referred hfiles)
        List<String> hosts = getBestLocations(conf,
          HRegion.computeHDFSBlocksDistribution(conf, htd, hri, tableDir));

{code}

we are passing the rootDir from {{conf.get(TABLE_DIR_KEY)}} as the tableDir 
argument of {{HRegion.computeHDFSBlocksDistribution(conf, htd, hri, tableDir)}}. 
This will return an empty **HDFSBlocksDistribution**,
which will lead to scheduling of non-local tasks.
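
A rough way to see why the lookup comes back empty is the sketch below. It is a simplified stand-in on the local file system, not the actual {{computeHDFSBlocksDistribution}} code: the class, the helpers {{storeFiles}} and {{createRestoredLayout}}, and the names "region1", "cf" and "hfile1" are all hypothetical. It assumes the lookup resolves the region directory directly under the path it is given.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class RegionLookupSketch {
    // Hypothetical stand-in for the store-file lookup: lists the files under
    // <tableDir>/<regionName>/<family>, or returns an empty list when that
    // directory does not exist.
    static List<Path> storeFiles(Path tableDir, String regionName, String family) {
        Path familyDir = tableDir.resolve(regionName).resolve(family);
        List<Path> files = new ArrayList<>();
        if (!Files.isDirectory(familyDir)) {
            return files; // nothing found, so there are no block locations to weigh
        }
        try (Stream<Path> entries = Files.list(familyDir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    // Builds <restoreDir>/data/default/t1/region1/cf/hfile1 in a temp directory.
    static Path createRestoredLayout() {
        try {
            Path restoreDir = Files.createTempDirectory("restore");
            Path familyDir = restoreDir.resolve("data").resolve("default")
                .resolve("t1").resolve("region1").resolve("cf");
            Files.createDirectories(familyDir);
            Files.createFile(familyDir.resolve("hfile1"));
            return restoreDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path restoreDir = createRestoredLayout();
        Path tableDir = restoreDir.resolve("data").resolve("default").resolve("t1");

        // Passing restoreDir as if it were the table dir finds no region files.
        System.out.println("with restoreDir: " + storeFiles(restoreDir, "region1", "cf").size());
        // Passing the real table dir finds the store file.
        System.out.println("with tableDir:   " + storeFiles(tableDir, "region1", "cf").size());
    }
}
```

With no files found there are no hosts to rank, which matches the empty distribution and the non-local scheduling described above.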

Again, there might be a mistake in my understanding; if so, please correct me.






> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch, 
> HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch, 
> HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch, 
> HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch, 
> hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch, 
> hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat, which can run the mapreduce job over 
> snapshot files directly bypassing hbase server layer. The IF is similar in 
> usage to TableInputFormat, taking a Scan object from the user, but instead of 
> running from an online table, it runs from a table snapshot. We do one split 
> per region in the snapshot, and open an HRegion inside the RecordReader. A 
> RegionScanner is used internally for doing the scan without any HRegionServer 
> bits. 
> Users have been asking and searching for ways to run MR jobs by reading 
> directly from hfiles, so this allows new use cases if reading from stale data 
> is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to remote hdfs cluster, run the MR jobs at that cluster 
> without HBase cluster.
>  - (Future use case) Combine snapshot data with online hbase data: Scan from 
> yesterday's snapshot, but read today's data from online hbase cluster. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
