[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889745#comment-13889745
 ] 

Enis Soztutar commented on HBASE-10413:
---------------------------------------

Good work, but I am afraid it is hacky to go around the region server and 
access the files directly, even for an estimate of the split size. The problem 
is that not all of a region's files reside in its own region directory (split 
daughters refer to the parent's files, etc.), and we want to keep the FS layout 
encapsulated in the region / region server layer. 

For a large number of regions, this will also slow down job submission. A 
better way might be to ask the master for the estimated sizes for the Scan 
range of each input split. The region servers periodically send a heartbeat 
containing info about their regions (ServerLoad / RegionLoad). The master could 
then answer that request from the latest known sizes. 
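
To illustrate the idea, here is a minimal sketch in plain Java. The class 
RegionSizeSketch and its methods are hypothetical stand-ins and not HBase API; 
in a real patch the sizes would presumably come from the RegionLoad entries 
the master already receives. It only shows the shape of the approach: cache 
the last reported size per region, then answer size queries from that cache.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not HBase code: models a master-side cache of the
// region sizes most recently reported by region server heartbeats.
final class RegionSizeSketch {

    // Region name -> last reported store file size in bytes (assumed unit).
    private final Map<String, Long> lastReportedBytes = new HashMap<>();

    // Record a size carried by a heartbeat; a newer report replaces an older one.
    void onHeartbeat(String regionName, long storeFileBytes) {
        lastReportedBytes.put(regionName, storeFileBytes);
    }

    // Answer from the latest known size; 0 if the region never reported.
    long estimateBytes(String regionName) {
        return lastReportedBytes.getOrDefault(regionName, 0L);
    }

    // Order regions largest first, which is what sorting splits by length needs.
    List<String> largestFirst(List<String> regionNames) {
        List<String> sorted = new ArrayList<>(regionNames);
        sorted.sort((a, b) -> Long.compare(estimateBytes(b), estimateBytes(a)));
        return sorted;
    }
}
```

A getLength() implementation could then return the cached estimate for the 
region behind the split, without touching the file system at all. 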

> Tablesplit.getLength returns 0
> ------------------------------
>
>                 Key: HBASE-10413
>                 URL: https://issues.apache.org/jira/browse/HBASE-10413
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, mapreduce
>    Affects Versions: 0.96.1.1
>            Reporter: Lukas Nalezenec
>            Assignee: Lukas Nalezenec
>
> InputSplits should be sorted by length but TableSplit does not contain real 
> getLength implementation:
>   @Override
>   public long getLength() {
>     // Not clear how to obtain this... seems to be used only for sorting splits
>     return 0;
>   }
> This is causing us problems with scheduling: we have jobs that are 
> supposed to finish within a limited time, but they often get stuck in the 
> last mapper working on a large region.
> Can we implement this method?
> What is the best way?
> We were thinking about estimating the size from the size of the files on HDFS.
> We would like to get the Scanner from TableSplit, use startRow, stopRow and 
> column families to find the corresponding region, then compute the size on 
> HDFS for the given region and column family. 
> Update:
> This ticket was about a production issue. I talked with the guy who worked 
> on it, and he said our production issue was probably not directly caused by 
> getLength() returning 0. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
