[ 
https://issues.apache.org/jira/browse/HBASE-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895729#comment-13895729
 ] 

Lukas Nalezenec commented on HBASE-10413:
-----------------------------------------

Let's make RegionSizeCalculator @InterfaceAudience.Private. Users are not 
expected to directly call this, right?
 - I am not sure. I have no experience with the InterfaceAudience 
annotation. A lot of developers use a heavily customized 
TableInputFormat, and they may want to use this class. I have changed it to 
Private (btw, I was told to change it from Private to Public in a previous code 
review).

Instead of TableSplit.setLength(), you can overload the ctor. TableSplit acts 
like an immutable data bean.
 - That means there will be a ctor with 6 parameters. IMO that is too many, but if you 
really want me to do it I will.
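
To make the trade-off concrete, here is a minimal sketch of the constructor-based variant. The class name ImmutableSplit and its reduced field set are hypothetical and chosen only for illustration; this is not the actual TableSplit signature.

```java
// Hypothetical immutable split bean; the field set is illustrative only
// and does not mirror the real TableSplit constructor.
final class ImmutableSplit {
    private final String tableName;
    private final String startRow;
    private final String endRow;
    private final String location;
    private final long length;

    // The length is supplied once at construction time, so no setter is needed.
    ImmutableSplit(String tableName, String startRow, String endRow,
                   String location, long length) {
        this.tableName = tableName;
        this.startRow = startRow;
        this.endRow = endRow;
        this.location = location;
        this.length = length;
    }

    String getTableName() { return tableName; }
    String getStartRow() { return startRow; }
    String getEndRow() { return endRow; }
    String getLocation() { return location; }
    long getLength() { return length; }
}
```

The parameter list grows with every new field, which is the cost mentioned above, but the object cannot change after it has been created.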

 In some cases, the regions might split or merge concurrently between getting 
the startEndKeys and asking the cluster for the regions. In this case, for that 
range, we might default to 0, but it should be ok I think. We are not just 
estimating the region sizes here.
 - I think it's not worth doing. It will be rare, and the difference will be 
insignificant most of the time.
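
The fallback discussed here can be sketched as a lookup that returns 0 for any region it does not know about, for example one that was split or merged after the sizes were collected. RegionSizeLookup and its String keys are hypothetical names for illustration, not the class from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a region size lookup; not the patch's class.
final class RegionSizeLookup {
    private final Map<String, Long> sizeByRegion = new HashMap<>();

    // Records the estimated size in bytes for one region.
    void put(String regionName, long sizeBytes) {
        sizeByRegion.put(regionName, sizeBytes);
    }

    // Returns the recorded size, or 0 when the region is unknown,
    // e.g. because it split or merged after the sizes were gathered.
    long getRegionSize(String regionName) {
        return sizeByRegion.getOrDefault(regionName, 0L);
    }
}
```

With this shape, a stale region name only costs accuracy for that one range; it does not fail the job.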


> Tablesplit.getLength returns 0
> ------------------------------
>
>                 Key: HBASE-10413
>                 URL: https://issues.apache.org/jira/browse/HBASE-10413
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, mapreduce
>    Affects Versions: 0.96.1.1
>            Reporter: Lukas Nalezenec
>            Assignee: Lukas Nalezenec
>         Attachments: HBASE-10413-2.patch, HBASE-10413-3.patch, 
> HBASE-10413-4.patch, HBASE-10413.patch
>
>
> InputSplits should be sorted by length, but TableSplit does not contain a real 
> getLength implementation:
>   @Override
>   public long getLength() {
>     // Not clear how to obtain this... seems to be used only for sorting splits
>     return 0;
>   }
> This is causing us problems with scheduling: we have jobs that are 
> supposed to finish in a limited time, but they often get stuck in the last 
> mapper, which is working on a large region.
> Can we implement this method? 
> What is the best way?
> We were thinking about estimating the size from the size of the files on HDFS.
> We would like to get the Scanner from the TableSplit, use startRow, stopRow and 
> column families to find the corresponding region, then compute the HDFS size 
> for the given region and column family. 
> Update:
> This ticket was about a production issue. I talked with the guy who worked on 
> this, and he said our production issue was probably not directly caused by 
> getLength() returning 0. 
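
As a rough illustration of the sorting concern in the description above, the sketch below orders splits largest-first by their reported length. SplitOrdering and Split are hypothetical names, and this is not the framework's own sorting code. When every split reports 0, the stable sort leaves the input order unchanged, so a large region can still end up last.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SplitOrdering {
    // Hypothetical minimal split: a name plus an estimated length in bytes.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns a new list ordered largest-first. List.sort is stable,
    // so splits with equal lengths keep their input order.
    static List<Split> largestFirst(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return sorted;
    }
}
```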



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
