[jira] Commented: (HBASE-1778) Improve PerformanceEvaluation

Schubert Zhang (JIRA) Thu, 03 Sep 2009 23:13:22 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751326#action_12751326
 ]


Schubert Zhang commented on HBASE-1778:
---------------------------------------

This is because the old code uses TextInputFormat, which is not correct in 
this case. 
TextInputFormat extends FileInputFormat. 

The old code has the following concept-level and implementation-level 
mistakes in its MapReduce programming.

1. FileInputFormat treats the input file as its input data, and its 
RecordReader reads lines from the input split of this file. Its 
getSplits() method splits the input file based on size in bytes, but the 
old code actually wants to split the input file based on lines, so this 
usage is wrong. 

I think the author of the old code did not understand how 
FileInputFormat/TextInputFormat works. I have implemented a new PeInputFormat 
and PeInputSplit to avoid this misuse and confusion.

2. I have not changed the overall architecture of PerformanceEvaluation, in 
order to keep the current workflow. 
But in my opinion, the current MapReduce implementation of 
PerformanceEvaluation is not a conventional one.

Aside from the splitting mechanism described above, the real input data in 
this case does not come from the input file. The file contains no data, only 
range configuration info. So each row to be inserted into the HBase table 
should come from the RecordReader of the InputFormat, and in this case the 
RecordReader should generate sequential or random rows.
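
The idea of splits that carry row ranges, with a reader that generates the 
rows, can be sketched in plain Java as below. This is only an illustration 
and not the code in the attached patch: RowRangeSplitDemo, RowRange, 
getSplits and sequentialRows are hypothetical names, no Hadoop classes are 
used, and giving the remainder rows to the last split is an assumption of 
this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class RowRangeSplitDemo {

    /** Hypothetical stand-in for an InputSplit that carries a row range, not file bytes. */
    static final class RowRange {
        final int startRow;
        final int rows;

        RowRange(int startRow, int rows) {
            this.startRow = startRow;
            this.rows = rows;
        }
    }

    /** Produces exactly numSplits ranges covering rows [0, totalRows). */
    static List<RowRange> getSplits(int totalRows, int numSplits) {
        if (numSplits <= 0 || totalRows < 0) {
            throw new IllegalArgumentException("need numSplits > 0 and totalRows >= 0");
        }
        int perSplit = totalRows / numSplits;
        List<RowRange> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            int start = i * perSplit;
            // Assumption of this sketch: the last split absorbs the remainder rows.
            int rows = (i == numSplits - 1) ? totalRows - start : perSplit;
            splits.add(new RowRange(start, rows));
        }
        return splits;
    }

    /** Stand-in for a RecordReader: generates the row numbers instead of reading them. */
    static Iterator<Integer> sequentialRows(final RowRange range) {
        return new Iterator<Integer>() {
            private int next = range.startRow;

            @Override
            public boolean hasNext() {
                return next < range.startRow + range.rows;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    public static void main(String[] args) {
        List<RowRange> splits = getSplits(4194304, 40);
        System.out.println("splits=" + splits.size());
        for (RowRange r : splits.subList(0, 3)) {
            System.out.println("startRow=" + r.startRow + " rows=" + r.rows);
        }
    }
}
```

Because the number of splits is computed from the requested map count and 
not from file size, asking for 40 maps over 4194304 rows always yields 
exactly 40 ranges. A random-row reader could replace sequentialRows without 
changing how the splits are computed.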

In my tests, the old code sometimes gave the wrong number of maps. For 
example, I set the number of maps to 40 (4 clients * 10) but got 41 splits 
(maps). This is caused by the wrong split method, which splits the file 
based on bytes. After adding debug logging in getSplits(), I got the 
following incorrectly split maps:

split0: startRow=0 perClientRunRows=104857 totalRows=4194304
startRow=1048576 perClientRunRows=104857 totalRows=4194304

split1: startRow=2097152 perClientRunRows=104857 totalRows=4194304

split2: startRow=3145728 perClientRunRows=104857 totalRows=4194304

split3: startRow=4194304 perClientRunRows=104857 totalRows=4194304

split4: startRow=5242880 perClientRunRows=104857 totalRows=4194304

split5: startRow=6291456 perClientRunRows=104857 totalRows=4194304

split6: startRow=7340032 perClientRunRows=104857 totalRows=4194304

split7: startRow=8388608 perClientRunRows=104857 totalRows=4194304

split8: startRow=9437184 perClientRunRows=104857 totalRows=4194304

split9: startRow=10485760 perClientRunRows=104857 totalRows=4194304

split10: startRow=11534336 perClientRunRows=104857 totalRows=4194304

split11: startRow=12582912 perClientRunRows=104857 totalRows=4194304

split12: startRow=13631488 perClientRunRows=104857 totalRows=4194304

split13: startRow=14680064 perClientRunRows=104857 totalRows=4194304

split14: startRow=15728640 perClientRunRows=104857 totalRows=4194304

split15: null

split16: startRow=16777216 perClientRunRows=104857 totalRows=4194304
......
(many other splits are omitted here...)

As we can see, some splits include two row ranges, and some splits contain 
nothing.
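
The effect can be reproduced with a small stand-alone simulation. This is a 
simplified model and not Hadoop's actual code: ByteSplitDemo and 
assignLinesToSplits are hypothetical names, the sample lines are made up, 
the split size is assumed to be simply total bytes divided by the requested 
split count, and a line is assumed to belong to the split that contains its 
first byte. The real FileInputFormat also takes block size and minimum split 
size into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteSplitDemo {

    /** Cuts the "file" into byte ranges and reports which lines land in each range. */
    static List<List<String>> assignLinesToSplits(List<String> lines, int requestedSplits) {
        if (requestedSplits <= 0) {
            throw new IllegalArgumentException("requestedSplits must be positive");
        }
        long totalBytes = 0;
        for (String line : lines) {
            totalBytes += line.length() + 1; // +1 for '\n'; assumes one byte per char
        }
        // Byte-based goal size; integer division drops the remainder,
        // so the leftover bytes can form an extra split.
        long splitSize = Math.max(1, totalBytes / requestedSplits);
        int numSplits = (int) ((totalBytes + splitSize - 1) / splitSize);

        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<String>());
        }
        long offset = 0;
        for (String line : lines) {
            // Simplified ownership rule: the split containing the line's first byte.
            splits.get((int) (offset / splitSize)).add(line);
            offset += line.length() + 1;
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "startRow=0", "startRow=1048576", "startRow=2097152", "startRow=3145728");
        List<List<String>> splits = assignLinesToSplits(lines, 4);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split" + i + ": " + splits.get(i));
        }
    }
}
```

With these four lines of unequal length and four requested splits, the model 
produces five byte ranges: the first receives two lines and two of the others 
receive none, which is the same pattern as in the log above.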





> Improve PerformanceEvaluation
> -----------------------------
>
>                 Key: HBASE-1778
>                 URL: https://issues.apache.org/jira/browse/HBASE-1778
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBase-0.20.0-PE.pdf, HBASE-1778.patch
>
>
> The current PerformanceEvaluation class has two problems:
> - It is not updated for hadoop-0.20.0. 
> - The approach to splitting maps is not strict. It needs correct 
> InputSplit and InputFormat classes. The current code uses TextInputFormat 
> and FileSplit, which is not reasonable.
> We will fix these problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
