[ https://issues.apache.org/jira/browse/HBASE-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751326#action_12751326 ]
Schubert Zhang commented on HBASE-1778:
---------------------------------------

This happens because the old code uses TextInputFormat, which is not correct in this case. TextInputFormat extends FileInputFormat. The old code makes the following mistakes, at the concept level or the implementation level, for MapReduce programming.

1. FileInputFormat treats the input file as its input data, and its RecordReader reads lines from an input split of that file. Its getSplits() method splits the input file based on size in bytes, but the old code actually wants to split the input file based on lines, so this is a wrong usage. I think the original author did not understand how FileInputFormat/TextInputFormat works. I have implemented a new PeInputFormat and PeInputSplit to avoid this misuse and confusion.

2. I have not changed the overall architecture of PerformanceEvaluation, in order to keep the current workflow. In my opinion, though, the current MapReduce implementation of PerformanceEvaluation is not a regular one. Aside from the splitting mechanism described above, the real input data in this case does not come from the input file: the file holds no data, only range configuration info. So each row to be inserted into the HBase table should come from the RecordReader of the InputFormat, and in this case the RecordReader should generate sequential or random rows.

In my test, we found that the old code sometimes gives the wrong number of maps. For example, I set the number of maps to 40 (4 clients * 10), but I got 41 splits (maps). This is caused by the wrong split method, which splits the file based on bytes.
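To illustrate the byte-versus-line mismatch described above, here is a minimal stand-alone sketch. It is not the code from the patch and does not use the Hadoop API: the class SplitSketch and its methods linesPerByteSplit and linesPerRecordSplit are hypothetical names, and the rule "a line belongs to the split that contains its first byte" is a simplifying assumption about how a line-oriented reader assigns lines to byte ranges.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from HBase or Hadoop.
final class SplitSketch {

    // Byte-based splitting: cut the file into numSplits equal byte ranges and
    // count how many lines start inside each range (simplified assignment rule).
    static int[] linesPerByteSplit(List<String> lines, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        long totalBytes = 0;
        for (String line : lines) {
            totalBytes += line.length() + 1; // +1 for the trailing newline
        }
        long splitSize = (totalBytes + numSplits - 1) / numSplits; // ceiling
        int[] counts = new int[numSplits];
        long offset = 0;
        for (String line : lines) {
            counts[(int) (offset / splitSize)]++;
            offset += line.length() + 1;
        }
        return counts;
    }

    // Record-based splitting: one split per configuration line, so every
    // split holds exactly one row range regardless of line length.
    static int[] linesPerRecordSplit(List<String> lines) {
        int[] counts = new int[lines.size()];
        Arrays.fill(counts, 1);
        return counts;
    }

    public static void main(String[] args) {
        // Lines of unequal length, like start rows with different digit counts.
        List<String> lines = Arrays.asList("0", "1000", "2000", "3000");
        System.out.println(Arrays.toString(linesPerByteSplit(lines, 4)));
        System.out.println(Arrays.toString(linesPerRecordSplit(lines)));
    }
}
```

With these four lines of unequal length, the byte-based count comes out as [2, 1, 1, 0]: one split gets two lines and one gets none, while the record-based count is always one line per split.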
After adding a debug log in getSplits(), I got the following wrongly split maps:

split0: startRow=0 perClientRunRows=104857 totalRows=4194304
        startRow=1048576 perClientRunRows=104857 totalRows=4194304
split1: startRow=2097152 perClientRunRows=104857 totalRows=4194304
split2: startRow=3145728 perClientRunRows=104857 totalRows=4194304
split3: startRow=4194304 perClientRunRows=104857 totalRows=4194304
split4: startRow=5242880 perClientRunRows=104857 totalRows=4194304
split5: startRow=6291456 perClientRunRows=104857 totalRows=4194304
split6: startRow=7340032 perClientRunRows=104857 totalRows=4194304
split7: startRow=8388608 perClientRunRows=104857 totalRows=4194304
split8: startRow=9437184 perClientRunRows=104857 totalRows=4194304
split9: startRow=10485760 perClientRunRows=104857 totalRows=4194304
split10: startRow=11534336 perClientRunRows=104857 totalRows=4194304
split11: startRow=12582912 perClientRunRows=104857 totalRows=4194304
split12: startRow=13631488 perClientRunRows=104857 totalRows=4194304
split13: startRow=14680064 perClientRunRows=104857 totalRows=4194304
split14: startRow=15728640 perClientRunRows=104857 totalRows=4194304
split15: null
split16: startRow=16777216 perClientRunRows=104857 totalRows=4194304
...... (many other splits are omitted here...)

As we can see, some splits include two row ranges, and some splits have nothing.

> Improve PerformanceEvaluation
> -----------------------------
>
>                 Key: HBASE-1778
>                 URL: https://issues.apache.org/jira/browse/HBASE-1778
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.20.0
>            Reporter: Schubert Zhang
>            Assignee: Schubert Zhang
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBase-0.20.0-PE.pdf, HBASE-1778.patch
>
>
> Current PerformanceEvaluation class have two problems:
> - It is not updated for hadoop-0.20.0.
> - The approach to split maps is not strict. Need to provide correct
>   InputSplit and InputFormat classes. Current code uses TextInputFormat and
>   FileSplit, it is not reasonable.
> We will fix these problems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.