[ 
https://issues.apache.org/jira/browse/HBASE-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Salyakhov updated HBASE-2378:
------------------------------------

    Attachment: MyHFilesWriter.java
                HFileOutputFormat.java
                my_sample_log_1k.txt

> Bulk insert with multiple reducers
> ----------------------------------
>
>                 Key: HBASE-2378
>                 URL: https://issues.apache.org/jira/browse/HBASE-2378
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.20.3
>            Reporter: Ruslan Salyakhov
>         Attachments: HFileOutputFormat.java, my_sample_log_1k.txt, 
> MyHFilesWriter.java, MyKeyComparator.java, MySampler.java, 
> TestTotalOrderPartitionerForMyKeys.java, TotalOrderPartitioner.java
>
>
> If I run an MR job to prepare HFiles with more than one reducer, some values 
> for keys do not appear in the table after the loadtable.rb script is run. 
> With one reducer everything works fine.
> References:
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
> - the row id must be formatted as an ImmutableBytesWritable
> - MR job should ensure a total ordering among all keys
> MAPREDUCE-366  (patch-5668-3.txt)
> - TotalOrderPartitioner that uses the new API (attached)
> HBASE-2063
> - patched HFileOutputFormat (attached)
> Input data (attached):
> * my_sample_log_1k.txt - sample data, input for MyHFilesWriter
> Source (attached):
> * MyKeyComparator.java - comparator for my ImmutableBytesWritable keys
> * TestTotalOrderPartitionerForMyKeys.java - test case for my keys (note that 
> I've set up MyKeyComparator to pass that test)
> * MyHFilesWriter.java  - My MR job to prepare HFiles
> * HFileOutputFormat.java - patched version from HBASE-2063
> * TotalOrderPartitioner.java - from MAPREDUCE-366
> * MySampler.java - my RandomSampler, based on the Sampler from MAPREDUCE-366, 
> except that I added the following line to the getSample method (without it 
> the sampler doesn't work):
> {code}
>             reader.initialize(splits.get(i),
>                 new TaskAttemptContext(job.getConfiguration(), new TaskAttemptID()));
> {code}
> Test case:
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/01/ -r 1
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/02/ -r 2
> # hbase> create 'tst_hfiles_01', {NAME => 'vals'}
> # hbase> create 'tst_hfiles_02', {NAME => 'vals'}
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_01 /test_hbase/hfiles/01
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_02 /test_hbase/hfiles/02
> # check values for keys
> for example:
> {code}
> hbase(main):006:0* count 'tst_hfiles_01', 100
> Current count: 100, row: 0.14.USA.IL.602.ELMHURST.1.1.0.0
> Current count: 200, row: 0.245.USA.ME.500.PORTLAND.1.1.0.0
> Current count: 300, row: 0.34.USA.FL.Rollup.Rollup.1.1.0.0
> Current count: 400, row: 0.443.USA.CA.803.LOS.ANGELES.1.1.0
> Current count: 500, row: 0.8.USA.CO.751.CASTLE.ROCK.1.1.0
> Current count: 600, row: 1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 700, row: 1.159.SWE.AB.Rollup.Rollup.1.1.0.1
> Current count: 800, row: 1.17.USA.TN.659.CLARKSVILLE.1.1.0.1
> Current count: 900, row: 1.220.USA.MI.505.SOUTHFIELD.1.1.0.1
> 999 row(s) in 0.0930 seconds
> hbase(main):007:0> count 'tst_hfiles_02', 100
> Current count: 100, row: 0.231.USA.GA.524.BUFORD.1.1.0.1
> Current count: 200, row: 0.4.USA.VA.573.Rollup.1.1.0.0
> Current count: 300, row: 0.9.ROU.B.-1.BUCHAREST.1.1.0.0
> Current count: 400, row: 1.16.USA.IA.679.Rollup.1.1.1.0
> Current count: 500, row: 1.245.NOR.03.-1.OSLO.1.1.0.0
> Current count: 600, row: 0.245.GBR.ENG.826005.BEXLEY.1.1.0.1
> Current count: 700, row: 0.48.GBR.ENG.826027.Rollup.1.1.0.1
> Current count: 800, row: 1.14.SWE.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 900, row: 1.201.GBR.ENG.826005.LONDON.1.1.0.1
> 999 row(s) in 0.1630 seconds
> hbase(main):008:0> get 'tst_hfiles_01', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN                       CELL
>  vals:key0                   timestamp=1269542753914, value=0
>  vals:key1                   timestamp=1269542753914, value=14
>  vals:key2                   timestamp=1269542753914, value=USA
>  vals:key3                   timestamp=1269542753914, value=IL
>  vals:key4                   timestamp=1269542753914, value=602
>  vals:key5                   timestamp=1269542753914, value=ELMHURST
>  vals:key6                   timestamp=1269542753914, value=1
>  vals:key7                   timestamp=1269542753914, value=1
>  vals:key8                   timestamp=1269542753914, value=0
>  vals:key9                   timestamp=1269542753914, value=0
>  vals:val0                   timestamp=1269542753914, value=2
> 11 row(s) in 0.0160 seconds
> hbase(main):009:0> get 'tst_hfiles_02', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN                       CELL
> 0 row(s) in 0.0220 seconds
> {code}
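
One thing that can be read off the two count listings above: the sampled rows from tst_hfiles_01 come back in ascending byte order, while those from tst_hfiles_02 do not (1.245.NOR... is followed by 0.245.GBR...). The sketch below makes that check explicit. It is only an illustration and does not by itself identify the cause. The class name ScanOrderCheck and both helper methods are made up for this example, and it assumes row keys are compared as unsigned bytes, left to right.

```java
import java.nio.charset.StandardCharsets;

public class ScanOrderCheck {

    // Unsigned, left-to-right byte comparison (assumed row key order).
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Index of the first row that sorts before its predecessor, or -1 if
    // the rows are in non-decreasing order.
    public static int firstOutOfOrder(String[] rows) {
        for (int i = 1; i < rows.length; i++) {
            byte[] prev = rows[i - 1].getBytes(StandardCharsets.UTF_8);
            byte[] cur = rows[i].getBytes(StandardCharsets.UTF_8);
            if (compareBytes(prev, cur) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // First six sampled rows from the tst_hfiles_02 listing above.
        String[] rows = {
            "0.231.USA.GA.524.BUFORD.1.1.0.1",
            "0.4.USA.VA.573.Rollup.1.1.0.0",
            "0.9.ROU.B.-1.BUCHAREST.1.1.0.0",
            "1.16.USA.IA.679.Rollup.1.1.1.0",
            "1.245.NOR.03.-1.OSLO.1.1.0.0",
            "0.245.GBR.ENG.826005.BEXLEY.1.1.0.1"
        };
        System.out.println("first out-of-order index: " + firstOutOfOrder(rows));
    }
}
```

Running the same check over the rows sampled from tst_hfiles_01 finds no out-of-order pair.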

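For reference, the "total ordering among all keys" requirement quoted in the description means that every key sent to reducer i must sort before every key sent to reducer i+1. A minimal sketch of that idea follows, assuming unsigned byte order for keys and a sorted array of split points. It is not the attached TotalOrderPartitioner; the class name TotalOrderSketch, the helpers, and the split point "1." are hypothetical and chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class TotalOrderSketch {

    // Unsigned, left-to-right byte comparison (assumed key order).
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Partition index for a key, given split points sorted in the same order.
    // With k split points there are k + 1 partitions; the result is the
    // number of split points that are less than or equal to the key.
    public static int partitionFor(byte[] key, byte[][] splits) {
        int lo = 0;
        int hi = splits.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareBytes(key, splits[mid]) >= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static byte[] b(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical single split point, giving two partitions (two reducers).
        byte[][] splits = { b("1.") };
        String[] keys = {
            "0.14.USA.IL.602.ELMHURST.1.1.0.0",
            "1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1",
            "0.245.GBR.ENG.826005.BEXLEY.1.1.0.1"
        };
        for (String k : keys) {
            System.out.println(partitionFor(b(k), splits) + " " + k);
        }
    }
}
```

The property only holds if the split points, the partitioner, and the sort all use the same comparison; if the comparator used for partitioning disagrees with the order in which keys are later stored, adjacent output files can end up with overlapping key ranges.
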
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.