[
https://issues.apache.org/jira/browse/HBASE-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruslan Salyakhov updated HBASE-2378:
------------------------------------
Attachment: MyHFilesWriter.java
HFileOutputFormat.java
my_sample_log_1k.txt
> Bulk insert with multiple reducers
> ----------------------------------
>
> Key: HBASE-2378
> URL: https://issues.apache.org/jira/browse/HBASE-2378
> Project: Hadoop HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 0.20.3
> Reporter: Ruslan Salyakhov
> Attachments: HFileOutputFormat.java, my_sample_log_1k.txt,
> MyHFilesWriter.java, MyKeyComparator.java, MySampler.java,
> TestTotalOrderPartitionerForMyKeys.java, TotalOrderPartitioner.java
>
>
> If I run an MR job to prepare HFiles with more than one reducer, some values
> for keys do not appear in the table after running the loadtable.rb script.
> With one reducer everything works fine.
> References:
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
> - the row id must be formatted as an ImmutableBytesWritable
> - the MR job should ensure a total ordering among all keys
> MAPREDUCE-366 (patch-5668-3.txt)
> - TotalOrderPartitioner that uses the new API (attached)
> HBASE-2063
> - patched HFileOutputFormat (attached)
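Inline note on the total-ordering requirement quoted above: the following is a minimal sketch, not the attached TotalOrderPartitioner, of how a key might be mapped to a reducer from a sorted list of split points. The class name TotalOrderSketch, the helper names, the split point "1." and the use of plain unsigned byte order are all assumptions made for illustration. The job in this report uses MyKeyComparator, whose ordering may differ.

```java
import java.nio.charset.StandardCharsets;

final class TotalOrderSketch {

    // Unsigned lexicographic comparison of raw key bytes (assumed key order).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Returns the number of split points that are <= key. With sorted split
    // points this is the reducer index, so reducer i only receives keys that
    // sort before every key sent to reducer i + 1.
    static int partitionFor(byte[] key, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareBytes(splitPoints[mid], key) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical single split point for a two-reducer job.
        byte[][] splits = { utf8("1.") };
        // Row keys copied from the shell output in this report.
        String[] rows = {
            "0.14.USA.IL.602.ELMHURST.1.1.0.0",
            "0.9.ROU.B.-1.BUCHAREST.1.1.0.0",
            "1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1",
            "1.245.NOR.03.-1.OSLO.1.1.0.0"
        };
        for (String row : rows) {
            System.out.println(partitionFor(utf8(row), splits) + "  " + row);
        }
    }
}
```

When keys are routed this way, each reducer writes HFiles for a key range that does not overlap any other reducer's range, which is the property the bulk-load documentation asks for.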
> Input data (attached):
> * my_sample_log_1k.txt - sample data, input for MyHFilesWriter
> Source (attached):
> * MyKeyComparator.java - comparator for my ImmutableBytesWritable keys
> * TestTotalOrderPartitionerForMyKeys.java - test case for my keys (note that
> I've set up MyKeyComparator to pass that test)
> * MyHFilesWriter.java - My MR job to prepare HFiles
> * HFileOutputFormat.java - from MAPREDUCE-366
> * TotalOrderPartitioner.java - from MAPREDUCE-366
> * MySampler.java - My RandomSampler based on Sampler from MAPREDUCE-366, BUT
> I added the following line to the getSample method (without that line it
> doesn't work):
> {code}
> reader.initialize(splits.get(i), new
> TaskAttemptContext(job.getConfiguration(), new TaskAttemptID()));
> {code}
> Test case:
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in
> /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/01/ -r 1
> # hadoop jar keyvalue-poc.jar MyHFilesWriter -in
> /test_hbase/my_sample_log_1k.txt -out /test_hbase/hfiles/02/ -r 2
> # hbase> create 'tst_hfiles_01', {NAME => 'vals'}
> # hbase> create 'tst_hfiles_02', {NAME => 'vals'}
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_01
> /test_hbase/hfiles/01
> # hbase org.jruby.Main /usr/lib/hbase-0.20/bin/loadtable.rb tst_hfiles_02
> /test_hbase/hfiles/02
> # check values for keys
> for example:
> {code}
> hbase(main):006:0* count 'tst_hfiles_01', 100
> Current count: 100, row: 0.14.USA.IL.602.ELMHURST.1.1.0.0
> Current count: 200, row: 0.245.USA.ME.500.PORTLAND.1.1.0.0
> Current count: 300, row: 0.34.USA.FL.Rollup.Rollup.1.1.0.0
> Current count: 400, row: 0.443.USA.CA.803.LOS.ANGELES.1.1.0
> Current count: 500, row: 0.8.USA.CO.751.CASTLE.ROCK.1.1.0
> Current count: 600, row: 1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 700, row: 1.159.SWE.AB.Rollup.Rollup.1.1.0.1
> Current count: 800, row: 1.17.USA.TN.659.CLARKSVILLE.1.1.0.1
> Current count: 900, row: 1.220.USA.MI.505.SOUTHFIELD.1.1.0.1
> 999 row(s) in 0.0930 seconds
> hbase(main):007:0> count 'tst_hfiles_02', 100
> Current count: 100, row: 0.231.USA.GA.524.BUFORD.1.1.0.1
> Current count: 200, row: 0.4.USA.VA.573.Rollup.1.1.0.0
> Current count: 300, row: 0.9.ROU.B.-1.BUCHAREST.1.1.0.0
> Current count: 400, row: 1.16.USA.IA.679.Rollup.1.1.1.0
> Current count: 500, row: 1.245.NOR.03.-1.OSLO.1.1.0.0
> Current count: 600, row: 0.245.GBR.ENG.826005.BEXLEY.1.1.0.1
> Current count: 700, row: 0.48.GBR.ENG.826027.Rollup.1.1.0.1
> Current count: 800, row: 1.14.SWE.Rollup.Rollup.Rollup.1.1.0.1
> Current count: 900, row: 1.201.GBR.ENG.826005.LONDON.1.1.0.1
> 999 row(s) in 0.1630 seconds
> hbase(main):008:0> get 'tst_hfiles_01', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN CELL
> vals:key0 timestamp=1269542753914, value=0
> vals:key1 timestamp=1269542753914, value=14
> vals:key2 timestamp=1269542753914, value=USA
> vals:key3 timestamp=1269542753914, value=IL
> vals:key4 timestamp=1269542753914, value=602
> vals:key5 timestamp=1269542753914, value=ELMHURST
> vals:key6 timestamp=1269542753914, value=1
> vals:key7 timestamp=1269542753914, value=1
> vals:key8 timestamp=1269542753914, value=0
> vals:key9 timestamp=1269542753914, value=0
> vals:val0 timestamp=1269542753914, value=2
> 11 row(s) in 0.0160 seconds
> hbase(main):009:0> get 'tst_hfiles_02', '0.14.USA.IL.602.ELMHURST.1.1.0.0'
> COLUMN CELL
> 0 row(s) in 0.0220 seconds
> {code}
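A note on the shell output above: the sampled rows printed by count for tst_hfiles_01 ascend in byte order, while those for tst_hfiles_02 do not (1.245.NOR... is followed by 0.245.GBR...). The sketch below checks this mechanically. It is only a reading of the nine sampled rows per table, not a diagnosis of the cause. The class name RowOrderCheck, its helpers, and the use of unsigned byte order as the table's row order are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

final class RowOrderCheck {

    // Sampled rows copied from "count 'tst_hfiles_01', 100" in this report.
    static final String[] ONE_REDUCER = {
        "0.14.USA.IL.602.ELMHURST.1.1.0.0",
        "0.245.USA.ME.500.PORTLAND.1.1.0.0",
        "0.34.USA.FL.Rollup.Rollup.1.1.0.0",
        "0.443.USA.CA.803.LOS.ANGELES.1.1.0",
        "0.8.USA.CO.751.CASTLE.ROCK.1.1.0",
        "1.14.DZA.Rollup.Rollup.Rollup.1.1.0.1",
        "1.159.SWE.AB.Rollup.Rollup.1.1.0.1",
        "1.17.USA.TN.659.CLARKSVILLE.1.1.0.1",
        "1.220.USA.MI.505.SOUTHFIELD.1.1.0.1"
    };

    // Sampled rows copied from "count 'tst_hfiles_02', 100" in this report.
    static final String[] TWO_REDUCERS = {
        "0.231.USA.GA.524.BUFORD.1.1.0.1",
        "0.4.USA.VA.573.Rollup.1.1.0.0",
        "0.9.ROU.B.-1.BUCHAREST.1.1.0.0",
        "1.16.USA.IA.679.Rollup.1.1.1.0",
        "1.245.NOR.03.-1.OSLO.1.1.0.0",
        "0.245.GBR.ENG.826005.BEXLEY.1.1.0.1",
        "0.48.GBR.ENG.826027.Rollup.1.1.0.1",
        "1.14.SWE.Rollup.Rollup.Rollup.1.1.0.1",
        "1.201.GBR.ENG.826005.LONDON.1.1.0.1"
    };

    // Unsigned lexicographic comparison of raw bytes (assumed row order).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Index of the first row that sorts before its predecessor, or -1 if the
    // rows are in ascending order.
    static int firstInversion(String[] rows) {
        for (int i = 1; i < rows.length; i++) {
            byte[] prev = rows[i - 1].getBytes(StandardCharsets.UTF_8);
            byte[] cur = rows[i].getBytes(StandardCharsets.UTF_8);
            if (compareBytes(prev, cur) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("tst_hfiles_01: " + firstInversion(ONE_REDUCER));  // -1
        System.out.println("tst_hfiles_02: " + firstInversion(TWO_REDUCERS)); // 5
    }
}
```

An inversion in scan order would be consistent with the two reducers having written overlapping key ranges, which in turn could explain why a get for a key that exists in the input returns nothing. That connection is a hypothesis based on this output only.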