[ 
https://issues.apache.org/jira/browse/HBASE-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795442#comment-13795442
 ] 

Enis Soztutar commented on HBASE-9759:
--------------------------------------

The data was gone, so I was not able to look at the actual data.
Thanks Stack for looking. The way to prevent collisions is something like: 
 - for chainIds generated by each mapper, we mask out the least significant 
bits of the random value so that (chainId % num_total_maps) == unique taskId 
across jobs
 - for row keys within a chain, we mask out the least significant bits of the 
random value so that (rowKey % chain_length) == sort_index
The batch change is there just to reduce the number of RPCs; 100 seemed too low. 
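A minimal sketch of the masking idea above (names and signature are mine, not the actual patch): draw a random long, then overwrite its low bits so the remainder modulo the relevant count is forced to the task's unique slot:

```java
import java.util.Random;

public class ChainIdMasking {
    /**
     * Hypothetical helper illustrating the fix sketched above: replace the
     * least significant bits of a random long so that
     * (result % modulus) == slot. With slot = taskId and
     * modulus = num_total_maps, two mappers can never generate the same
     * chainId; with slot = sort_index and modulus = chain_length, a row key
     * encodes its sort position within the chain.
     */
    static long maskToSlot(long random, long modulus, long slot) {
        long r = random & Long.MAX_VALUE;   // force non-negative
        // (ignores the rare overflow when r is within modulus of Long.MAX_VALUE)
        return r - (r % modulus) + slot;    // now (result % modulus) == slot
    }

    public static void main(String[] args) {
        Random rand = new Random();
        long numTotalMaps = 50, taskId = 7;
        long chainId = maskToSlot(rand.nextLong(), numTotalMaps, taskId);
        System.out.println(chainId % numTotalMaps); // prints 7
    }
}
```

Because every mapper's chainIds now reduce to its own taskId modulo num_total_maps, equal chainIds across mappers are impossible by construction, regardless of what the random source produces.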

> IntegrationTestBulkLoad random number collision
> -----------------------------------------------
>
>                 Key: HBASE-9759
>                 URL: https://issues.apache.org/jira/browse/HBASE-9759
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.96.1
>
>         Attachments: hbase-9759_v1.patch
>
>
> ITBL failed recently in our test harness. Inspecting the failure led me to 
> believe that the only way that particular failure could have happened is 
> that there was a collision in the random longs generated by the test. 
> The test creates 50 mappers by default, and each mapper writes 500K random 
> rows starting with row = 0. By default there are 5 iterations.
> The check job outputs these counters: 
> {code}
> 2013-10-13 07:48:01,134 Map input records=124999751
> 2013-10-13 07:48:01,134 Map output records=124999999
> {code}
> The number of input records seems fine because
> {code}
> 124999751 = 1 + 5 * (0.5M - 1) * 50
> {code}
> 5 = num iterations, 0.5M = num rows, 50 = num mappers, and 1 is for row = 0, 
> which every chain writes to. 
> Output records should be 125M; however, one cell is missing. Since the map 
> input records match the expected number of distinct rows, I suspect that row = 
> 0 had a collision. 
> In one of the generate jobs, we can see that the reducer output count does 
> not match the reducer input count. Given that we are using KVSortReducer, 
> this confirms that there is a collision in KeyValues received by this task.
> {code}
> 2013-10-13 06:48:12,738 Reduce input records=75000000
> 2013-10-13 06:48:12,738 Reduce output records=74999997
> {code}
> The count is off by 3 because we are writing 3 columns per row. 
> My only theory to explain this is that we had a collision in chainIds, or 
> that one of the chains reused row = 0 as its next row. 
> This is similar to HBASE-8700; however, here the probability is much, much 
> lower. 
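For a sense of scale, here is my own back-of-the-envelope birthday bound (not from the issue) on the chance that 125M uniformly random 64-bit keys contain at least one collision:

```java
public class BirthdayBound {
    /**
     * Rough birthday-paradox approximation: the probability that n uniform
     * random 64-bit keys contain at least one duplicate is about
     * 1 - exp(-n^2 / (2 * 2^64)).
     */
    static double collisionProbability(double n) {
        double keySpace = Math.pow(2, 64);
        return -Math.expm1(-n * n / (2 * keySpace)); // == 1 - exp(-x), computed stably
    }

    public static void main(String[] args) {
        // 5 iterations * 50 mappers * 500K rows = 125M keys, as in the test
        System.out.println(collisionProbability(125_000_000d)); // ~= 4.2e-4
    }
}
```

So a collision over the full 125M keys is rare in any single run, but plausible across repeated harness runs, which is consistent with seeing exactly one missing cell once.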



--
This message was sent by Atlassian JIRA
(v6.1#6144)
