[jira] [Comment Edited] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

Alexey Romanenko (JIRA) Wed, 18 Apr 2018 08:50:29 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442721#comment-16442721
 ]


Alexey Romanenko edited comment on BEAM-3484 at 4/18/18 3:49 PM:
-----------------------------------------------------------------

This issue can easily be reproduced by increasing the number of splits, which 
can be done with this config option:
{code:java}
conf.setInt("mapreduce.job.maps", 16);
{code}
The more splits we have, the higher the chance of hitting this issue.

The root cause is the following: for every split, a separate SQL query is 
created and executed, like:

{code:sql}
SELECT id, name FROM TableName LIMIT Y OFFSET X
{code}

where X and Y depend on the split and the total number of rows in the table.
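
To make the dependence of X and Y on the split concrete, here is a minimal 
sketch. It assumes the total row count is divided evenly across the splits and 
that the last split also takes the remainder; that matches my understanding of 
how the input format computes its splits, but treat it as an assumption. The 
class and method names are hypothetical and are not Beam or Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam or Hadoop: shows how per-split
// LIMIT/OFFSET values could be derived from the row count and split count.
class SplitQueries {

    // Builds one query per split. Assumes rows are divided evenly and the
    // last split also takes the remainder.
    static List<String> buildQueries(long totalRows, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunkSize = totalRows / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunkSize;              // X
            long limit = (i + 1 == numSplits)
                    ? totalRows - offset              // Y for the last split
                    : chunkSize;                      // Y for all other splits
            queries.add("SELECT id, name FROM TableName LIMIT " + limit
                    + " OFFSET " + offset);
        }
        return queries;
    }

    public static void main(String[] args) {
        // 10 rows in 3 splits: (LIMIT 3 OFFSET 0), (LIMIT 3 OFFSET 3), (LIMIT 4 OFFSET 6)
        for (String query : buildQueries(10, 3)) {
            System.out.println(query);
        }
    }
}
```

Each of these queries runs independently, so the ranges line up only if 
every query sees the rows in the same order.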

Since an RDBMS doesn't guarantee the order of results when "ORDER BY" is not 
used, the row ranges returned for different splits can overlap: some rows will 
be duplicated and others will be missed. To avoid this, the SQL query should 
order results by one or several keys, like:

{code:sql}
SELECT id, name FROM TableName ORDER BY id LIMIT Y OFFSET X
{code}
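
The effect can be illustrated with a toy model that uses no real database. 
The sketch below assumes each split's query scans the table in its own order 
(modelled here, purely for determinism, as a rotation of the row list by the 
split index) and then applies OFFSET/LIMIT to that order. All names are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model, not real JDBC: each split's query sees the table in its own
// scan order, then applies OFFSET/LIMIT to that order.
class SplitOverlapDemo {

    // Simulates "SELECT id ... [ORDER BY id] LIMIT limit OFFSET offset".
    // Without ORDER BY the scan order is rotated by scanShift to mimic a
    // database returning rows in a different order for each query.
    static List<Integer> query(List<Integer> table, int offset, int limit,
                               boolean orderById, int scanShift) {
        List<Integer> rows = new ArrayList<>(table);
        if (orderById) {
            Collections.sort(rows);
        } else {
            Collections.rotate(rows, -scanShift);
        }
        return new ArrayList<>(rows.subList(offset, offset + limit));
    }

    // Reads the table as numSplits separate queries and returns all ids
    // that were read, sorted so duplicates and gaps are easy to spot.
    static List<Integer> readAll(int totalRows, int numSplits, boolean orderById) {
        List<Integer> table = new ArrayList<>();
        for (int id = 0; id < totalRows; id++) {
            table.add(id);
        }
        int chunk = totalRows / numSplits;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            int offset = i * chunk;
            int limit = (i + 1 == numSplits) ? totalRows - offset : chunk;
            result.addAll(query(table, offset, limit, orderById, i));
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // prints: without ORDER BY: [0, 0, 1, 2, 3, 5, 6, 7]
        System.out.println("without ORDER BY: " + readAll(8, 2, false));
        // prints: with ORDER BY:    [0, 1, 2, 3, 4, 5, 6, 7]
        System.out.println("with ORDER BY:    " + readAll(8, 2, true));
    }
}
```

In the unordered run, id 0 is read twice and id 4 is never read, which is the 
same kind of duplicate/missing pattern described in this issue. Note that the 
ordering only helps if the key (or combination of keys) is unique; otherwise 
rows with equal keys can still come back in a different order in each query.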




> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Minor
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements 
> from the database in the resulting PCollection, which gives an incorrect read 
> result.
> This occurred to me while developing HadoopInputFormatIOIT and running it on 
> Dataflow. For datasets of 600 000 database rows or fewer I wasn't able to 
> reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000 or 
> 1 000 000. 
> Attachments:
>  - text file with the sorted HadoopInputFormat.read() result, saved using 
> TextIO.write().to().withoutSharding(). If you look carefully, you'll notice 
> duplicates and missing values that should not be there
>  - the same kind of text file for 600 000 records, with no duplicates or 
> missing elements
>  - link to a PR with a HadoopInputFormatIO integration test that makes it 
> possible to reproduce this issue. At the time of writing, this code is not 
> merged yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
