[jira] [Updated] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

JIRA Tue, 16 Jan 2018 06:29:31 -0800

     [ 
https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Łukasz Gajowy updated BEAM-3484:
--------------------------------
    Description: 
For big datasets, HadoopInputFormat sometimes skips or duplicates elements 
from the database in the resulting PCollection, so the read result is 
incorrect.

I hit this while developing HadoopInputFormatIOIT and running it on Dataflow. 
For datasets of 600 000 database rows or fewer I wasn't able to reproduce the 
issue. The bug appeared only for bigger sets, e.g. 700 000 or 1 000 000 rows.

Attachments:
 - text file with the sorted HadoopInputFormat.read() result, saved using 
TextIO.write().to().withoutSharding(). If you look carefully, you'll notice 
duplicates and missing values that should not be there
 - the same kind of text file for 600 000 records, which has no duplicates or 
missing elements
 - link to a PR with the HadoopInputFormatIO integration test that makes it 
possible to reproduce this issue. At the time of writing, this code is not 
merged yet.

  was:
For big datasets HadoopInputFormat sometimes skips/duplicates elements from 
database in resulting PCollection. This results in incorrect read result.

Occurred to me while developing HadoopInputFormatIOIT and running it on 
dataflow. For datasets smaller or equal to 600 000 database rows I wasn't able 
to reproduce the issue. Bug appeared only for bigger sets, eg. 700 000, 1 000 
000. 

Attachments:
 - text file with sorted HadoopInputFormat.read() result saved using 
TextIO.write().to().withoutSharding(). If you look carefully you'll notice 
duplicates or missing values that should not happen

 - same text file for 600 000 records not having any duplicates and missing 
elements
- link to a PR with HadoopInputFormatIO integration test that allows to 
reproduce this issue. At the moment of writing, this code is not merged yet.


> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model, runner-dataflow
>            Reporter: Łukasz Gajowy
>            Assignee: Kenneth Knowles
>            Priority: Major
>         Attachments: result_sorted1000000, result_sorted600000
>
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements 
> from the database in the resulting PCollection, so the read result is 
> incorrect.
> I hit this while developing HadoopInputFormatIOIT and running it on 
> Dataflow. For datasets of 600 000 database rows or fewer I wasn't able to 
> reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000 or 
> 1 000 000 rows.
> Attachments:
>  - text file with the sorted HadoopInputFormat.read() result, saved using 
> TextIO.write().to().withoutSharding(). If you look carefully, you'll notice 
> duplicates and missing values that should not be there
>  - the same kind of text file for 600 000 records, which has no duplicates 
> or missing elements
>  - link to a PR with the HadoopInputFormatIO integration test that makes it 
> possible to reproduce this issue. At the time of writing, this code is not 
> merged yet.
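
The duplicates and missing values mentioned above do not have to be found by 
eye. The following is a minimal sketch of such a check, not part of the issue 
or of Beam: it assumes the result file holds one record per line and that the 
full set of expected records is known. The function name 
find_duplicates_and_missing and the "rowN" record format are hypothetical; 
the real layout of the attachments is not described here.

```python
from collections import Counter
from typing import Iterable


def find_duplicates_and_missing(lines: Iterable[str], expected: Iterable[str]):
    """Return (duplicates, missing) for the records in `lines`.

    duplicates maps each record seen more than once to its count;
    missing is the sorted list of expected records that never appeared.
    """
    counts = Counter(line.rstrip("\n") for line in lines)
    duplicates = {rec: n for rec, n in counts.items() if n > 1}
    missing = sorted(set(expected) - set(counts))
    return duplicates, missing


# Hypothetical sample data: "row2" is missing and "row1" appears twice.
expected = [f"row{i}" for i in range(5)]
sample = ["row0\n", "row1\n", "row1\n", "row3\n", "row4\n"]
duplicates, missing = find_duplicates_and_missing(sample, expected)
print(duplicates)  # {'row1': 2}
print(missing)     # ['row2']
```

To run it against a real result file, pass an open file object as `lines` 
(for example one of the attachments, saved locally) and build `expected` from 
whatever generated the database rows. The check does not depend on the file 
being sorted.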



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
