Łukasz Gajowy created BEAM-3484:
-----------------------------------
Summary: HadoopInputFormatIO reads big datasets incorrectly
Key: BEAM-3484
URL: https://issues.apache.org/jira/browse/BEAM-3484
Project: Beam
Issue Type: Bug
Components: beam-model, runner-dataflow
Reporter: Łukasz Gajowy
Assignee: Kenneth Knowles
For big datasets, HadoopInputFormat sometimes skips or duplicates elements
from the database in the resulting PCollection, so the read result is
incorrect. I ran into this while developing HadoopInputFormatIOIT and running
it on Dataflow. I wasn't able to reproduce the issue for datasets of 600 000
database rows or fewer. The bug appeared only for bigger sets, e.g. 700 000
or 1 000 000 rows.
Attachments:
- a text file with the sorted HadoopInputFormat.read() result, saved using
TextIO.write().to().withoutSharding(). If you look carefully, you'll notice
duplicated and missing values, which should not happen
- the same kind of text file for 600 000 records, with no duplicates or
missing elements
- a link to a PR with the HadoopInputFormatIO integration test that makes it
possible to reproduce this issue. At the time of writing, this code has not
been merged yet.
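
Rather than scanning the attached files by eye, the duplicates and gaps could
be found with a small script. The following is only a sketch and not part of
the integration test: the function name find_read_errors is hypothetical, and
it assumes the rows written to the database and the lines of the output file
can each be loaded as a list of strings.

```python
from collections import Counter


def find_read_errors(expected_rows, actual_rows):
    """Compare the rows that were read against the rows that were written.

    Hypothetical helper: returns (missing, duplicated), where `missing` holds
    rows that show up fewer times than expected and `duplicated` holds rows
    that show up more times than expected.
    """
    expected = Counter(expected_rows)
    actual = Counter(actual_rows)
    # Counter subtraction keeps only positive counts, so each difference
    # contains exactly the surplus on its left-hand side.
    missing = sorted((expected - actual).elements())
    duplicated = sorted((actual - expected).elements())
    return missing, duplicated


if __name__ == "__main__":
    # Made-up sample data, not taken from the attached files.
    expected = [f"row-{i}" for i in range(5)]
    actual = ["row-0", "row-1", "row-1", "row-3", "row-4"]
    missing, duplicated = find_read_errors(expected, actual)
    print("missing:", missing)        # missing: ['row-2']
    print("duplicated:", duplicated)  # duplicated: ['row-1']
```

A correct read returns two empty lists. Comparing counts instead of sets
means a row that was read twice is still reported even though it is present
in both inputs.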