Daniel Dai created PARQUET-1947:
-----------------------------------
Summary: DeprecatedParquetInputFormat in CombineFileInputFormat
would produce wrong data
Key: PARQUET-1947
URL: https://issues.apache.org/jira/browse/PARQUET-1947
Project: Parquet
Issue Type: Bug
Components: parquet-cascading
Reporter: Daniel Dai
When we read Parquet files using Cascading 2, we observe wrong data at file
boundaries when input combining is turned on in Cascading (setUseCombinedInput
set to true).
This can be reproduced easily with two Parquet input files, each containing one
record. A simple Cascading application (attached) reads the two inputs with
setUseCombinedInput(true). In the output, the record from the first input file
is duplicated and the record from the second input file is missing.
Here is the call sequence that shows what happens after the last record of the
first input:
1. Cascading invokes DeprecatedParquetInputFormat.createValue(); the value it
gets is again the last record of the first input
2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of the
first input
3. CombineFileRecordReader creates a new
DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new "value"
variable containing the first record of the second input
4. CombineFileRecordReader invokes RecordReader.next on the new
RecordReaderWrapper, but since the firstRecord flag is set, next does nothing
5. Thus the "value" variable containing the first record of the second input is
lost, and Cascading reuses the last record of the first input
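
The sequence above can be imitated without Hadoop or Parquet. The following is
only a minimal sketch under stated assumptions: Container, WrapperSketch,
CombineSketch and CombineReuseSketch are hypothetical stand-ins written for this
illustration, not the real classes. They model just the behaviour described in
the steps: a wrapper that buffers its first record in its own value object, and
a caller that keeps reusing the value object it obtained first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the mutable value holder passed to next().
class Container<V> {
    private V value;
    void set(V v) { value = v; }
    V get() { return value; }
}

// Hypothetical stand-in for RecordReaderWrapper: it reads the first record
// eagerly into its OWN container and remembers that with firstRecord.
class WrapperSketch {
    private final Iterator<String> records;
    private final Container<String> own = new Container<>();
    private boolean firstRecord;
    private boolean eof;

    WrapperSketch(List<String> file) {
        records = file.iterator();
        if (records.hasNext()) {
            own.set(records.next());
            firstRecord = true;
        } else {
            eof = true;
        }
    }

    Container<String> createValue() { return own; }

    boolean next(Container<String> value) {
        if (eof) return false;
        if (firstRecord) {
            // Assumes `value` is the container returned by createValue() of
            // this same wrapper, so nothing is copied into it.
            firstRecord = false;
            return true;
        }
        if (records.hasNext()) {
            value.set(records.next());
            return true;
        }
        eof = true;
        return false;
    }
}

// Hypothetical stand-in for CombineFileRecordReader: chains one wrapper per file.
class CombineSketch {
    private final Iterator<List<String>> files;
    private WrapperSketch current;

    CombineSketch(List<List<String>> files) {
        this.files = files.iterator();
        advance();
    }

    private boolean advance() {
        if (!files.hasNext()) { current = null; return false; }
        current = new WrapperSketch(files.next());
        return true;
    }

    Container<String> createValue() {
        return current != null ? current.createValue() : new Container<>();
    }

    boolean next(Container<String> value) {
        while (current == null || !current.next(value)) {
            if (!advance()) return false;
        }
        return true;
    }
}

class CombineReuseSketch {
    // Models the caller: createValue() once, then reuse that object for every next().
    static List<String> readAll(List<List<String>> files) {
        CombineSketch reader = new CombineSketch(files);
        Container<String> value = reader.createValue();
        List<String> seen = new ArrayList<>();
        while (reader.next(value)) seen.add(value.get());
        return seen;
    }

    public static void main(String[] args) {
        // Two "files" with one record each, as in the reproduction above.
        System.out.println(readAll(List.of(List.of("A"), List.of("B")))); // prints [A, A]
    }
}
```

In this sketch the second wrapper buffers "B" in its own container, but the
caller still holds the first wrapper's container, so "A" is reported twice and
"B" never appears. A single file with several records reads correctly, which
matches the observation that the problem only shows at file boundaries.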