[jira] [Created] (PARQUET-1947) DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data

Daniel Dai (Jira) Mon, 30 Nov 2020 10:44:08 -0800

Daniel Dai created PARQUET-1947:
-----------------------------------

             Summary: DeprecatedParquetInputFormat in CombineFileInputFormat 
would produce wrong data
                 Key: PARQUET-1947
                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cascading
            Reporter: Daniel Dai



When we read Parquet files using Cascading 2, we observe wrong data at file 
boundaries when input combining is turned on in Cascading (setUseCombinedInput 
set to true).

This can be reproduced easily with two Parquet input files, each containing one 
record. A simple Cascading application (attached) reads the two inputs with 
setUseCombinedInput(true). The output contains the record from the first input 
file twice, and the record from the second input file is missing.

Here is the call sequence, showing what happens after the last record of the 
first input has been read:
1. Cascading invokes DeprecatedParquetInputFormat.createValue(); the value 
returned is again the last record of the first input
2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of the 
first input
3. CombineFileRecordReader creates a new 
DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new "value" 
variable containing the first record of the second input
4. CombineFileRecordReader invokes RecordReader.next on the new 
RecordReaderWrapper, but since the firstRecord flag is on, next does not do 
anything
5. Thus the "value" variable containing the first record of the second input is 
lost, and Cascading reuses the last record of the first input
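
The sequence above can be sketched as a small stand-alone simulation. It does 
not use the Hadoop, Parquet, or Cascading APIs: Container, Wrapper, and 
readCombined are hypothetical stand-ins for the value holder, 
DeprecatedParquetInputFormat.RecordReaderWrapper, and the 
CombineFileRecordReader/Cascading read loop. It also assumes, as a 
simplification, that the caller obtains its value object once and reuses it 
for every call to next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical simulation of the boundary behaviour; not the real Parquet code.
class CombineBoundaryDemo {

    // Stand-in for the reusable value holder handed to next().
    static final class Container {
        String value;
    }

    // Stand-in for RecordReaderWrapper: it reads its first record eagerly
    // into its own container and raises the firstRecord flag.
    static final class Wrapper {
        private final Iterator<String> records;
        private final Container valueContainer = new Container();
        private boolean firstRecord;
        private boolean eof;

        Wrapper(List<String> file) {
            records = file.iterator();
            if (records.hasNext()) {
                valueContainer.value = records.next();
                firstRecord = true;
            } else {
                eof = true;
            }
        }

        Container createValue() {
            return valueContainer;
        }

        boolean next(Container value) {
            if (eof) {
                return false;
            }
            if (firstRecord) {
                // Step 4: the caller's container is left untouched here.
                firstRecord = false;
                return true;
            }
            if (records.hasNext()) {
                value.value = records.next();
                return true;
            }
            eof = true;
            return false;
        }
    }

    // Stand-in for the combined read loop: one value object is reused
    // across all per-file wrappers.
    static List<String> readCombined(List<List<String>> files) {
        List<String> seen = new ArrayList<>();
        int index = 0;
        Wrapper current = new Wrapper(files.get(index));
        Container value = current.createValue(); // obtained from the first wrapper only
        while (true) {
            if (current.next(value)) {
                seen.add(value.value);
                continue;
            }
            index++;
            if (index >= files.size()) {
                break;
            }
            current = new Wrapper(files.get(index)); // step 3: new wrapper, new container
        }
        return seen;
    }

    public static void main(String[] args) {
        List<List<String>> files =
                Arrays.asList(Arrays.asList("a"), Arrays.asList("b"));
        System.out.println(readCombined(files)); // prints [a, a]
    }
}
```

With two one-record inputs "a" and "b", the simulated loop yields "a" twice and 
never yields "b", which matches the duplicated and missing records described in 
this report.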



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
