[
https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Dai updated PARQUET-1947:
--------------------------------
Attachment: Part1.java
> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong
> data
> -------------------------------------------------------------------------------
>
> Key: PARQUET-1947
> URL: https://issues.apache.org/jira/browse/PARQUET-1947
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cascading
> Reporter: Daniel Dai
> Priority: Major
> Attachments: Part1.java
>
>
> When we read Parquet files using Cascading 2, we observe wrong data at the
> file boundary when input combining is turned on in Cascading
> (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing
> one record. A simple Cascading application (attached) reads the two inputs
> with setUseCombinedInput(true). The output contains the record from the first
> input file twice, and the record from the second input file is missing.
> Here is the call sequence showing what happens after the last record of the
> first input:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which
> returns the last record of the first input again
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of
> the first input
> 3. CombineFileRecordReader creates a new
> DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new
> "value" variable containing the first record of the second input
> 4. CombineFileRecordReader invokes RecordReader.next on the new
> RecordReaderWrapper, but since the firstRecord flag is set, next does
> nothing
> 5. Thus the "value" variable containing the first record of the second input
> is lost, and Cascading reuses the last record of the first input
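
The call sequence above can be modelled without Hadoop, Parquet, or Cascading on the classpath. The sketch below is a simplified toy, not the parquet-mr or Hadoop source: Container, ToyReader, ToyCombineReader, and readAll are hypothetical stand-ins for the reusable "value" object, RecordReaderWrapper, CombineFileRecordReader, and the Cascading read loop. It assumes the per-file reader pre-reads its first record into its own container and that the caller creates its value object once and reuses it, as the steps describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CombineBoundaryDemo {
    /** Mutable holder standing in for the reusable "value" object. */
    static final class Container {
        private String record;
        String get() { return record; }
        void set(String record) { this.record = record; }
    }

    /** Toy per-file reader (hypothetical): pre-reads its first record. */
    static final class ToyReader {
        private final Iterator<String> records;
        private final Container valueContainer = new Container();
        private boolean firstRecord;
        private boolean eof;

        ToyReader(List<String> file) {
            records = file.iterator();
            if (records.hasNext()) {
                // Eager read into this reader's own container.
                valueContainer.set(records.next());
                firstRecord = true;
            } else {
                eof = true;
            }
        }

        Container createValue() { return valueContainer; }

        boolean next(Container value) {
            if (eof) return false;
            if (firstRecord) {
                // Assumes the caller holds valueContainer; 'value' is not touched.
                firstRecord = false;
                return true;
            }
            if (records.hasNext()) {
                value.set(records.next());
                return true;
            }
            eof = true;
            return false;
        }
    }

    /** Toy combining reader (hypothetical): moves to the next file at EOF. */
    static final class ToyCombineReader {
        private final Iterator<List<String>> files;
        private ToyReader current;

        ToyCombineReader(List<List<String>> files) {
            this.files = files.iterator();
            advance();
        }

        private boolean advance() {
            if (!files.hasNext()) { current = null; return false; }
            current = new ToyReader(files.next());
            return true;
        }

        // Assumes at least one input file, as in the demo below.
        Container createValue() { return current.createValue(); }

        boolean next(Container value) {
            while (current == null || !current.next(value)) {
                if (!advance()) return false;
            }
            return true;
        }
    }

    /** Caller loop: the value object is created once and reused. */
    static List<String> readAll(List<List<String>> files) {
        ToyCombineReader reader = new ToyCombineReader(files);
        Container value = reader.createValue();
        List<String> seen = new ArrayList<>();
        while (reader.next(value)) {
            seen.add(value.get());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Two single-record inputs, as in the report.
        System.out.println(readAll(Arrays.asList(
                Arrays.asList("A"), Arrays.asList("B")))); // prints [A, A]
    }
}
```

In this model the second reader's first record only ever lands in that reader's private container, while the caller keeps reading the container it obtained from the first reader. That is why "A" appears twice and "B" never appears.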