[jira] [Updated] (PARQUET-1947) DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data

Daniel Dai (Jira) Mon, 30 Nov 2020 10:46:34 -0800


     [ 
https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Dai updated PARQUET-1947:
--------------------------------
    Attachment: Part1.java

> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong 
> data
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-1947
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cascading
>            Reporter: Daniel Dai
>            Priority: Major
>         Attachments: Part1.java
>
>
> When we read Parquet files using Cascading 2, we observe wrong data at the 
> file boundary when input combining is turned on in Cascading 
> (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing 
> one record. A simple Cascading application (attached) reads the two inputs 
> with setUseCombinedInput(true). The output contains the record from the 
> first input file twice, and the record from the second input file is missing.
> Here is the call sequence, showing what happens after the last record of 
> the first input has been read:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(); the value 
> it gets back still holds the last record of the first input.
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of 
> the first input.
> 3. CombineFileRecordReader creates a new 
> DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new 
> "value" variable containing the first record of the second input.
> 4. CombineFileRecordReader invokes RecordReader.next on the new 
> RecordReaderWrapper, but since the firstRecord flag is set, next does 
> nothing.
> 5. Thus the "value" variable containing the first record of the second input 
> is lost, and Cascading reuses the last record of the first input.
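
The call sequence above can be modelled without Hadoop or Parquet on the
classpath. The sketch below is a simplified stand-in, not the actual
parquet-mr or Hadoop code: Container, WrapperReader, CombineReader and
readAll are hypothetical names, and files are plain lists of strings. It
assumes the wrapper reads its first record eagerly into its own container
and that next() skips the copy while the firstRecord flag is set, as the
description states.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CombineBoundaryDemo {

    /** Mutable holder; hypothetical stand-in for the reader's "value" object. */
    static final class Container {
        String record;
    }

    /** Simplified model of DeprecatedParquetInputFormat.RecordReaderWrapper. */
    static final class WrapperReader {
        private final Iterator<String> records;
        private final Container valueContainer = new Container();
        private boolean firstRecord;
        private boolean eof;

        WrapperReader(List<String> file) {
            records = file.iterator();
            // Eagerly read the first record into this reader's own container.
            if (records.hasNext()) {
                valueContainer.record = records.next();
                firstRecord = true;
            } else {
                eof = true;
            }
        }

        Container createValue() {
            return valueContainer;
        }

        boolean next(Container value) {
            if (eof) {
                return false;
            }
            if (firstRecord) {
                // Nothing is copied into 'value' here; this is only correct
                // if 'value' is the container this reader created itself.
                firstRecord = false;
                return true;
            }
            if (records.hasNext()) {
                value.record = records.next();
                return true;
            }
            eof = true;
            return false;
        }
    }

    /** Simplified model of CombineFileRecordReader chaining per-file readers. */
    static final class CombineReader {
        private final Iterator<List<String>> files;
        private WrapperReader current;

        CombineReader(List<List<String>> files) {
            this.files = files.iterator();
            advance();
        }

        private boolean advance() {
            if (!files.hasNext()) {
                current = null;
                return false;
            }
            current = new WrapperReader(files.next());
            return true;
        }

        Container createValue() {
            return current == null ? new Container() : current.createValue();
        }

        boolean next(Container value) {
            // On EOF of the current file, switch to the next file's reader
            // and retry with the caller's existing value object.
            while (current == null || !current.next(value)) {
                if (!advance()) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Caller loop: obtains a value, then asks the combined reader to fill it. */
    static List<String> readAll(List<List<String>> files) {
        CombineReader reader = new CombineReader(files);
        List<String> seen = new ArrayList<>();
        while (true) {
            Container value = reader.createValue(); // step 1
            if (!reader.next(value)) {              // steps 2 to 4
                break;
            }
            seen.add(value.record);                 // step 5
        }
        return seen;
    }

    public static void main(String[] args) {
        // Two single-record "files", as in the reproduction described above.
        System.out.println(readAll(List.of(List.of("A"), List.of("B"))));
    }
}
```

With two single-record inputs the model yields the first record twice and
never yields the second, which matches the reported symptom. In this model a
single input is read correctly, because the caller's value object and the
reader's own container are then the same object.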



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
