[
https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17240975#comment-17240975
]
ASF GitHub Bot commented on PARQUET-1947:
-----------------------------------------
daijyc opened a new pull request #844:
URL: https://github.com/apache/parquet-mr/pull/844
…would produce wrong data
Make sure you have checked _all_ steps below.
### Jira
- [ ] My PR addresses the following [Parquet
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
them in the PR title:
- https://issues.apache.org/jira/browse/PARQUET-1947
### Tests
- [ ] My PR adds the following unit tests:
DeprecatedInputFormatTest.testCombineParquetInputFormat
----------------------------------------------------------------
> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong
> data
> -------------------------------------------------------------------------------
>
> Key: PARQUET-1947
> URL: https://issues.apache.org/jira/browse/PARQUET-1947
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cascading
> Reporter: Daniel Dai
> Priority: Major
> Attachments: Part1.java
>
>
> When we read Parquet files using Cascading 2, we observe wrong data at file
> boundaries when input combining is turned on in Cascading
> (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing
> one record. A simple Cascading application (attached) reads the two inputs
> with setUseCombinedInput(true). The output contains the record from the
> first input file twice, and the record from the second input file is missing.
> The following call sequence shows what happens after the last record of the
> first input:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which
> yields the last record of the first input again
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of
> the first input
> 3. CombineFileRecordReader creates a new
> DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new
> "value" variable containing the first record of the second input
> 4. CombineFileRecordReader invokes RecordReader.next on the new
> RecordReaderWrapper, but since the firstRecord flag is set, next does
> nothing
> 5. Thus the "value" variable containing the first record of the second input
> is lost, and Cascading reuses the last record of the first input
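
The call sequence above can be imitated with a small self-contained Java sketch. It does not use the Parquet or Hadoop classes: `Container`, `ReaderWrapper`, `readAll` and the `copyOnFirst` switch are hypothetical stand-ins, and the way the wrapper eagerly reads its first record into its own container is an assumption inferred from the description. With `copyOnFirst` set to false, the first call to `next` leaves the caller's value untouched, as in step 4. Setting it to true shows one possible remedy, which is not necessarily what the pull request does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified model of the reported behaviour; every name here is hypothetical.
class CombineBoundarySketch {

    /** Mutable holder playing the role of the reusable "value" object. */
    static final class Container {
        String record;
    }

    /** Stand-in for a per-file reader wrapper that reads its first record eagerly. */
    static final class ReaderWrapper {
        private final Iterator<String> source;
        private final Container own = new Container();
        private final boolean copyOnFirst;
        private boolean firstRecord;
        private boolean eof;

        ReaderWrapper(List<String> records, boolean copyOnFirst) {
            this.source = records.iterator();
            this.copyOnFirst = copyOnFirst;
            if (source.hasNext()) {
                own.record = source.next(); // first record lands in the wrapper's own container
                firstRecord = true;
            } else {
                eof = true;
            }
        }

        Container createValue() {
            return own;
        }

        boolean next(Container value) {
            if (eof) {
                return false;
            }
            if (firstRecord) {
                firstRecord = false;
                if (copyOnFirst) {
                    value.record = own.record; // possible remedy: hand the first record to the caller
                }
                return true; // without the copy, the caller's value is left as it was
            }
            if (source.hasNext()) {
                value.record = source.next();
                return true;
            }
            eof = true;
            return false;
        }
    }

    /** Mimics a combining reader: one value object reused across all per-file readers. */
    static List<String> readAll(List<List<String>> files, boolean copyOnFirst) {
        List<String> out = new ArrayList<>();
        Container value = null;
        for (List<String> file : files) {
            ReaderWrapper reader = new ReaderWrapper(file, copyOnFirst);
            if (value == null) {
                value = reader.createValue(); // the caller creates its value only once
            }
            while (reader.next(value)) {
                out.add(value.record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(Arrays.asList("A"), Arrays.asList("B"));
        System.out.println(readAll(files, false)); // prints [A, A]
        System.out.println(readAll(files, true));  // prints [A, B]
    }
}
```

With two single-record inputs the uncorrected path returns the first record twice and drops the second, which matches the symptom in the report. With longer inputs only the record at each file boundary is affected.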