[
https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055488#comment-17055488
]
Wes McKinney commented on ARROW-6895:
-------------------------------------
I'll see if I can add a unit test and merge this for 0.17.0.
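As a starting point, a self-contained sketch of the batching arithmetic such a unit test might exercise (no Arrow API calls here; `BatchSizes` is a hypothetical helper, not part of the library):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not Arrow API): computes the sequence of batch
// sizes the reporter's loop would request -- full batches of `batchSize`,
// then one final partial batch for the remainder.
std::vector<int> BatchSizes(int nRowsRemaining, int batchSize) {
  std::vector<int> sizes;
  while (nRowsRemaining > 0) {
    int n = std::min(batchSize, nRowsRemaining);
    sizes.push_back(n);
    nRowsRemaining -= n;
  }
  return sizes;
}
```

A correct reader should return exactly these many rows per call, with no overlap between calls.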
> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader
> repeats returned values when calling `NextBatch()`
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-6895
> URL: https://issues.apache.org/jira/browse/ARROW-6895
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.15.0
> Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
> Reporter: Adam Hooper
> Assignee: Francois Saint-Jacques
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.17.0
>
> Attachments: 01-fix-arrow-6895.diff, bad.parquet,
> reset-dictionary-on-read.diff, works.parquet
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Given most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint."
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}};
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The
> first {{NextBatch()}} return value looks like {{val0...val99}}; the second
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with
> two arrays); the third return value looks like {{val0...val99,
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so
> on. The returned arrays are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong
> values.
> I've attached a minimal Parquet file that presents this problem with the
> above code; and I've written a patch that fixes this one case, to illustrate
> where things are wrong. I don't think I understand enough edge cases to
> decree that my patch is a correct fix.
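The failure mode described above can be illustrated with a minimal stand-in for the record reader state, assuming (as the report suggests) that an internal chunk list is appended to but never reset between batches. This is illustrative C++ only, not the actual Arrow implementation:

```cpp
#include <numeric>
#include <vector>

// Illustrative stand-in (not Arrow code): the reader appends each batch's
// chunk to an internal list that is never cleared, so every NextBatch()
// call returns all chunks read so far instead of only the newest one.
class BuggyReader {
 public:
  // Returns the accumulated "ChunkedArray" (modeled as a list of chunks).
  std::vector<std::vector<int>> NextBatch(int n) {
    std::vector<int> chunk(n);
    std::iota(chunk.begin(), chunk.end(), next_);  // next n values
    next_ += n;
    chunks_.push_back(chunk);  // bug: chunks_ is never reset between calls
    return chunks_;
  }

 private:
  int next_ = 0;
  std::vector<std::vector<int>> chunks_;
};
```

With this model, the second call returns two chunks (the stale first batch plus the new one), matching the observed `val0...val99, val100...val199` result; a fix would clear the accumulated chunks at the start of each call.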
--
This message was sent by Atlassian Jira
(v8.3.4#803005)