[
https://issues.apache.org/jira/browse/ARROW-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414620#comment-17414620
]
Weston Pace commented on ARROW-13982:
-------------------------------------
At the moment I can't think of anything better than empty batches. At the very
least, it seems removal of empty batches is an optimization we can explore at
some future date when the exec plan is more sophisticated.
> dataset scanner stalls when reading parquet with filtering.
> -----------------------------------------------------------
>
> Key: ARROW-13982
> URL: https://issues.apache.org/jira/browse/ARROW-13982
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 5.0.0
> Environment: ubuntu 18.04 LTS
> Reporter: Huxley Hu
> Assignee: David Li
> Priority: Major
> Labels: pull-request-available, query-engine
> Fix For: 6.0.0
>
> Attachments: repro.py
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Reading parquet files using dataset scanner may stall due to a never-finished
> future.
> To reproduce this case, one needs two parquet files and a filter expression
> that filters out one file completely. After that, call
> `AsyncScanner::ToRecordBatchReader` and read data continually.
> I have also dug into this bug a little. It's caused by the
> `MakeEmptyGenerator<std::shared_ptr<RecordBatch>>` used when the filtered row
> groups are empty, which is ignored by `FragmentToBatches` and causes
> SequencingGenerator to stall.
> A quick fix is to return a record batch with 0 rows instead of returning a
> nullptr there.
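The stall and the empty-batch workaround can be illustrated with a toy model in plain Python. This is only a sketch and not Arrow's implementation: `TaggedBatch`, `scan_fragment`, `sequence` and `run` are hypothetical names, and the model assumes one batch per fragment and a sequencer that releases batches strictly in fragment order.

```python
from collections import namedtuple

# Hypothetical stand-in for a record batch tagged with its fragment index.
TaggedBatch = namedtuple("TaggedBatch", ["fragment", "rows"])


def scan_fragment(index, rows, emit_empty_batch):
    """Yield the batches produced by one fragment after filtering."""
    if not rows:
        if emit_empty_batch:
            # Workaround: a 0-row batch still tells the sequencer that
            # this fragment has been seen.
            yield TaggedBatch(index, [])
        # Without the workaround nothing is emitted at all.
        return
    yield TaggedBatch(index, rows)


def sequence(arrivals):
    """Release batches in fragment order; return (emitted, still_buffered)."""
    expected = 0
    buffered = {}
    emitted = []
    for batch in arrivals:
        buffered[batch.fragment] = batch
        # Drain every buffered batch that is next in line.
        while expected in buffered:
            emitted.append(buffered.pop(expected))
            expected += 1
    return emitted, list(buffered.values())


def run(emit_empty_batch):
    # Fragment 0 is filtered out completely; fragment 1 keeps three rows.
    fragments = [[], [1, 2, 3]]
    arrivals = []
    for index, rows in enumerate(fragments):
        arrivals.extend(scan_fragment(index, rows, emit_empty_batch))
    return sequence(arrivals)
```

In this model, `run(False)` leaves the batch from fragment 1 buffered forever because nothing ever arrives for fragment 0, which is the analogue of the never-finished future. `run(True)` emits both batches, the first one with 0 rows.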