[ https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997100#comment-15997100 ]
Paul Rogers commented on DRILL-5472:
------------------------------------
This is a known issue with Parquet, but not currently a high priority.
The expectation is that it will be resolved as a side effect of the fix for
DRILL-5211. For that bug, we must limit vector sizes to 16 MB. At present,
the Parquet reader tries, but fails, to limit vector sizes; that failure
produces vectors of unpredictable size and low-density batches. Fixing the
Parquet vector limit to avoid memory fragmentation should, as a side effect,
also reduce the low-density problem without this issue having to be tackled
on its own.
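For illustration, a minimal self-contained sketch of the batch-flush
discipline such a cap implies (the class and constant names are assumptions
for this example, not Drill's actual reader code): once a write would push a
vector past the cap, the current batch is flushed, so density stays close to
100% except for the final partial batch.
{code}
// Minimal sketch only -- not Drill's reader. Models a reader that flushes
// the current batch before a write would push the vector past a hard cap,
// so allocated memory tracks the data actually written (high density).
public class VectorCapSketch {

  static final int VECTOR_SIZE_LIMIT = 16 * 1024 * 1024; // the 16 MB cap

  int allocatedBytes = VECTOR_SIZE_LIMIT; // assume vectors pre-allocated at the cap
  int usedBytes = 0;

  void write(int valueBytes) {
    if (usedBytes + valueBytes > VECTOR_SIZE_LIMIT) {
      flush(); // cap enforced: start a new batch instead of over-allocating
    }
    usedBytes += valueBytes;
  }

  void flush() {
    double density = (double) usedBytes / allocatedBytes; // used / allocated
    System.out.printf("flush: %,d bytes used, density %.0f%%%n",
        usedBytes, density * 100);
    usedBytes = 0;
  }

  public static void main(String[] args) {
    VectorCapSketch reader = new VectorCapSketch();
    for (int i = 0; i < 100_000; i++) {
      reader.write(512); // pretend each record writes 512 bytes into one vector
    }
    reader.flush(); // final partial batch: the only low-density flush
  }
}
{code}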
> Parquet reader generating low-density batches causing Sort operator to spill unnecessarily
> -------------------------------------------------------------------------------------------
>
> Key: DRILL-5472
> URL: https://issues.apache.org/jira/browse/DRILL-5472
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators, Storage - Parquet
> Reporter: Rahul Challapalli
> Assignee: Paul Rogers
> Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
>
>
> git.commit.id.abbrev=1e0a14c
> The parquet file used in the query below is ~20 MB; its uncompressed size is
> ~1.2 GB. The query contains a sort that is given ~6 GB of memory for a single
> fragment, and yet it spills.
> {code}
> select *
> from (select *
>       from dfs.`/drill/testdata/resource-manager/all_types_large` s
>       order by s.missing12.x) d
> where d.missing3 is false;
> {code}
> The profile indicates that the query spilled twice. The profile and logs are
> attached.
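For a rough sense of why the sort spills despite a ~6 GB budget, a
back-of-envelope sketch; the 5% density figure below is an assumption for
illustration, not a number from this report (the real figures are in the
attached profile):
{code}
// Back-of-envelope arithmetic for the spill. The 5% density is an assumed
// figure for illustration; the actual density is visible in the profile.
public class SpillArithmetic {
  public static void main(String[] args) {
    double dataGB = 1.2;       // uncompressed data, from the description
    double sortBudgetGB = 6.0; // memory granted to the sort fragment
    double density = 0.05;     // assumed fraction of batch memory holding data

    // The sort must hold whole batches, so its footprint is data / density.
    double footprintGB = dataGB / density;
    System.out.printf("footprint ~%.0f GB vs budget %.0f GB -> spills: %b%n",
        footprintGB, sortBudgetGB, footprintGB > sortBudgetGB);
  }
}
{code}
At 5% density, holding 1.2 GB of data means holding ~24 GB of allocated
vectors, several times the sort's budget, so spilling follows even with
modest data volumes.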