[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385377#comment-16385377 ]
Paul Rogers commented on DRILL-6147:
------------------------------------
@Aman, thanks for the comment. I absolutely agree that we do not want to
penalize any workloads.
The key point is that considerable effort went into the result set loader to provide
optimal performance, and it does so for both flat and nested structures. Indeed,
under the result set loader these two models are more-or-less unified and use a
single high-performance approach.
The question for you and [~sachouche] is simply this. Given that we have a
working mechanism, does it make sense to invent another one? Do we want to have
duplicate maintenance costs? Have to make changes in two places? And so on?
Given that the result set loader works for both flat and nested structures, and
performs equally well in both cases, we have an opportunity, if Parquet allows,
to have a single Parquet reader rather than the two we have today. Having one
reader also reduces ongoing maintenance costs.
So, it is purely a cost issue: should we create multiple implementations or
strive to have one? Having two is valid when the needs differ (as in the
original Parquet readers), but if the two implementations are similar (i.e.
both are fast), then the equation is not so clear.
The key change here is bulk loading. However, I have not yet seen a clear
description of how we coordinate that operation:
* Across multiple variable-width columns
* Across multiple batches
* While observing vector and batch size limits
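To make the coordination concern concrete, here is a minimal sketch in plain Java. It
deliberately does not use Drill's actual ResultSetLoader or value-vector APIs, and every
class and method name in it is hypothetical: each variable-width column tracks its own
byte budget, and the batch is harvested as soon as either the row-count cap or any
per-vector byte limit would be exceeded, so that all columns roll over together and rows
stay aligned across columns and batches.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BatchCoordinationSketch {

  /** Hypothetical accumulator for one variable-width (VarChar-like) column. */
  static final class VarWidthColumn {
    final int maxVectorBytes;              // per-vector byte limit, e.g. 1 MB here
    int usedBytes;
    final List<byte[]> values = new ArrayList<>();

    VarWidthColumn(int maxVectorBytes) { this.maxVectorBytes = maxVectorBytes; }

    boolean canAccept(byte[] v) { return usedBytes + v.length <= maxVectorBytes; }
    void write(byte[] v) { values.add(v); usedBytes += v.length; }
    void reset() { values.clear(); usedBytes = 0; }
  }

  private final List<VarWidthColumn> columns = new ArrayList<>();
  private final int maxRowsPerBatch;
  private int rowCount;
  private int batchCount;

  BatchCoordinationSketch(int maxRowsPerBatch) { this.maxRowsPerBatch = maxRowsPerBatch; }

  void addColumn(int maxVectorBytes) { columns.add(new VarWidthColumn(maxVectorBytes)); }

  /**
   * Writes one row across all columns. If the row-count cap or any column's byte
   * limit would be exceeded, the current batch is harvested first, so every column
   * rolls over at the same row. (A single value larger than the vector limit is
   * not handled here; a real loader must deal with that case too.)
   */
  void writeRow(String... cells) {
    byte[][] encoded = new byte[cells.length][];
    boolean overflow = rowCount >= maxRowsPerBatch;
    for (int i = 0; i < cells.length; i++) {
      encoded[i] = cells[i].getBytes(StandardCharsets.UTF_8);
      overflow |= !columns.get(i).canAccept(encoded[i]);
    }
    if (overflow) {
      harvestBatch();
    }
    for (int i = 0; i < encoded.length; i++) {
      columns.get(i).write(encoded[i]);
    }
    rowCount++;
  }

  /** Hands the current batch downstream (here it is only counted) and resets all columns. */
  void harvestBatch() {
    batchCount++;
    System.out.printf("batch %d: %d rows%n", batchCount, rowCount);
    rowCount = 0;
    columns.forEach(VarWidthColumn::reset);
  }

  public static void main(String[] args) {
    BatchCoordinationSketch loader = new BatchCoordinationSketch(32 * 1024);
    loader.addColumn(1 << 20);           // two variable-width columns,
    loader.addColumn(1 << 20);           // each limited to 1 MB per batch
    for (int i = 0; i < 100_000; i++) {
      loader.writeRow("value-" + i, "a-longer-variable-width-value-" + i);
    }
    loader.harvestBatch();               // flush the final partial batch
  }
}
{code}
The essential point the sketch illustrates is that the overflow decision must be made
before any cell of the row is written; otherwise one column rolls over mid-row and the
columns fall out of alignment across batches.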
Having made this work in the result set loader, I understand how hard it is to
handle all these cases together. If we do want to create a second
implementation, then we do need a clear description of how it will work.
Or, perhaps more simply, create a test case that demonstrates that the solution works
well under large file sizes, variable column widths, and constrained memory use.
IMHO, the second implementation should demonstrate large gains over the result set
loader to justify building and maintaining two parallel solutions. Such a test would
make that difference, and hence the decision, clear.
> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
> Key: DRILL-6147
> URL: https://issues.apache.org/jira/browse/DRILL-6147
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Major
> Fix For: 1.14.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows)
> when creating scan batches; there is no parameter nor any logic for
> controlling the amount of memory used. This enhancement will allow Drill to
> take an extra input parameter to control direct memory usage.
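For illustration only, here is a minimal sketch of what the described enhancement
amounts to: derive the per-batch row count from a configurable memory budget and an
estimated row width, instead of always using the hard-coded 32K rows. This is not the
actual DRILL-6147 patch; the budget, defaults, and names below are assumptions.
{code:java}
/**
 * A minimal sketch only; not the DRILL-6147 patch itself. The default budget and
 * the row-width estimates below are illustrative assumptions.
 */
public class ScanBatchSizer {

  static final long DEFAULT_BATCH_MEMORY_BYTES = 16L * 1024 * 1024;  // assumed 16 MB budget
  static final int  MAX_ROWS_PER_BATCH = 32 * 1024;                  // today's hard-coded cap
  static final int  MIN_ROWS_PER_BATCH = 1;

  /**
   * Derives the row limit for the next scan batch: the configured memory budget divided
   * by the estimated row width, clamped to [1, 32K] so that narrow rows still honor the
   * existing row-count cap and very wide rows never produce a zero-row batch.
   */
  static int rowLimit(long batchMemoryBytes, long estimatedRowWidthBytes) {
    long byWidth = batchMemoryBytes / Math.max(1, estimatedRowWidthBytes);
    return (int) Math.max(MIN_ROWS_PER_BATCH, Math.min(MAX_ROWS_PER_BATCH, byWidth));
  }

  public static void main(String[] args) {
    System.out.println(rowLimit(DEFAULT_BATCH_MEMORY_BYTES, 2048));  // 2 KB rows -> 8192
    System.out.println(rowLimit(DEFAULT_BATCH_MEMORY_BYTES, 64));    // narrow rows -> capped at 32768
  }
}
{code}
In practice the row-width estimate could come from the Parquet column metadata; the
sketch just takes it as a parameter.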