[ https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385377#comment-16385377 ]

Paul Rogers commented on DRILL-6147:
------------------------------------

@Aman, thanks for the comment. I absolutely agree that we do not want to 
penalize any workloads.

The key point is that considerable effort went into the result set loader to 
provide optimal performance, and it does so for both flat and nested structures. 
Indeed, under the result set loader, these two models are more-or-less unified 
behind a single high-performance writer mechanism.
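
To make that concrete, here is a toy sketch of what I mean by a unified writer 
model. The interfaces and class names below are purely illustrative, not Drill's 
actual API; the point is only that flat columns and nested maps go through the 
same writer calls, so one reader implementation can serve both shapes.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class UnifiedWriterSketch {

  /** Minimal scalar-column writer (illustrative only). */
  interface ScalarWriter {
    void setInt(int v);
    void setString(String v);
  }

  /** Minimal row/map writer: flat columns and nested maps share one interface. */
  interface TupleWriter {
    ScalarWriter scalar(String column);
    TupleWriter map(String column);
  }

  /** Toy implementation that simply records each write as a "path=value" string. */
  static class ToyWriter implements TupleWriter {
    private final List<String> sink;
    private final String prefix;

    ToyWriter(List<String> sink, String prefix) {
      this.sink = sink;
      this.prefix = prefix;
    }

    @Override
    public ScalarWriter scalar(String column) {
      final String path = prefix + column;
      return new ScalarWriter() {
        @Override public void setInt(int v)       { sink.add(path + "=" + v); }
        @Override public void setString(String v) { sink.add(path + "=" + v); }
      };
    }

    @Override
    public TupleWriter map(String column) {
      return new ToyWriter(sink, prefix + column + ".");
    }
  }

  public static void main(String[] args) {
    List<String> sink = new ArrayList<>();
    TupleWriter row = new ToyWriter(sink, "");

    // A flat column and a nested column use the same calls.
    row.scalar("id").setInt(1);
    row.map("address").scalar("city").setString("Santa Clara");

    System.out.println(sink);   // [id=1, address.city=Santa Clara]
  }
}
{code}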

The question for you and [~sachouche] is simply this. Given that we have a 
working mechanism, does it make sense to invent another one? Do we want to have 
duplicate maintenance costs? Have to make changes in two places? And so on?

Given that the result set loader works for both flat and nested structures, and 
performs equally well in both cases, we have an opportunity, if Parquet allows, 
to have a single Parquet reader rather than the two we have today. Having one 
reader also reduces ongoing maintenance costs.

So, it is purely a cost issue: should we create multiple implementations or 
strive to have one? Having two is valid when the needs differ (as in the 
original Parquet readers), but if the two implementations are similar (i.e. 
both are fast), then the equation is not so clear.

The key change here is bulk loading. However, I have not yet seen a clear 
description of how we coordinate that operation (a rough sketch of what I mean 
by coordination follows the list below):

* Across multiple variable-width columns
* Across multiple batches
* While observing vector and batch size limits
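
To illustrate, here is a simplified sketch of that coordination; the limits, 
class names, and byte accounting below are illustrative assumptions, not an 
actual Drill design. The essential point is that the batch must end as soon as 
any one column's byte budget, or the overall row cap, would be exceeded, and 
that all columns must roll over to the next batch together.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class BatchCoordinationSketch {

  static final int VECTOR_BYTE_LIMIT = 16 * 1024 * 1024; // illustrative per-vector byte cap
  static final int ROW_LIMIT = 32 * 1024;                // illustrative per-batch row cap

  /** Running byte total for one variable-width column in the current batch. */
  static class ColumnState {
    final String name;
    long bytesUsed;
    ColumnState(String name) { this.name = name; }
  }

  private final List<ColumnState> columns = new ArrayList<>();
  private int rowCount;
  private int batchCount;

  BatchCoordinationSketch(String... columnNames) {
    for (String name : columnNames) {
      columns.add(new ColumnState(name));
    }
  }

  /**
   * Add one row whose i-th value is valueSizes[i] bytes wide. If any column
   * (or the row count) would overflow, close the current batch first so that
   * every column rolls over to the next batch at the same row.
   */
  void writeRow(int[] valueSizes) {
    if (rowCount > 0 && wouldOverflow(valueSizes)) {
      finishBatch();
    }
    for (int i = 0; i < columns.size(); i++) {
      columns.get(i).bytesUsed += valueSizes[i];   // stand-in for the actual vector copy
    }
    rowCount++;
  }

  private boolean wouldOverflow(int[] valueSizes) {
    if (rowCount + 1 > ROW_LIMIT) {
      return true;
    }
    for (int i = 0; i < columns.size(); i++) {
      if (columns.get(i).bytesUsed + valueSizes[i] > VECTOR_BYTE_LIMIT) {
        return true;   // one wide column forces the whole batch downstream
      }
    }
    return false;
  }

  private void finishBatch() {
    batchCount++;
    System.out.printf("batch %d: %d rows%n", batchCount, rowCount);
    rowCount = 0;
    for (ColumnState c : columns) {
      c.bytesUsed = 0;
    }
  }

  public static void main(String[] args) {
    BatchCoordinationSketch loader = new BatchCoordinationSketch("a", "b");
    for (int i = 0; i < 100_000; i++) {
      loader.writeRow(new int[] { 200, 1_000 });   // column "b" exhausts its budget first
    }
    loader.finishBatch();                          // flush the final partial batch
  }
}
{code}

The hard part, of course, is doing this when the copies are bulk operations 
rather than row-by-row writes, and when overflow is detected partway through a 
copied run.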

Having made this work in the result set loader, I understand how hard it is to 
handle all these cases together. If we do want to create a second 
implementation, then we need a clear description of how it will work.

Or, what might be simpler: just create a test case that demonstrates that the 
solution works well with large files, variable column widths, and constrained 
memory use.

IMHO, the second implementation should show large gains over the result set 
loader to justify building and maintaining two parallel solutions. Such a test 
would make that difference visible so that the decision is clear.

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) 
> when creating scan batches; there is no parameter nor any logic for 
> controlling the amount of memory used. This enhancement will allow Drill to 
> take an extra input parameter to control direct memory usage.
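
As a rough illustration of the enhancement described above (not the actual 
patch), the batch row count could be derived from a configurable direct-memory 
budget and an estimated row width instead of a fixed 32K-row cap; all names and 
numbers below are assumptions for illustration only.

{code:java}
public class BatchSizeFromMemorySketch {

  /** Clamp the per-batch row count so estimated memory use stays within the budget. */
  static int rowsPerBatch(long memoryBudgetBytes, int estimatedRowWidthBytes, int hardRowCap) {
    long byMemory = memoryBudgetBytes / Math.max(1, estimatedRowWidthBytes);
    return (int) Math.max(1, Math.min(hardRowCap, byMemory));
  }

  public static void main(String[] args) {
    // For example: a 16 MB budget with ~1 KB rows yields 16384 rows, below the 32K cap.
    System.out.println(rowsPerBatch(16L * 1024 * 1024, 1024, 32 * 1024));
  }
}
{code}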



