[
https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528124#comment-16528124
]
ASF GitHub Bot commented on DRILL-6147:
---------------------------------------
ilooner commented on issue #1330: DRILL-6147: Adding Columnar Parquet Batch
Sizing functionality
URL: https://github.com/apache/drill/pull/1330#issuecomment-401453225
@vrozov My understanding was the following. QA has set up automated tests of
both the performance and the correctness of batch sizing on a real cluster.
Each batch sizing change has unit tests to validate batch size. But on a real
cluster with real data, the only viable way for QA to check the batch sizes
output by an operator right now is through logging. Since Drill
takes testing on real clusters seriously and aims to do more than just unit
tests, I think this is perfectly acceptable.
Since logging has overhead, and QA wanted to automate both the performance
and correctness tests, they required the ability to turn logging off via
sqlline. This was the approach agreed on by developers and testers in the Drill
community including @sachouche, @bitblender, @ppadma, Robert (don't know his
github id), and @priteshm.
Given the scope of agreement in the community, the fact that similar changes
have already been merged, and the minor impact on the Drill code itself (~20
lines), I suggest moving this discussion to a separate change. In my
investigation I was not able to find a viable alternative to this approach.
@vrozov, perhaps you could present an alternative approach on the dev list and
lead the proposal. It would be a great help moving forward.
In the meantime the changes proposed here represent a valuable performance
improvement for the Drill community, so let's not hold up this change over this.
> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
> Key: DRILL-6147
> URL: https://issues.apache.org/jira/browse/DRILL-6147
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows)
> when creating scan batches; there is no parameter nor any logic for
> controlling the amount of memory used. This enhancement will allow Drill to
> take an extra input parameter to control direct memory usage.
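
The idea in the description above (replace a fixed row count with a limit derived from a memory budget) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class name, the method name, and the per-row width estimate are all hypothetical, and the only value taken from the issue is the 32k row cap.

```java
// Hypothetical sketch: none of these names come from the Drill code base.
final class BatchSizeSketch {
    /** The hard-coded row limit mentioned in the issue description (32k rows). */
    static final int MAX_ROWS_PER_BATCH = 32 * 1024;

    /**
     * Returns how many rows fit in one batch for a given memory budget and an
     * estimated per-row width, capped at the row limit and never below one row.
     */
    static int rowsPerBatch(long memoryBudgetBytes, long estimatedRowWidthBytes) {
        if (memoryBudgetBytes <= 0 || estimatedRowWidthBytes <= 0) {
            throw new IllegalArgumentException("budget and row width must be positive");
        }
        long rowsThatFit = memoryBudgetBytes / estimatedRowWidthBytes;
        long capped = Math.min(rowsThatFit, MAX_ROWS_PER_BATCH);
        return (int) Math.max(1L, capped);
    }

    public static void main(String[] args) {
        long budget = 16L * 1024 * 1024; // 16 MiB, an arbitrary example value
        System.out.println(rowsPerBatch(budget, 64));   // narrow rows: the row cap applies
        System.out.println(rowsPerBatch(budget, 4096)); // wide rows: the memory budget applies
    }
}
```

With narrow rows the row cap still governs the batch; with wide rows the memory budget takes over, which is the case a fixed row count cannot handle.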