[jira] [Commented] (DRILL-6147) Limit batch size for Flat Parquet Reader

Aman Sinha (JIRA) Tue, 13 Feb 2018 10:26:37 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362833#comment-16362833
 ]


Aman Sinha commented on DRILL-6147:
-----------------------------------

 

Regarding Paul's comment "Said another way, predicate push-down forces 
row-by-row processing, even though the underlying storage format is columnar. 
(This is why the Filter operator works row-by-row.)"

This is not quite true, even though the Filter operator currently works this 
way.  As described in Daniel Abadi's blog post on columnar storage formats 
[1], vectorized processing of filter conditions yielded a 4x improvement in 
his (admittedly simple) experiment.  We do want to keep the option open for 
such enhancements.  
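
To make the distinction concrete, here is a minimal sketch in plain Java. It 
is not Drill code: the class, the method names, and the int-only data are all 
hypothetical, chosen only for illustration. It contrasts evaluating a 
predicate one row at a time over row-oriented data with evaluating it in a 
single tight loop over one column, producing a list of matching row indexes 
in both cases. Whether the column loop is actually faster depends on the JIT 
and the hardware; the sketch shows only the shape of the two approaches.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Drill.
final class VectorizedFilterSketch {

    // Row-by-row: each row is a separate int[] record, so every
    // predicate test first has to locate the field inside its row.
    static int[] filterRows(int[][] rows, int field, int threshold) {
        int[] selected = new int[rows.length];
        int count = 0;
        for (int row = 0; row < rows.length; row++) {
            if (rows[row][field] > threshold) {
                selected[count++] = row;
            }
        }
        return Arrays.copyOf(selected, count);
    }

    // Column-at-a-time: the predicate runs over one contiguous
    // primitive array. The index is always written and the counter
    // advances only on a match, which keeps the loop body free of
    // control-flow branches on the data.
    static int[] filterColumn(int[] column, int threshold) {
        int[] selected = new int[column.length];
        int count = 0;
        for (int row = 0; row < column.length; row++) {
            selected[count] = row;
            count += (column[row] > threshold) ? 1 : 0;
        }
        return Arrays.copyOf(selected, count);
    }

    public static void main(String[] args) {
        int[] column = {5, 12, 7, 30, 1};
        int[][] rows = {{5, 0}, {12, 0}, {7, 0}, {30, 0}, {1, 0}};
        // Both calls select the same rows; only the access pattern differs.
        System.out.println(Arrays.toString(filterColumn(column, 6)));
        System.out.println(Arrays.toString(filterRows(rows, 0, 6)));
    }
}
```

Both methods return the same row indexes, so a filter written the second way 
could feed the same downstream operators as one written the first way.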

I do agree with Paul on the more general point that we have to be able to 
handle efficient access to complex data (arrays, maps, repeated maps) without 
running into memory problems.  It sounds to me like an adaptive algorithm is 
needed, in which the scanner uses the 'bulk loading columnar' read where 
appropriate while still allowing the 'result set loader row-by-row' read for 
data of complex types. 
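
As a rough sketch of what such an adaptive choice could look like (again 
plain Java, not Drill code; the enums, the method names, and the rule itself 
are assumptions made for illustration), the scanner could pick a read 
strategy per projected column based on its type:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Drill.
final class AdaptiveReaderSketch {

    enum ColumnKind { SCALAR, ARRAY, MAP, REPEATED_MAP }

    enum ReadStrategy { BULK_COLUMNAR, ROW_BY_ROW }

    // Assumed rule: flat scalar columns take the bulk columnar path,
    // anything complex falls back to the row-by-row loader.
    static ReadStrategy chooseFor(ColumnKind kind) {
        return kind == ColumnKind.SCALAR
                ? ReadStrategy.BULK_COLUMNAR
                : ReadStrategy.ROW_BY_ROW;
    }

    // Applies the rule to every projected column, preserving order.
    static List<ReadStrategy> plan(List<ColumnKind> projected) {
        List<ReadStrategy> strategies = new ArrayList<>(projected.size());
        for (ColumnKind kind : projected) {
            strategies.add(chooseFor(kind));
        }
        return strategies;
    }

    public static void main(String[] args) {
        List<ColumnKind> projected = Arrays.asList(
                ColumnKind.SCALAR, ColumnKind.ARRAY, ColumnKind.REPEATED_MAP);
        System.out.println(plan(projected));
    }
}
```

A real implementation would presumably also take memory limits and column 
widths into account; the type-only rule here is the simplest possible 
starting point.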

[1] http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
 

> Limit batch size for Flat Parquet Reader
> ----------------------------------------
>
>                 Key: DRILL-6147
>                 URL: https://issues.apache.org/jira/browse/DRILL-6147
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The Parquet reader currently uses a hard-coded batch size limit (32k rows) 
> when creating scan batches; there is no parameter nor any logic for 
> controlling the amount of memory used. This enhancement will allow Drill to 
> take an extra input parameter to control direct memory usage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
