[ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014777#comment-17014777
 ] 

Jacques Nadeau commented on PARQUET-1698:
-----------------------------------------

In our internal work we actually separate this out between the responsibility 
of the IO layer and the reader. This allows the IO layer to figure out how 
aggressively it reads ahead based on other attributes such as current memory 
availability. I suggest separating it out that way as opposed to the row group 
reader actually doing this work itself. Basically, you want to say: I have 
several independent streams within the same file that I will probably read with 
these ranges--don't blow the readahead. You could then have one implementation 
that just does full prebuffering and another which is more conservative with 
memory.

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1698
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1698
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Zherui Cao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory then deserialize 
> the constituent columns from this
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns)  will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to