[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

Wes McKinney (Jira) Mon, 13 Jan 2020 16:09:22 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014744#comment-17014744
 ]


Wes McKinney commented on PARQUET-1698:
---------------------------------------

I think the pre-buffering should probably be implemented at the RowGroupReader 
level. Something like:

{code}
rg_reader->PreBufferColumns(column_indices);
{code}

What do you think? Then we can expose this pre-buffering as an option at the 
Arrow read and Datasets levels. Another option would be to set the pre-buffer 
column indices in {{ReaderProperties}} (tomay-to, tomah-to, I guess).
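
To make the proposal concrete, here is a rough, self-contained sketch of how a call like {{PreBufferColumns}} might behave. Everything in it is an assumption for illustration: {{CountingFile}}, {{ColumnChunkRange}} and {{SketchRowGroupReader}} are hypothetical stand-ins, not parquet-cpp classes, and the single-span read is only one possible strategy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a high-latency file handle (e.g. an S3 object).
// It only counts how many read calls reach the underlying storage.
class CountingFile {
 public:
  explicit CountingFile(std::string data) : data_(std::move(data)) {}

  std::string ReadAt(int64_t offset, int64_t length) {
    assert(offset >= 0 && length >= 0 &&
           offset + length <= static_cast<int64_t>(data_.size()));
    ++num_reads_;
    return data_.substr(static_cast<std::size_t>(offset),
                        static_cast<std::size_t>(length));
  }

  int num_reads() const { return num_reads_; }

 private:
  std::string data_;
  int num_reads_ = 0;
};

// Byte range of one serialized column chunk inside the file.
struct ColumnChunkRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical row group reader; NOT the real parquet::RowGroupReader.
class SketchRowGroupReader {
 public:
  SketchRowGroupReader(CountingFile* file, std::vector<ColumnChunkRange> chunks)
      : file_(file), chunks_(std::move(chunks)) {}

  // Issue one read covering the span of all requested column chunks, then
  // cache each column's serialized bytes for later deserialization.
  void PreBufferColumns(const std::vector<int>& column_indices) {
    if (column_indices.empty()) return;
    int64_t begin = chunks_.at(static_cast<std::size_t>(column_indices[0])).offset;
    int64_t end = begin;
    for (int i : column_indices) {
      const ColumnChunkRange& c = chunks_.at(static_cast<std::size_t>(i));
      begin = std::min(begin, c.offset);
      end = std::max(end, c.offset + c.length);
    }
    const std::string span = file_->ReadAt(begin, end - begin);
    for (int i : column_indices) {
      const ColumnChunkRange& c = chunks_.at(static_cast<std::size_t>(i));
      buffered_[i] = span.substr(static_cast<std::size_t>(c.offset - begin),
                                 static_cast<std::size_t>(c.length));
    }
  }

  // Serve a column from the pre-buffered bytes if present; otherwise fall
  // back to an independent read against the file handle.
  std::string ReadColumnBytes(int i) {
    auto it = buffered_.find(i);
    if (it != buffered_.end()) return it->second;
    const ColumnChunkRange& c = chunks_.at(static_cast<std::size_t>(i));
    return file_->ReadAt(c.offset, c.length);
  }

 private:
  CountingFile* file_;
  std::vector<ColumnChunkRange> chunks_;
  std::map<int, std::string> buffered_;
};
```

In this sketch, calling {{PreBufferColumns(\{0, 2\})}} on a four-column row group issues a single read, and later {{ReadColumnBytes}} calls for columns 0 and 2 are served from memory, while column 3 still triggers its own read. Note that the single span also pulls in the bytes of any unrequested columns lying between the requested ones, which is the trade-off for sparse column selections.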

cc [~npr] [~fsaintjacques] [~bkietz]

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1698
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1698
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Zherui Cao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns 
> independently and allowing unbridled {{Read}} calls to the underlying file 
> handle can yield suboptimal performance. In such cases, it may be preferable 
> to first read the entire serialized row group into memory, then deserialize 
> the constituent columns from that buffer.
> Note that such an option would not be appropriate as a default behavior for 
> all file handle types since low-selectivity reads (example: reading only 3 
> columns out of a file with 100 columns) will be suboptimal in some cases. I 
> think it would be better for "high latency" file systems to opt into this 
> option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
