Wes McKinney created PARQUET-1698:
-------------------------------------

             Summary: [C++] Add reader option to pre-buffer entire serialized row group into memory
                 Key: PARQUET-1698
                 URL: https://issues.apache.org/jira/browse/PARQUET-1698
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Wes McKinney
             Fix For: cpp-1.6.0


In some scenarios (example: reading datasets from Amazon S3), reading columns 
independently and allowing unbridled {{Read}} calls to the underlying file 
handle can yield suboptimal performance. In such cases, it may be preferable to 
first read the entire serialized row group into memory and then deserialize the 
constituent columns from that buffer.
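
To make the idea concrete, here is a minimal sketch of the two read strategies. It does not use the parquet-cpp API: {{CountingFile}}, {{ChunkRange}}, and both functions are hypothetical names invented for illustration, and the "file" is an in-memory string that counts {{ReadAt}} calls as a stand-in for round trips to a high-latency store.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a high-latency file handle (not a parquet-cpp
// class). It counts read calls so the two strategies can be compared.
class CountingFile {
 public:
  explicit CountingFile(std::string data) : data_(std::move(data)) {}

  // Return `length` bytes starting at `offset`; each call models one round trip.
  std::string ReadAt(std::size_t offset, std::size_t length) {
    assert(offset + length <= data_.size());
    ++num_reads_;
    return data_.substr(offset, length);
  }

  int num_reads() const { return num_reads_; }

 private:
  std::string data_;
  int num_reads_ = 0;
};

// Byte range of a serialized region in the file (illustrative only).
struct ChunkRange {
  std::size_t offset;
  std::size_t length;
};

// Strategy 1: issue one read per requested column chunk.
std::vector<std::string> ReadColumnsIndependently(
    CountingFile& file, const std::vector<ChunkRange>& chunks) {
  std::vector<std::string> out;
  for (const ChunkRange& c : chunks) {
    out.push_back(file.ReadAt(c.offset, c.length));
  }
  return out;
}

// Strategy 2: read the whole serialized row group once, then slice the
// requested column chunks out of the in-memory buffer.
std::vector<std::string> ReadColumnsPreBuffered(
    CountingFile& file, ChunkRange row_group,
    const std::vector<ChunkRange>& chunks) {
  const std::string buffer = file.ReadAt(row_group.offset, row_group.length);
  std::vector<std::string> out;
  for (const ChunkRange& c : chunks) {
    // Every requested chunk must lie inside the pre-buffered row group.
    assert(c.offset >= row_group.offset);
    assert(c.offset + c.length <= row_group.offset + row_group.length);
    out.push_back(buffer.substr(c.offset - row_group.offset, c.length));
  }
  return out;
}
```

Both functions return the same bytes for the same chunks; the difference is that the pre-buffered variant touches the file handle once per row group, regardless of how many columns are requested.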

Note that such an option would not be appropriate as a default behavior for all 
file handle types, since low-selectivity reads (example: reading only 3 columns 
out of a file with 100 columns) will be suboptimal in some cases. I think it 
would be better for "high latency" file systems to opt into this option.
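
The trade-off can be illustrated with a rough cost sketch. Everything here is an assumption made for illustration: {{ReaderOptions}}, its {{pre_buffer_row_group}} flag, {{Plan}}, and {{PlanRowGroupRead}} are hypothetical names, not existing parquet-cpp API. The flag defaults to off, so a file system has to opt in.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reader option; the name is illustrative, not a parquet-cpp API.
struct ReaderOptions {
  // Off by default; a "high latency" file system would opt in.
  bool pre_buffer_row_group = false;
};

// Cost of reading one row group: number of read calls and bytes fetched.
struct Plan {
  std::size_t num_reads;
  std::size_t bytes_fetched;
};

// Estimate the I/O needed to read the selected column chunks of a row group.
Plan PlanRowGroupRead(const ReaderOptions& options,
                      std::size_t row_group_bytes,
                      const std::vector<std::size_t>& selected_chunk_bytes) {
  if (options.pre_buffer_row_group) {
    // One read covering the whole serialized row group.
    return {1, row_group_bytes};
  }
  // One read per selected column chunk, fetching only those bytes.
  std::size_t total = 0;
  for (std::size_t n : selected_chunk_bytes) {
    total += n;
  }
  return {selected_chunk_bytes.size(), total};
}
```

With 100 equally sized column chunks of 1000 bytes each and 3 columns selected, the default plan is 3 reads fetching 3000 bytes, while the pre-buffered plan is 1 read fetching 100000 bytes. Whether fewer round trips outweigh the extra bytes depends on the latency and bandwidth of the file system, which is why this sketch leaves the choice to the caller.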

cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
