[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568368#comment-16568368 ]
Uwe L. Korn commented on PARQUET-1370:
--------------------------------------
[~rgruener] I was also plagued by this issue, but I wrapped my Python file
objects in [https://docs.python.org/3/library/io.html#io.BufferedReader] and
that gave me sufficient performance. This was especially useful for me since
I work with object stores like S3 or Azure Blob, where the difference between
consecutive reads of 40 KB and 512 KB is negligible and the HTTP request
overhead is the main bottleneck.
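
For illustration, a minimal sketch of that wrapping. Only io.BufferedReader is
the stdlib piece being described; the RemoteRawFile class and the fetch
callable are hypothetical stand-ins for an object-store client:

{code:python}
import io

class RemoteRawFile(io.RawIOBase):
    """Hypothetical raw reader: every read() turns into one range request."""

    def __init__(self, fetch_range, size):
        self._fetch_range = fetch_range  # callable(offset, length) -> bytes
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        data = self._fetch_range(self._pos, len(b))
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Stand-in for an S3/Azure range request; a real fetch would issue an HTTP GET.
blob = bytes(1024 * 1024)
fetch = lambda offset, length: blob[offset:offset + length]

# BufferedReader turns many small page reads into few large range requests.
f = io.BufferedReader(RemoteRawFile(fetch, len(blob)), buffer_size=512 * 1024)
f.read(40 * 1024)  # served from the 512 KB buffer, not a fresh request
{code}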
> Read consecutive column chunks in a single scan
> -----------------------------------------------
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Robert Gruener
> Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page;
> see
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181].
> For remote filesystems this can be very inefficient when reading many small
> columns. The Java implementation already reads consecutive column chunks
> (and the resulting pages) in a single scan; see
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786].
>
> This might be a bit difficult to do, as it would require restructuring a lot
> of the code, but it would certainly be valuable for workloads concerned with
> optimal read performance.
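
For illustration, a sketch of the coalescing idea described in the issue, in
Python for brevity. The names coalesce_ranges, read_chunks, and read_at are
hypothetical, not the actual parquet-cpp or parquet-mr API:

{code:python}
def coalesce_ranges(ranges):
    """Merge byte ranges that sit back-to-back in the file so each run can
    be fetched with a single scan. `ranges` is a list of (offset, length)
    pairs, one per column chunk (illustrative layout)."""
    runs = []
    for offset, length in sorted(ranges):
        if runs and runs[-1][0] + runs[-1][1] == offset:
            runs[-1][1] += length          # extend the current run
        else:
            runs.append([offset, length])  # start a new run
    return [(o, l) for o, l in runs]


def read_chunks(read_at, ranges):
    """One read per consecutive run instead of one per chunk/page.
    `read_at(offset, length)` stands in for the filesystem scan."""
    runs = coalesce_ranges(ranges)
    buffers = {o: read_at(o, l) for o, l in runs}
    out = []
    for offset, length in ranges:
        for run_off, run_len in runs:
            if run_off <= offset and offset + length <= run_off + run_len:
                start = offset - run_off
                out.append(buffers[run_off][start:start + length])
                break
    return out


# Three consecutive 16 KB column chunks collapse into a single 48 KB scan.
data = bytes(range(256)) * 256
chunks = [(0, 16384), (16384, 16384), (32768, 16384)]
parts = read_chunks(lambda o, l: data[o:o + l], chunks)
assert len(parts) == 3 and all(len(p) == 16384 for p in parts)
{code}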
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)