[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

Steve Loughran (Jira) Thu, 11 Aug 2022 12:53:04 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578637#comment-17578637
 ]


Steve Loughran commented on PARQUET-2171:
-----------------------------------------

bq. I have found ByteBuffer to impose a nontrivial amount of overhead, and you 
might want to consider providing array-based methods as well.

Mixed feelings. It's hard to work with, but some libraries (parquet...) love
it, which partly drove our use of it. If you use on-heap buffers, it's just
arrays with more hassle.

FWIW, I was looking at some of the parquet read code and concluding that the
s3a FS should implement read(ByteBuffer) as a single vectored IO read.
Currently the base class implementation reads into a temp byte array and so
breaks prefetching: the s3a FS only sees the read(bytes) of the shorter array,
not the full amount wanted.
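
To make the difference concrete, here's a minimal stdlib-only sketch. It is
not the actual Hadoop or S3A code: RecordingStream, readViaTempArray and
readWholeBuffer are hypothetical names made up for illustration, and the
8 KiB temp array size is an assumption. The stream just records the size of
every request it sees, which is all a store-backed stream has to go on when
deciding how much to fetch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these names come from Hadoop or parquet. */
final class ReadByteBufferSketch {

    /** Hypothetical stand-in for a store-backed stream: records each request size. */
    static class RecordingStream extends InputStream {
        private final ByteArrayInputStream in;
        final List<Integer> requestSizes = new ArrayList<>();

        RecordingStream(byte[] data) {
            in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() {
            requestSizes.add(1);
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) {
            requestSizes.add(len);
            return in.read(b, off, len);
        }
    }

    /** Fallback style: copy through a short temp array, so the stream only sees short reads. */
    static int readViaTempArray(RecordingStream s, ByteBuffer dst, int tempSize) {
        byte[] tmp = new byte[tempSize];
        int total = 0;
        while (dst.hasRemaining()) {
            int n = s.read(tmp, 0, Math.min(tmp.length, dst.remaining()));
            if (n < 0) {
                break; // end of stream
            }
            dst.put(tmp, 0, n);
            total += n;
        }
        return total;
    }

    /** Alternative: one request for everything the caller wants (dst.remaining()). */
    static int readWholeBuffer(RecordingStream s, ByteBuffer dst) {
        int want = dst.remaining();
        if (dst.hasArray()) {
            // On-heap buffer: read straight into the backing array.
            int n = s.read(dst.array(), dst.arrayOffset() + dst.position(), want);
            if (n > 0) {
                dst.position(dst.position() + n);
            }
            return n;
        }
        // Direct buffer: still a copy, but the stream sees the full length once.
        byte[] tmp = new byte[want];
        int n = s.read(tmp, 0, want);
        if (n > 0) {
            dst.put(tmp, 0, n);
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] data = new byte[64 * 1024];

        RecordingStream chunked = new RecordingStream(data);
        readViaTempArray(chunked, ByteBuffer.allocate(data.length), 8 * 1024);
        System.out.println("temp array:   " + chunked.requestSizes);

        RecordingStream whole = new RecordingStream(data);
        readWholeBuffer(whole, ByteBuffer.allocate(data.length));
        System.out.println("whole buffer: " + whole.requestSizes);
    }
}
```

With a 64 KiB buffer, the temp-array path shows up at the stream as eight
8 KiB requests, while the whole-buffer path shows up as one 64 KiB request.
That single request is the information a store connector would need to issue
one ranged read instead of guessing from a series of short ones.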

> Implement vectored IO in parquet file format
> --------------------------------------------
>
>                 Key: PARQUET-2171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2171
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Mukund Thakur
>            Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others which use
> parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
