[
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578637#comment-17578637
]
Steve Loughran commented on PARQUET-2171:
-----------------------------------------
bq. I have found ByteBuffer to impose a nontrivial amount of overhead, and you
might want to consider providing array-based methods as well.
Mixed feelings. It's hard to work with, but some libraries (parquet...) love it,
which partly drove our use of it. If you use on-heap buffers, it is just arrays
with more hassle.
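
To illustrate the "arrays with more hassle" point, here is a minimal sketch using only java.nio. The class and method names (HeapBufferDemo, sum) are made up for this example. An on-heap buffer exposes its backing array, but the caller has to account for arrayOffset(), position() and remaining(); a direct buffer has no accessible array and needs the slower relative get() path.

```java
import java.nio.ByteBuffer;

class HeapBufferDemo {
    /** Sums the remaining bytes without changing the caller's position. */
    static int sum(ByteBuffer buf) {
        int total = 0;
        if (buf.hasArray()) {
            // On-heap: work on the backing array, honouring offset and position.
            byte[] a = buf.array();
            int start = buf.arrayOffset() + buf.position();
            int end = start + buf.remaining();
            for (int i = start; i < end; i++) {
                total += a[i];
            }
        } else {
            // Direct (off-heap): no accessible array, so read through a duplicate.
            ByteBuffer dup = buf.duplicate();
            while (dup.hasRemaining()) {
                total += dup.get();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        ByteBuffer heap = ByteBuffer.wrap(data, 1, 3); // covers bytes 2, 3, 4
        ByteBuffer direct = ByteBuffer.allocateDirect(3);
        direct.put(new byte[] {2, 3, 4});
        direct.flip();
        System.out.println("heap sum   = " + sum(heap));
        System.out.println("direct sum = " + sum(direct));
    }
}
```

Both calls cover the same three bytes, so the two sums should match; the difference is only in how much bookkeeping the array path needs.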
FWIW, I was looking at some of the parquet read code and concluding that the
s3a FS should implement read(ByteBuffer) as a single vectored IO read.
Currently the base class implementation reads into a temp byte array and so
breaks prefetching: the s3a FS only sees the read(bytes) of the shorter array,
not the full amount wanted.
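
A rough sketch of that effect, using only the JDK. This is not the Hadoop base class code: RecordingStream, fallbackRead and the 8 KiB temp size are all assumptions made for illustration. It shows why a stream underneath a temp-array copy loop never learns how much the caller actually wanted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stream that records the length asked for by each bulk read. */
class RecordingStream extends InputStream {
    final List<Integer> requested = new ArrayList<>();
    private long remaining;

    RecordingStream(long size) {
        this.remaining = size;
    }

    @Override
    public int read() {
        if (remaining <= 0) {
            return -1;
        }
        remaining--;
        return 0;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        requested.add(len); // this is all the stream gets to see
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        int n = (int) Math.min(len, remaining);
        remaining -= n;
        return n; // contents are left as zeros; only sizes matter here
    }
}

class FallbackReadDemo {
    /** Illustrative fallback: fill dst by copying through a small temp array. */
    static int fallbackRead(InputStream in, ByteBuffer dst, int tempSize)
            throws IOException {
        byte[] temp = new byte[tempSize];
        int total = 0;
        while (dst.hasRemaining()) {
            int n = in.read(temp, 0, Math.min(temp.length, dst.remaining()));
            if (n < 0) {
                break; // end of stream
            }
            dst.put(temp, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        RecordingStream stream = new RecordingStream(1 << 20);
        ByteBuffer wanted = ByteBuffer.allocate(64 * 1024); // caller wants 64 KiB
        int n = fallbackRead(stream, wanted, 8 * 1024);
        System.out.println(n + " bytes via " + stream.requested.size()
                + " reads of sizes " + stream.requested);
    }
}
```

With these numbers the stream sees eight separate 8 KiB requests rather than one 64 KiB request, so any prefetch or range planning keyed on request size works from the wrong figure.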
> Implement vectored IO in parquet file format
> --------------------------------------------
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Mukund Thakur
> Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others which use
> Parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867