[jira] [Commented] (HADOOP-11867) FS API: Add a high-performance vectored Read to FSDataInputStream API

Owen O'Malley (Jira) Mon, 21 Sep 2020 09:20:42 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199483#comment-17199483
 ]


Owen O'Malley commented on HADOOP-11867:
----------------------------------------

To follow up on this, the benchmarks compare:

File systems:
 * raw = raw local file system
 * local = local file system with checksums layered on top

ByteBuffer implementation:
 * direct = direct byte buffers
 * array = array backed byte buffers

Read method:
 * asyncFileChanArray = reading using Java's async file channel (no Hadoop fs)
 * asyncRead = my new code added in this PR
 * syncRead = the current code

So, the current code is by far the slowest, and using Java's native async file 
channel directly is the fastest. (The code in this PR uses the async file 
channel but goes through the Hadoop fs API, so that isn't surprising.) The nice 
bit is that the raw fs code gets close to the native async file channel speeds. 
Even the local fs with the checksum reads & validation is still very fast 
(3.75x the current checksum code).
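
For readers unfamiliar with the asyncFileChanArray baseline, here is a rough 
sketch of what reading several ranges through Java's AsynchronousFileChannel 
can look like. It is not the benchmark code from the PR; the class name 
AsyncRangeRead, the method readRanges and its parameters are made up for 
illustration, and only JDK classes are used. The direct flag mirrors the 
direct vs. array buffer choice above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical helper, not part of Hadoop or of the PR. */
public class AsyncRangeRead {

    /**
     * Issues one asynchronous read per (offset, length) range, then waits
     * for all of them. Returned buffers are flipped and ready for reading.
     */
    public static ByteBuffer[] readRanges(Path file, long[] offsets,
                                          int[] lengths, boolean direct)
            throws IOException, InterruptedException, ExecutionException {
        if (offsets.length != lengths.length) {
            throw new IllegalArgumentException("offsets/lengths size mismatch");
        }
        ByteBuffer[] buffers = new ByteBuffer[offsets.length];
        try (AsynchronousFileChannel channel =
                 AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            // Submit every read up front so they can proceed concurrently.
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < offsets.length; i++) {
                buffers[i] = direct
                    ? ByteBuffer.allocateDirect(lengths[i])
                    : ByteBuffer.allocate(lengths[i]);
                pending.add(channel.read(buffers[i], offsets[i]));
            }
            for (int i = 0; i < offsets.length; i++) {
                int n = pending.get(i).get();
                // A read may return fewer bytes than requested, so keep
                // reading from the next unread position until full.
                while (n >= 0 && buffers[i].hasRemaining()) {
                    n = channel.read(buffers[i],
                                     offsets[i] + buffers[i].position()).get();
                }
                if (buffers[i].hasRemaining()) {
                    throw new EOFException("range " + i + " runs past end of file");
                }
                buffers[i].flip();
            }
        }
        return buffers;
    }
}
```

Submitting all reads before waiting on any of them is what lets the OS or 
the channel's thread pool overlap the I/O, which a seek-then-read loop on a 
single stream cannot do.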

 

> FS API: Add a high-performance vectored Read to FSDataInputStream API
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11867
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3, hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Owen O'Malley
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The most effective way to read from a filesystem efficiently is to let the 
> FileSystem implementation handle the seek behaviour underneath the API as 
> efficiently as possible.
> A better approach to the seek problem is to provide a sequence of read 
> locations as part of a single call, while letting the system schedule/plan 
> the reads ahead of time.
> This is exceedingly useful for seek-heavy readers on HDFS, since this allows 
> for potentially optimizing away the seek-gaps within the FSDataInputStream 
> implementation.
> For seek+read systems with even more latency than locally-attached disks, 
> something like a {{readFully(long[] offsets, ByteBuffer[] chunks)}} would 
> take care of the seeks internally while reading chunk.remaining() bytes into 
> each chunk (which may be {{slice()}}ed off a bigger buffer).
> The base implementation can stub this in as a sequence of seeks + read() into 
> ByteBuffers, without forcing each FS implementation to override this in any 
> way.
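
The fallback described in the last paragraph of the issue could look roughly 
like the sketch below. It keeps the readFully(long[] offsets, ByteBuffer[] 
chunks) shape proposed in the description, but as an assumption it runs 
against the JDK's SeekableByteChannel rather than FSDataInputStream so that 
it needs no Hadoop classes; the class name SequentialVectoredRead is 
hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

/** Hypothetical stand-in for a default, non-optimized implementation. */
public class SequentialVectoredRead {

    /**
     * For each offset, seeks there and reads chunk.remaining() bytes into
     * the matching chunk. Ranges are served one after another, in order.
     */
    public static void readFully(SeekableByteChannel in, long[] offsets,
                                 ByteBuffer[] chunks) throws IOException {
        if (offsets.length != chunks.length) {
            throw new IllegalArgumentException("offsets/chunks size mismatch");
        }
        for (int i = 0; i < offsets.length; i++) {
            in.position(offsets[i]);              // the "seek"
            while (chunks[i].hasRemaining()) {    // the "read", until full
                if (in.read(chunks[i]) < 0) {
                    throw new EOFException("chunk " + i + " runs past end of file");
                }
            }
        }
    }
}
```

A file system that can do better (by merging nearby ranges or issuing the 
reads in parallel, for example) would replace this loop, while callers keep 
the same single call.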



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

