[jira] [Commented] (HADOOP-11867) FS API: Add a high-performance vectored Read to FSDataInputStream API

Steve Loughran (JIRA) Mon, 03 Dec 2018 04:26:11 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-11867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707117#comment-16707117
 ]


Steve Loughran commented on HADOOP-11867:
-----------------------------------------

PositionedReadable is an interface. The good news: Java 8 default methods. It 
should be possible to add a default implementation which hands off to a helper 
class that implements the base IO, maybe with multiple strategies such as:

* pure FS: sequential reads, maybe starting at the current known offset and 
going forward.
* object store with sequential GET and expensive seek: merge reads.
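
A minimal sketch of what the "merge reads" strategy could look like, assuming 
ranges are plain (offset, length) pairs. The names Range, RangeMerger and 
maxGap are hypothetical and are not part of any Hadoop API; the idea is only 
that ranges separated by a small gap get coalesced into one larger read, so an 
object store issues fewer GETs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical range type for illustration; not a Hadoop class. */
final class Range {
  final long offset;
  final int length;

  Range(long offset, int length) {
    this.offset = offset;
    this.length = length;
  }

  long end() {
    return offset + length;
  }
}

/** Hypothetical helper: coalesces ranges whose gap is at most maxGap bytes. */
final class RangeMerger {
  static List<Range> merge(List<Range> input, long maxGap) {
    // Sort a copy by offset so the caller's list is left untouched.
    List<Range> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparingLong(r -> r.offset));

    List<Range> merged = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current != null && r.offset - current.end() <= maxGap) {
        // Close enough (or overlapping): extend the current merged range.
        // Assumes the merged length still fits in an int.
        long end = Math.max(current.end(), r.end());
        current = new Range(current.offset, (int) (end - current.offset));
      } else {
        if (current != null) {
          merged.add(current);
        }
        current = r;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;
  }
}
```

The caller would then issue one read per merged range and slice the results 
back into the originally requested ranges. Choosing maxGap is a trade-off 
between wasted bytes and extra requests, and would probably differ per store.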

There will be some "fun" concurrency issues here.

* If one or more future reads are live, will you be able to make other 
PositionedReadable calls, or will they block?
* When a file changes, what happens to the readers? Me: no guarantees as to 
order of execution, hence no guarantees about consistency.
* If there are one or more active async reads and a new async read is 
submitted, will the new read be executed after the active ones? Me: no 
guarantees; it may just get added to a pool of active reads.
* Allow each file range to take a long ID. This lets you keep a map of ranges 
and helps calling apps understand which part of the read they've been called 
back on, without having to use the range itself as the index.
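
To make the last two points concrete, here is a rough sketch assuming each 
range carries a caller-chosen long ID and reads complete through 
CompletableFuture. TaggedRange and AsyncRangeReads are hypothetical names, and 
a byte array stands in for the file. Completion order is deliberately left 
unspecified: the reads just go into a pool.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical range carrying a caller-supplied ID; not a Hadoop class. */
final class TaggedRange {
  final long id;
  final long offset;
  final int length;

  TaggedRange(long id, long offset, int length) {
    this.id = id;
    this.offset = offset;
    this.length = length;
  }
}

/** Hypothetical demo: async reads whose callbacks are keyed by range ID. */
final class AsyncRangeReads {
  static Map<Long, ByteBuffer> readAll(byte[] file, List<TaggedRange> ranges) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // Callbacks run on pool threads, so the result map must be thread-safe.
    Map<Long, ByteBuffer> completed = new ConcurrentHashMap<>();
    try {
      List<CompletableFuture<Void>> pending = new ArrayList<>();
      for (TaggedRange r : ranges) {
        pending.add(CompletableFuture
            // The "read": copy the requested bytes out of the in-memory file.
            .supplyAsync(() -> ByteBuffer.wrap(Arrays.copyOfRange(
                file, (int) r.offset, (int) r.offset + r.length)), pool)
            // The callback identifies its range by ID, not by the range itself.
            .thenAccept(buf -> completed.put(r.id, buf)));
      }
      // Wait for every read; no ordering between them is promised.
      CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return completed;
  }
}
```

Whether other calls on the same stream block while these futures are live is 
exactly the open question above; this sketch sidesteps it by reading from an 
immutable array.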

We need to actually prototype use of this to see how well it works. I'd propose 
some test cases written with RxJava; [~ehiggs] has been praising that. In my 
tests of HADOOP-15229, the inability of Java streams to directly handle 
IOExceptions is a pain.
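
For context on the IOException problem: java.util.function.Function cannot 
throw checked exceptions, so IO code inside a stream pipeline has to tunnel 
them somehow. One common workaround, sketched here with the hypothetical names 
IOFunction and IOStreams, is to wrap in java.io.UncheckedIOException inside 
the lambda and unwrap at the edge of the pipeline. This is only an 
illustration of the pain, not something proposed for the API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical functional interface whose apply() may throw IOException. */
@FunctionalInterface
interface IOFunction<T, R> {
  R apply(T t) throws IOException;
}

/** Hypothetical helpers for tunnelling IOExceptions through a stream. */
final class IOStreams {
  /** Adapts an IOFunction to a Function by wrapping any IOException. */
  static <T, R> Function<T, R> uncheck(IOFunction<T, R> f) {
    return t -> {
      try {
        return f.apply(t);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    };
  }

  /** Maps every element, rethrowing the original IOException on failure. */
  static <T, R> List<R> mapAll(List<T> in, IOFunction<T, R> f) throws IOException {
    try {
      return in.stream().map(uncheck(f)).collect(Collectors.toList());
    } catch (UncheckedIOException e) {
      // UncheckedIOException.getCause() is typed as IOException.
      throw e.getCause();
    }
  }
}
```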




> FS API: Add a high-performance vectored Read to FSDataInputStream API
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-11867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11867
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Gopal V
>            Assignee: Owen O'Malley
>            Priority: Major
>              Labels: performance
>
> The most effective way to read from a filesystem efficiently is to let the 
> FileSystem implementation handle the seek behaviour underneath the API, so 
> that it can be as efficient as possible.
> A better approach to the seek problem is to provide a sequence of read 
> locations as part of a single call, while letting the system schedule/plan 
> the reads ahead of time.
> This is exceedingly useful for seek-heavy readers on HDFS, since this allows 
> for potentially optimizing away the seek-gaps within the FSDataInputStream 
> implementation.
> For seek+read systems with even more latency than locally-attached disks, 
> something like a {{readFully(long[] offsets, ByteBuffer[] chunks)}} would 
> take care of the seeks internally while reading chunk.remaining() bytes into 
> each chunk (which may be {{slice()}}ed off a bigger buffer).
> The base implementation can stub in this as a sequence of seeks + read() into 
> ByteBuffers, without forcing each FS implementation to override this in any 
> way.
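
A rough sketch of the base implementation described in the issue, assuming a 
Java 8 default method layered over a single positioned read call. 
SketchPositionedReadable is a hypothetical stand-in, not Hadoop's 
PositionedReadable, and ByteArrayReadable exists only to exercise it. The 
default method fills each chunk in turn, so an implementation gets the 
vectored call without overriding anything.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical stand-in interface; not Hadoop's PositionedReadable. */
interface SketchPositionedReadable {
  /** Reads up to length bytes at position; returns bytes read, or -1 at EOF. */
  int read(long position, byte[] buffer, int offset, int length) throws IOException;

  /** Default vectored read: one positioned read loop per (offset, chunk) pair. */
  default void readFully(long[] offsets, ByteBuffer[] chunks) throws IOException {
    if (offsets.length != chunks.length) {
      throw new IllegalArgumentException("offsets and chunks differ in length");
    }
    for (int i = 0; i < offsets.length; i++) {
      ByteBuffer chunk = chunks[i];
      long pos = offsets[i];
      // Bounded scratch buffer, since the chunk may be a direct buffer.
      byte[] tmp = new byte[Math.min(chunk.remaining(), 8192)];
      while (chunk.hasRemaining()) {
        int n = read(pos, tmp, 0, Math.min(chunk.remaining(), tmp.length));
        if (n < 0) {
          throw new EOFException("EOF at position " + pos);
        }
        chunk.put(tmp, 0, n);
        pos += n;
      }
    }
  }
}

/** In-memory implementation used only to exercise the default method. */
final class ByteArrayReadable implements SketchPositionedReadable {
  private final byte[] data;

  ByteArrayReadable(byte[] data) {
    this.data = data;
  }

  @Override
  public int read(long position, byte[] buffer, int offset, int length) {
    if (position >= data.length) {
      return -1;
    }
    int n = (int) Math.min(length, data.length - position);
    System.arraycopy(data, (int) position, buffer, offset, n);
    return n;
  }
}
```

A store with expensive seeks could override readFully to sort or merge the 
offsets first; the default here simply processes them in the order given.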



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
