[ https://issues.apache.org/jira/browse/HADOOP-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829361#comment-16829361 ]

Sahil Takiar commented on HADOOP-16241:
---------------------------------------

Responding to the rest of Steve's comments:

{quote}
The current implementation of PositionedReadable in S3AInputStream is pretty 
close to the default implementation in FSInputStream.
{quote}

I think I was referring to {{read(long position, byte[] buffer, int offset, int 
length)}}, which is not overridden in {{S3AInputStream}} (I know it does 
override {{readFully(long position, ...)}}). Regardless, the part I left out is 
that the implementations of {{readFully(long position, ...)}} in 
{{S3AInputStream}} and {{read(long position, ...)}} in {{FSInputStream}} have 
different semantics from {{DFSInputStream}}. I know the semantics can't match 
up completely given the nature of HDFS vs. S3, but the proposal here is to 
align the two w.r.t. the following behavior: the implementation of 
{{PositionedReadable}} in {{S3AInputStream}} uses a synchronized block, which 
hurts throughput if applications share a file handle across threads; 
{{DFSInputStream}}, on the other hand, uses a dedicated {{BlockReader}} for 
{{PositionedReadable}} methods, so issuing parallel preads against a single 
HDFS file handle won't get throttled by a Java lock.
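
To make the contrast concrete, here is a minimal, self-contained sketch (plain 
NIO against a local file, not Hadoop code) of a positional read that needs no 
shared lock: {{FileChannel.read(ByteBuffer, long)}} leaves the channel's 
position untouched, so multiple threads can pread the same handle 
concurrently, which is the property the proposal wants for {{S3AInputStream}}.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PreadDemo {
    // Positional read: no shared stream cursor is touched, so no
    // synchronized block is needed and concurrent preads on the same
    // channel do not serialize on a Java lock.
    static int pread(FileChannel ch, long position, byte[] buffer,
                     int offset, int length) throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(buffer, offset, length);
        return ch.read(bb, position); // does not move the channel's position
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pread", ".bin");
        Files.write(tmp, "0123456789".getBytes());
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            byte[] buf = new byte[4];
            int n = pread(ch, 3, buf, 0, 4);
            System.out.println(n + " " + new String(buf)); // prints: 4 3456
        }
        Files.delete(tmp);
    }
}
```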

{quote}
* Data is only read in 64 KB chunks, so for a large read, several GET requests 
must be issued to S3 to fetch the data; while the 64 KB chunk value is 
configurable, it is hard to set a reasonable value for variable-length preads
* If the readahead value is too big, closing the input stream can take 
considerable time because the stream has to be drained of data before it can be 
closed
{quote}

Agree it's generally low cost. The issue I saw was that for small files, the 
readahead value causes most read requests to read until the end of the file 
unnecessarily. Setting it to a smaller value is possible, but then apps would 
have to build some type of logic to dynamically tune it based on the size of 
the read request plus the file size. I'm not proposing removing the readahead 
logic completely, just for the pread / fadvise = RANDOM path, where it might 
be less useful. However, given it doesn't introduce much overhead, it's not a 
major issue.

I think the fundamental issue with Impala is that the current behavior does a 
bunch of reads from random offsets (each read is on either (1) a new file 
handle, or (2) an unbuffered file handle). Setting fadvise = RANDOM does fix 
the issue, so if we don't think implementing this is worth it, that is fine 
with me. I have a prototype I've been working on, and it requires a good 
amount of refactoring of {{S3AInputStream}}.
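
For reference, the core of the dedicated-stream approach is just an HTTP 
ranged GET per pread. The sketch below (the helper name is mine) shows the 
inclusive byte range such a request would carry, which is what the AWS SDK's 
{{GetObjectRequest.withRange(start, end)}} puts on the wire:

```java
public class RangedPread {
    /**
     * Sketch of the range a dedicated pread GET would request. HTTP Range
     * bounds are inclusive, so reading {@code length} bytes at
     * {@code position} asks for bytes position..position+length-1.
     */
    static String rangeHeader(long position, int length) {
        if (position < 0 || length <= 0) {
            throw new IllegalArgumentException("invalid range");
        }
        return "bytes=" + position + "-" + (position + length - 1);
    }

    public static void main(String[] args) {
        System.out.println(rangeHeader(4096, 1024)); // prints: bytes=4096-5119
    }
}
```

Each pread would open its own {{S3ObjectInputStream}} for exactly this range 
and close it at the end of the call, independent of the shared 
{{wrappedStream}}.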

Impala does not currently use the same file handle across multiple threads; 
that is a future optimization that first requires switching Impala over to 
preads (IMPALA-5212).

So in summary: (1) Impala traces + flamegraphs are attached, (2) setting 
fadvise = RANDOM solves most of the existing perf issues, (3) Impala currently 
does not use the same file handle for preads (even for HDFS).

> S3AInputStream PositionReadable should perform ranged read on dedicated 
> stream 
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-16241
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16241
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: Impala-TPCDS-scans.zip, 
> impala-web_returns-scan-flamegraph.svg
>
>
> The current implementation of {{PositionedReadable}} in {{S3AInputStream}} is 
> pretty close to the default implementation in {{FSInputStream}}.
> This JIRA proposes overriding the {{read(long position, byte[] buffer, int 
> offset, int length)}} method and re-implementing the {{readFully(long 
> position, byte[] buffer, int offset, int length)}} method in S3A.
> The new implementation would perform a "ranged read" on a dedicated object 
> stream (rather than the shared one). Prototypes have shown this to bring a 
> considerable performance improvement to readers who are only interested in 
> reading a random chunk of the file at a time (e.g. Impala, although I would 
> assume HBase would benefit from this as well).
> Setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} is helpful for 
> clients that rely on pread, but has a few drawbacks:
>  * Unless the client explicitly sets fadvise to RANDOM, they will get at 
> least one connection reset when the backwards seek is issued (after which 
> fadvise automatically switches to RANDOM)
>  * Data is only read in 64 KB chunks, so for a large read, several GET 
> requests must be issued to S3 to fetch the data; while the 64 KB chunk value 
> is configurable, it is hard to set a reasonable value for variable-length 
> preads
>  * If the readahead value is too big, closing the input stream can take 
> considerable time because the stream has to be drained of data before it can 
> be closed
> The new implementation of {{PositionedReadable}} would issue a 
> {{GetObjectRequest}} with the range specified by {{position}} and the size of 
> the given buffer. The data would be read from the {{S3ObjectInputStream}} and 
> then closed at the end of the method. This stream would be independent of the 
> {{wrappedStream}} currently maintained by S3A.
> This brings the following benefits:
>  * The {{PositionedReadable}} methods can be thread-safe without a 
> {{synchronized}} block, which allows clients to concurrently call pread 
> methods on the same {{S3AInputStream}} instance
>  * preads will request all the data at once rather than requesting it in 
> chunks via the readahead logic
>  * Avoids performing potentially expensive seeks when performing preads



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
