[ 
https://issues.apache.org/jira/browse/HADOOP-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825610#comment-16825610
 ] 

Sahil Takiar commented on HADOOP-16241:
---------------------------------------

Thanks for taking a look at this [[email protected]], always appreciate any 
feedback. I'm out of town at the moment, so won't be able to respond in full, 
but I have the Impala traces. I set the S3A log level to DEBUG, which includes 
logging of the stream stats when the file stream is closed.

The file {{Impala-web_returns-scan-logs.txt}} contains the impalad logs for a full 
table scan of a 10 TB Parquet dataset on S3. The table is partitioned and is 
58.1 GB. The actual files are anywhere from less than 1 MB to ~20 MB.

The file {{Impala-store_returns-scan-logs.txt}} contains the impalad logs for a 
full table scan of a 10 TB Parquet dataset on S3. The table is partitioned and 
is 167 GB. The actual files are anywhere from less than 1 MB to ~85 MB.

Some background on how Impala scans data from S3. Impala doesn't perform any 
backwards seeks (the only time it ever seeks is after it opens a file), which 
is why {{BackwardSeekOperations}} is always 0. All scans (including footer 
scans) are done on a dedicated file handle and then the file handle is closed.

I think the fundamental issue with how Impala currently scans data is that, 
since fadvise = NORMAL by default and no backwards seeks are performed, the 
switch to fadvise = RANDOM never happens for Parquet. So each column chunk scan 
essentially requires opening an S3 file with the full content range requested, 
which eventually causes the HTTP connection to get reset when the file handle 
is closed.
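
To make the cost concrete, here is a rough sketch of how the requested range 
differs between the policies. It is a simplified model and not the actual 
{{S3AInputStream}} code: the class name, the {{rangeEnd}} helper, and the 1 MB 
column chunk size are all assumptions made for illustration. The 64 KB 
readahead matches the default chunk size mentioned in the issue description.

```java
// Simplified model of the GET range opened for a read; NOT the real S3A code.
final class RequestRangeSketch {
    enum Policy { NORMAL, SEQUENTIAL, RANDOM }

    /**
     * Exclusive end offset of the range requested when a stream is opened at
     * pos to serve a read of length bytes.
     */
    static long rangeEnd(Policy policy, long pos, long length,
                         long contentLength, long readahead) {
        long end;
        if (policy == Policy.RANDOM) {
            // Random IO: ask only for this read, or the readahead if larger.
            end = pos + Math.max(readahead, length);
        } else {
            // NORMAL behaves like SEQUENTIAL until a backwards seek happens:
            // the request runs to the end of the object.
            end = contentLength;
        }
        // Never request past the end of the object.
        return Math.min(end, contentLength);
    }

    public static void main(String[] args) {
        long fileSize = 20L * 1024 * 1024; // ~20 MB file, as in the web_returns scan
        long chunk = 1024L * 1024;         // hypothetical 1 MB column chunk
        long readahead = 64L * 1024;       // 64 KB readahead
        long pos = 4L * 1024 * 1024;       // hypothetical chunk offset

        for (Policy p : Policy.values()) {
            long end = rangeEnd(p, pos, chunk, fileSize, readahead);
            long unread = end - pos - chunk;
            System.out.println(p + ": requested " + (end - pos)
                + " bytes, unread when the handle is closed: " + unread);
        }
    }
}
```

Under this model, every byte requested but not consumed has to be either 
drained or discarded by resetting the connection when the handle is closed.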

> S3AInputStream PositionReadable should perform ranged read on dedicated 
> stream 
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-16241
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16241
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> The current implementation of {{PositionedReadable}} in {{S3AInputStream}} is 
> pretty close to the default implementation in {{FSInputStream}}.
> This JIRA proposes overriding the {{read(long position, byte[] buffer, int 
> offset, int length)}} method and re-implementing the {{readFully(long 
> position, byte[] buffer, int offset, int length)}} method in S3A.
> The new implementation would perform a "ranged read" on a dedicated object 
> stream (rather than the shared one). Prototypes have shown this to bring a 
> considerable performance improvement to readers who are only interested in 
> reading a random chunk of the file at a time (e.g. Impala, although I would 
> assume HBase would benefit from this as well).
> Setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} is helpful for 
> clients that rely on pread, but has a few drawbacks:
>  * Unless the client explicitly sets fadvise to RANDOM, they will get at 
> least one connection reset when the backwards seek is issued (after which 
> fadvise automatically switches to RANDOM)
>  * Data is only read in 64 KB chunks, so for a large read, several GET 
> requests must be issued to S3 to fetch the data; while the 64 KB chunk value 
> is configurable, it is hard to set a reasonable value for variable length 
> preads
>  * If the readahead value is too big, closing the input stream can take 
> considerable time because the stream has to be drained of data before it can 
> be closed
> The new implementation of {{PositionedReadable}} would issue a 
> {{GetObjectRequest}} with the range specified by {{position}} and the size of 
> the given buffer. The data would be read from the {{S3ObjectInputStream}}, 
> which would then be closed at the end of the method. This stream would be 
> independent of the {{wrappedStream}} currently maintained by S3A.
> This brings the following benefits:
>  * The {{PositionedReadable}} methods can be thread-safe without a 
> {{synchronized}} block, which allows clients to concurrently call pread 
> methods on the same {{S3AInputStream}} instance
>  * preads will request all the data at once rather than requesting it in 
> chunks via the readahead logic
>  * Avoids performing potentially expensive seeks when performing preads
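
The proposal quoted above could look roughly like the following. This is a 
minimal sketch under stated assumptions, not the actual patch: 
{{DedicatedPreadSketch}} and the {{RangeOpener}} interface are hypothetical 
names, and the in-memory opener only stands in for a ranged 
{{GetObjectRequest}} against S3. The point it illustrates is that each call 
opens, reads, and closes its own stream, so no shared stream state is touched 
and no {{synchronized}} block is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class DedicatedPreadSketch {
    /** Hypothetical stand-in for a ranged GET: opens a stream over [start, end). */
    interface RangeOpener {
        InputStream open(long start, long end) throws IOException;
    }

    private final RangeOpener opener;
    private final long contentLength;

    DedicatedPreadSketch(RangeOpener opener, long contentLength) {
        this.opener = opener;
        this.contentLength = contentLength;
    }

    /** Reads exactly length bytes at position using a stream private to this call. */
    void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException {
        if (offset < 0 || length < 0 || length > buffer.length - offset) {
            throw new IndexOutOfBoundsException("Invalid offset/length for buffer");
        }
        if (position < 0 || position + length > contentLength) {
            throw new EOFException("Requested range is outside the object");
        }
        if (length == 0) {
            return;
        }
        // One request covering the whole read; the stream is closed on exit.
        try (InputStream in = opener.open(position, position + length)) {
            int done = 0;
            while (done < length) {
                int n = in.read(buffer, offset + done, length - done);
                if (n < 0) {
                    throw new EOFException("Stream ended after " + done + " bytes");
                }
                done += n;
            }
        }
    }

    /** In-memory opener for demonstration; a real one would call the object store. */
    static RangeOpener inMemory(byte[] object) {
        return (start, end) ->
            new ByteArrayInputStream(object, (int) start, (int) (end - start));
    }
}
```

Because the request covers exactly the bytes asked for, the stream is fully 
consumed by the time it is closed, so there is nothing left to drain.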



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)