[
https://issues.apache.org/jira/browse/HADOOP-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825610#comment-16825610
]
Sahil Takiar commented on HADOOP-16241:
---------------------------------------
Thanks for taking a look a this [[email protected]], always appreciate any
feedback. I'm out of town at the moment, so won't be able to respond in full,
but I have the Impala traces. I set the S3A log level to DEBUG, which includes
logging of the stream stats when the file stream is closed.
The file {{Impala-web_returns-scan-logs.txt}} contains the impalad logs a full
table scan of a 10 TB Parquet dataset on S3. The table is partitioned and is
58.1 GB. The actual files are anywhere from less than 1 MB to ~20 MB.
The file {{Impala-store_returns-scan-logs.txt}} contains the impalad logs a
full table scan of a 10 TB Parquet dataset on S3. The table is partitioned and
is 167 GB. The actual files are anywhere from less than 1 MB to ~85 MB.
Some background on how Impala scans data from S3. Impala doesn't perform any
backwards seeks (the only time it ever seeks is after it opens a file), which
is why {{BackwardSeekOperations}} is always 0. All scans (including footer
scans) are done on a dedicated file handle and then the file handle is closed.
I think the fundamental issue with how Impala currently scans data is since
fadvise = NORMAL by default, and no backwards seeks are performed, the switch
to fadvise = RANDOM never happens for Parquet. So each column chunk scan
essentially requires opening an S3 file with the full content range requested,
which eventually causes the HTTP connection to get reset when the file handle
is closed.
> S3AInputStream PositionReadable should perform ranged read on dedicated
> stream
> -------------------------------------------------------------------------------
>
> Key: HADOOP-16241
> URL: https://issues.apache.org/jira/browse/HADOOP-16241
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
>
> The current implementation of {{PositionReadable}} in {{S3AInputStream}} is
> pretty close to the default implementation in {{FsInputStream}}.
> This JIRA proposes overriding the {{read(long position, byte[] buffer, int
> offset, int length)}} method and re-implementing the {{readFully(long
> position, byte[] buffer, int offset, int length)}} method in S3A.
> The new implementation would perform a "ranged read" on a dedicated object
> stream (rather than the shared one). Prototypes have shown this to bring a
> considerable performance improvement to readers who are only interested in
> reading a random chunk of the file at a time (e.g. Impala, although I would
> assume HBase would benefit from this as well).
> Setting {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} is helpful for
> clients that rely on pread, but has a few drawbacks:
> * Unless the client explicitly sets fadvise to RANDOM, they will get at
> least one connection reset when the backwards seek is issued (after which
> fadvise automatically switches to RANDOM)
> * Data is only read in 64 kb chunks, so for a large read, several GET
> requests must be issued to S3 to fetch the data; while the 64 kb chunk value
> is configurable, it is hard to set a reasonable value for variable length
> preads
> * If the readahead value is too big, closing the input stream can take
> considerable time because the stream has to be drained of data before it can
> be closed
> The new implementation of {{PositionReadable}} would issue a
> {{GetObjectRequest}} with the range specified by {{position}} and the size of
> the given buffer. The data would be read from the {{S3ObjectInputStream}} and
> then closed at the end of the method. This stream would be independent of the
> {{wrappedStream}} currently maintained by S3A.
> This brings the following benefits:
> * The {{PositionedReadable}} methods can be thread-safe without a
> {{synchronized}} block, which allows clients to concurrently call pread
> methods on the same {{S3AInputStream}} instance
> * preads will request all the data at once rather than requesting it in
> chunks via the readahead logic
> * Avoids performing potentially expensive seeks when performing preads
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]