[
https://issues.apache.org/jira/browse/HADOOP-11570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316694#comment-14316694
]
Steve Loughran commented on HADOOP-11570:
-----------------------------------------
OK. Would there be any benefit in having the choice of action move from {{pos
== contentLength}} to {{(contentLength - pos) <= threshold}}, with some small
threshold like 1-4K? That way, it'd be cleaner to close files near the end of
the stream.
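For illustration only, the threshold check could look something like this inside {{close()}}. This is a sketch, not the patch: {{CLOSE_THRESHOLD}} is a made-up constant, and {{wrappedStream}}, {{pos}}, {{contentLength}} and {{closed}} are assumed to be the existing S3AInputStream fields.
{code:java}
// Sketch only: assumes S3AInputStream's existing wrappedStream (an
// S3ObjectInputStream), pos, contentLength and closed fields.
// CLOSE_THRESHOLD is a hypothetical constant, e.g. 4 KB.
private static final long CLOSE_THRESHOLD = 4 * 1024;

@Override
public synchronized void close() throws IOException {
  if (!closed) {
    closed = true;
    if (wrappedStream != null) {
      long remaining = contentLength - pos;
      if (remaining <= CLOSE_THRESHOLD) {
        // Near the end of the object: drain the few remaining bytes so the
        // HTTP connection can be returned to the pool and reused.
        wrappedStream.close();
      } else {
        // Far from the end: abort the underlying request so the client
        // doesn't download the rest of the object only to discard it.
        wrappedStream.abort();
      }
      wrappedStream = null;
    }
  }
}
{code}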
I don't have enough stats on single-JVM read operations to know whether that
would help. Making forward seek() operations more efficient is more critical, as
the general sequence of an analytics read of a column-structured format (ORC)
is:
# open the blob
# seek to the start of the "block"/allocated subset of data
# read through, skipping regions that don't contain columns or ranges of interest
# stop at the end of the allocated data subset
# close the stream
This patch will address the stream close operation.
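For reference, that read pattern looks roughly like this against the FileSystem API. Illustrative only: the path, stripe offsets and buffer size are placeholders, not taken from a real workload.
{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ColumnarReadPattern {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // 1. open the blob
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), conf);
    try (FSDataInputStream in = fs.open(new Path("s3a://bucket/data.orc"))) {
      long stripeStart = 128L * 1024 * 1024;  // placeholder stripe offsets
      long stripeEnd = 192L * 1024 * 1024;
      // 2. seek to the start of the allocated subset
      in.seek(stripeStart);
      byte[] buf = new byte[64 * 1024];
      long pos = stripeStart;
      // 3./4. read through to the end of the subset, skipping as needed
      while (pos < stripeEnd) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, stripeEnd - pos));
        if (n < 0) {
          break;
        }
        pos += n;
        // ... decode the columns of interest; seek() past unwanted ranges ...
      }
    } // 5. close the stream -- the step whose cost this issue is about
  }
}
{code}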
> S3AInputStream.close() downloads the remaining bytes of the object from S3
> --------------------------------------------------------------------------
>
> Key: HADOOP-11570
> URL: https://issues.apache.org/jira/browse/HADOOP-11570
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.6.0
> Reporter: Dan Hecht
> Attachments: HADOOP-11570-001.patch
>
>
> Currently, S3AInputStream.close() calls S3Object.close(). But,
> S3Object.close() will read the remaining bytes of the S3 object, potentially
> transferring a lot of bytes from S3 that are discarded. Instead, the wrapped
> stream should be aborted to avoid transferring discarded bytes (unless the
> preceding read() finished at contentLength). For example, reading only the
> first byte of a 1 GB object and then closing the stream will result in all 1
> GB transferred from S3.