[
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated HADOOP-13203:
--------------------------------------
Attachment: stream_stats.tar.gz
HADOOP-13203-branch-2-004.patch
There is a corner case in which closing the stream should use
{{requestedStreamLen}} instead of {{contentLength}} to avoid a connection abort.
This would show up when long-running services in the cluster hit this code
path. Fixed in the latest patch.
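To illustrate the corner case, here is a minimal sketch of the close-time decision. It is not the actual S3AInputStream code: the class name, the {{shouldAbort}} helper, the threshold value of 4096 bytes, and the sample numbers are all assumptions made for illustration.

```java
/** Sketch only: a hypothetical helper, not the actual S3AInputStream code. */
final class CloseDecisionSketch {

    /** Assumed threshold in bytes; the value used by S3AInputStream may differ. */
    static final long CLOSE_THRESHOLD = 4096;

    /**
     * Decide whether to abort the HTTP connection instead of draining it.
     *
     * @param pos      current read position in the object
     * @param rangeEnd end of the byte range the open request asked for
     * @return true if more bytes remain than are worth draining
     */
    static boolean shouldAbort(long pos, long rangeEnd) {
        long remaining = rangeEnd - pos;
        return remaining > CLOSE_THRESHOLD;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: an 8 MB object, opened for a range that ends
        // 1 KB past the current position.
        long contentLength = 8L * 1024 * 1024;
        long pos = 1024L * 1024;
        long requestedStreamLen = pos + 1024;

        // Using contentLength overestimates what is left on the wire.
        System.out.println("abort (contentLength):      " + shouldAbort(pos, contentLength));
        // Using requestedStreamLen reflects the bytes actually outstanding.
        System.out.println("abort (requestedStreamLen): " + shouldAbort(pos, requestedStreamLen));
    }
}
```

With these sample numbers the first check asks for an abort and the second does not, which is the difference the patch is after: the connection can be drained and reused when only a few bytes of the requested range are outstanding.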
Also, I collected the stream access profiles for a couple of TPC-DS and TPC-H
queries by printing the stream statistics during close on the cluster where I
tested the patch. Attaching those logs herewith. Please note that this was done
with the ORC data format, where the reader first reads the footer and then
starts reading the stripe information.
1. In TPC-DS most of the files are small, so they end up with a single backward
seek during file reading, i.e. the reader reads the postscript/footer/metadata
as the first operation and then seeks backwards to read the data portion of the
file. Without the patch, the connection would be aborted because the difference
between the file length and the current position would be much higher than
CLOSE_THRESHOLD.
Example log:
{noformat}
2016-06-15 09:00:31,546 [INFO] [TezChild] |s3a.S3AFileSystem|:
S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450967/000456_0
pos=4162453 nextReadPos=4162453 contentLength=7630589
StreamStatistics{OpenOperations=4, CloseOperations=4, Closed=4, Aborted=0,
SeekOperations=3, ReadExceptions=0, ForwardSeekOperations=2,
BackwardSeekOperations=1, BytesSkippedOnSeek=5963,
BytesBackwardsOnSeek=7629525, BytesRead=740946, BytesRead excluding
skipped=734983, ReadOperations=91, ReadFullyOperations=0, ReadsIncomplete=85}}
{noformat}
There are also file accesses without any backward seeks, in which the reader
reads the standard 16 KB to get the footer details and closes the file without
any additional reads.
Example log:
{noformat}
2016-06-15 09:00:28,590 [INFO] [TezChild] |s3a.S3AFileSystem|:
S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450993/000213_0
pos=7549954 nextReadPos=7549954 contentLength=7549954
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0,
SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0,
BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0,
BytesRead=16384, BytesRead excluding skipped=16384, ReadOperations=1,
ReadFullyOperations=0, ReadsIncomplete=0}}
{noformat}
2. The TPC-H dataset contains relatively large files (e.g. each file in the
lineitem dataset is around 1 GB in the overall 1 TB TPC-H dataset). In such
cases, an equal number of forward seeks and backward seeks happen (e.g. around
24 of each per file in the log). The patch avoids connection aborts on
backward seeks.
Example log:
{noformat}
2016-06-15 09:26:26,671 [INFO] [TezChild] |s3a.S3AFileSystem|:
S3AInputStream{s3a://xyz/tpch_flat_orc_1000.db/lineitem/000041_0 pos=728756230
nextReadPos=728756230 contentLength=739566852
StreamStatistics{OpenOperations=72, CloseOperations=72, Closed=72, Aborted=0,
SeekOperations=48, ReadExceptions=0, ForwardSeekOperations=24,
BackwardSeekOperations=24, BytesSkippedOnSeek=167662,
BytesBackwardsOnSeek=737556392, BytesRead=244894978, BytesRead excluding
skipped=244727316, ReadOperations=28457, ReadFullyOperations=0,
ReadsIncomplete=28217}}
{noformat}
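As a quick sanity check on the counters, the logs above are consistent with "BytesRead excluding skipped" being BytesRead minus BytesSkippedOnSeek, and with no close operation ending in an abort. The sketch below redoes that arithmetic for the TPC-H lineitem log; the class and method names are hypothetical, and the relation between the counters is inferred from the logged values rather than taken from the implementation.

```java
/** Sketch only: hypothetical helper for sanity-checking logged StreamStatistics. */
final class StreamStatsCheck {

    /** Bytes delivered to the caller, assuming skipped bytes are counted in BytesRead. */
    static long bytesReadExcludingSkipped(long bytesRead, long bytesSkippedOnSeek) {
        return bytesRead - bytesSkippedOnSeek;
    }

    /** Fraction of close operations that aborted the connection. */
    static double abortFraction(long aborted, long closeOperations) {
        return closeOperations == 0 ? 0.0 : (double) aborted / closeOperations;
    }

    public static void main(String[] args) {
        // Counters copied from the TPC-H lineitem log above.
        long bytesRead = 244894978L;
        long bytesSkippedOnSeek = 167662L;
        long aborted = 0L;
        long closeOperations = 72L;

        System.out.println("BytesRead excluding skipped = "
                + bytesReadExcludingSkipped(bytesRead, bytesSkippedOnSeek));
        System.out.println("Fraction of closes aborted  = "
                + abortFraction(aborted, closeOperations));
    }
}
```

The same subtraction on the first TPC-DS log (740946 and 5963) gives the logged value of 734983.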
> S3a: Consider reducing the number of connection aborts by setting correct
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
> Key: HADOOP-13203
> URL: https://issues.apache.org/jira/browse/HADOOP-13203
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13203-branch-2-001.patch,
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch,
> HADOOP-13203-branch-2-004.patch, stream_stats.tar.gz
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when
> invoking S3AInputStream::reopen(). As part of lazySeek(), the stream sometimes
> has to be closed and reopened. But in many cases the stream is closed
> with abort(), leaving the internal HTTP connection unusable. This incurs
> significant connection establishment cost in some jobs. It would be good to set
> the correct value for the stream length to avoid connection aborts.
> I will post the patch once the AWS tests pass on my machine.