[
https://issues.apache.org/jira/browse/HADOOP-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-13286:
------------------------------------
Attachment: HADOOP-13286-branch-2-001.patch
Patch 001: streams the test data through the (presumably) non-native gz codec,
then into LineReader. This simulates a mapper applied to a .csv.gz file.
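
For readers without the patch to hand, the shape of that pipeline can be sketched with JDK classes only. This is not the code in the patch: it swaps in java.util.zip.GZIPInputStream for the Hadoop gz codec and BufferedReader.readLine() for LineReader, and the class name GunzipLineCount, its helper methods, and the bufferSize parameter are hypothetical. The buffer size here is only a local stream buffer and is not the S3A readahead setting.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the test pipeline: JDK classes only,
// not Hadoop's codec pool or LineReader.
class GunzipLineCount {

    /** Gunzip the raw stream and count its lines, using the given buffer size. */
    static long countLines(InputStream raw, int bufferSize) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new GZIPInputStream(raw, bufferSize), StandardCharsets.UTF_8),
                bufferSize)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    /** Build a small in-memory .gz payload so the sketch needs no external file. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("a,1\nb,2\nc,3\n");
        long start = System.nanoTime();
        long lines = countLines(new ByteArrayInputStream(data), 128 * 1024);
        long elapsed = System.nanoTime() - start;
        System.out.println(lines + " lines in " + elapsed + " nS");
    }
}
```

In the real test the input stream would come from the S3A filesystem rather than an in-memory array, so the timings of this sketch say nothing about S3A itself.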
Timings:
{code}
testDecompression128K: Decompress with a 128K readahead
2016-06-17 16:30:42,408 [Thread-0] INFO compress.CodecPool
(CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
2016-06-17 16:30:47,345 [Thread-0] INFO contract.ContractTestUtils
(ContractTestUtils.java:end(1262)) - Duration of Time to read 514690 lines
[99896260 bytes expanded, 22633778 raw] with readahead = 131072: 5,107,155,982
nS
2016-06-17 16:30:47,345 [Thread-0] INFO scale.TestS3AInputStreamPerformance
(TestS3AInputStreamPerformance.java:logTimePerIOP(144)) - Time per IOP: 9,922 nS
2016-06-17 16:30:47,346 [Thread-0] INFO scale.TestS3AInputStreamPerformance
(TestS3AInputStreamPerformance.java:logStreamStatistics(301)) - Stream
Statistics
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0,
SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0,
BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0,
BytesRead=22633778, BytesRead excluding skipped=22633778, ReadOperations=5708,
ReadFullyOperations=0, ReadsIncomplete=243}
{code}
that is: roughly 10 microseconds/line (9,922 nS); 5.1s for the entire 22.6MB
file, which expands to 99.9MB on the way through the pipeline
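
As a cross-check, the per-line time and throughput can be derived from the values in the log above. The sketch below is only arithmetic on those logged numbers; the class name ScaleTestMetrics and its helper methods are hypothetical and are not part of the patch.

```java
// Hypothetical helper: derives summary metrics from the values logged above.
class ScaleTestMetrics {

    /** Average time per line, in nanoseconds. */
    static double nanosPerLine(long durationNanos, long lines) {
        return (double) durationNanos / lines;
    }

    /** Throughput in megabytes (10^6 bytes) per second. */
    static double megabytesPerSecond(long bytes, long durationNanos) {
        return (bytes / 1e6) / (durationNanos / 1e9);
    }

    /** Average number of bytes returned per read() call on the stream. */
    static double bytesPerRead(long bytesRead, long readOperations) {
        return (double) bytesRead / readOperations;
    }

    public static void main(String[] args) {
        // Values copied from the log output above.
        long durationNanos = 5_107_155_982L;
        long lines = 514_690L;
        long rawBytes = 22_633_778L;
        long expandedBytes = 99_896_260L;
        long readOperations = 5_708L;

        System.out.printf("nS per line:        %.0f%n",
                nanosPerLine(durationNanos, lines));
        System.out.printf("raw MB/s:           %.2f%n",
                megabytesPerSecond(rawBytes, durationNanos));
        System.out.printf("expanded MB/s:      %.2f%n",
                megabytesPerSecond(expandedBytes, durationNanos));
        System.out.printf("avg bytes per read: %.0f%n",
                bytesPerRead(rawBytes, readOperations));
    }
}
```

The per-line figure agrees with the "Time per IOP" value in the log, which suggests the test counts each line as one operation.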
> add a scale test to do gunzip and linecount
> -------------------------------------------
>
> Key: HADOOP-13286
> URL: https://issues.apache.org/jira/browse/HADOOP-13286
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13286-branch-2-001.patch
>
>
> The HADOOP-13203 patch proposal showed that there were performance problems
> downstream which weren't surfacing in the current scale tests.
> Decompressing the .gz test file and then reading it through LineReader
> models a basic use case: parsing a .csv.gz data source.
> Add this test, with metrics printed.