[ 
https://issues.apache.org/jira/browse/HADOOP-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-13286:
------------------------------------
    Attachment: HADOOP-13286-branch-2-001.patch

Patch 001 streams the test data through the (presumably non-native) gz codec 
and then into LineReader, simulating a mapper applied to a .csv.gz file.
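
For anyone who wants to see the shape of that pipeline without applying the patch, here is a minimal sketch. It is not the patch's code: it assumes plain JDK classes (GZIPInputStream and BufferedReader) as stand-ins for Hadoop's gz codec and LineReader, reads from an in-memory buffer rather than S3, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the scale test: JDK gzip + BufferedReader,
// not Hadoop's codec or LineReader.
public class GunzipLineCount {

    /** Gzip a string in memory; stands in for the .csv.gz test file. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip trailer is written when the stream is closed above.
        return buffer.toByteArray();
    }

    /** Decompress the stream and count the lines it contains. */
    public static long countLines(InputStream compressed) {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = gzip("id,name\n1,alpha\n2,beta\n");
        long start = System.nanoTime();
        long lines = countLines(new ByteArrayInputStream(data));
        long elapsed = System.nanoTime() - start;
        System.out.println(lines + " lines from " + data.length
            + " compressed bytes in " + elapsed + " nS");
    }
}
```

Swapping the ByteArrayInputStream for a filesystem input stream gives the same decompress-then-count structure that the test exercises.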

Timings:
{code}
testDecompression128K: Decompress with a 128K readahead

2016-06-17 16:30:42,408 [Thread-0] INFO  compress.CodecPool 
(CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
2016-06-17 16:30:47,345 [Thread-0] INFO  contract.ContractTestUtils 
(ContractTestUtils.java:end(1262)) - Duration of Time to read 514690 lines 
[99896260 bytes expanded, 22633778 raw] with readahead = 131072: 5,107,155,982 
nS
2016-06-17 16:30:47,345 [Thread-0] INFO  scale.TestS3AInputStreamPerformance 
(TestS3AInputStreamPerformance.java:logTimePerIOP(144)) - Time per IOP: 9,922 nS
2016-06-17 16:30:47,346 [Thread-0] INFO  scale.TestS3AInputStreamPerformance 
(TestS3AInputStreamPerformance.java:logStreamStatistics(301)) - Stream 
Statistics
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, 
SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, 
BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, 
BytesRead=22633778, BytesRead excluding skipped=22633778, ReadOperations=5708, 
ReadFullyOperations=0, ReadsIncomplete=243}
{code}

That is roughly 10 microseconds/line (9,922 nS); 5.1s for the entire 22.6MB 
file, which expands to 99.9MB on the way through the pipeline.
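
The per-line figure can be re-derived from the values in the log above. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the throughput calculation is an extra derived number, not something the test itself prints.

```java
// Hypothetical helper that re-derives per-line time and throughput
// from the duration, line count and byte counts in the log output.
public class ScaleTestArithmetic {

    /** Integer nanoseconds per line, matching the log's "Time per IOP". */
    public static long nanosPerLine(long durationNanos, long lines) {
        return durationNanos / lines;
    }

    /** Throughput in decimal megabytes (10^6 bytes) per second. */
    public static double megabytesPerSecond(long bytes, long durationNanos) {
        return (bytes / 1e6) / (durationNanos / 1e9);
    }

    public static void main(String[] args) {
        long durationNanos = 5_107_155_982L; // logged duration
        long lines = 514_690L;               // logged line count
        long rawBytes = 22_633_778L;         // compressed bytes read
        long expandedBytes = 99_896_260L;    // bytes after decompression

        System.out.println("nS per line: " + nanosPerLine(durationNanos, lines));
        System.out.println("raw MB/s: "
            + megabytesPerSecond(rawBytes, durationNanos));
        System.out.println("expanded MB/s: "
            + megabytesPerSecond(expandedBytes, durationNanos));
    }
}
```

The first value comes out as 9922, i.e. just under 10 microseconds per line, in agreement with the logged time per IOP.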

> add a scale test to do gunzip and linecount
> -------------------------------------------
>
>                 Key: HADOOP-13286
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13286
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13286-branch-2-001.patch
>
>
> the HADOOP-13203 patch proposal showed that there were performance problems 
> downstream which weren't surfacing in the current scale tests.
> Trying to decompress the .gz test file and then go through it with LineReader 
> models a basic use case: parse a .csv.gz data source. 
> Add this, with metric printing


