[ 
https://issues.apache.org/jira/browse/HADOOP-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572457#comment-17572457
 ] 

Ahmar Suhail commented on HADOOP-18190:
---------------------------------------

I ran this, and the initial results are not encouraging. For a 512M object, the 
download takes 5s without prefetching and 13s with prefetching. This is because 
cloudstore bandwidth uses FSInputStream's 
[readFully|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSInputStream.java#L136]
 which calls read(long position, byte[] buffer, int offset, int length). This 
read seeks to the position (which is ok), but then seeks back to the 
original starting position at the end (so it is always seeking back to 0). This 
is pretty bad for the prefetching implementation, as it means lots of caching 
to disk and getting blocks from disk. 
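
To make the seek pattern concrete, here is a minimal sketch that imitates the call sequence described above with a toy in-memory stream. It does not use the Hadoop classes; RecordingStream, seeksFor and SeekPatternDemo are hypothetical names made up for illustration, and only the seek/read/seek-back shape is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the seek pattern; these are NOT the Hadoop classes. */
public class SeekPatternDemo {

    /** In-memory stream that records the target of every seek. */
    static class RecordingStream {
        private final byte[] data;
        private long pos = 0;
        final List<Long> seeks = new ArrayList<>();

        RecordingStream(byte[] data) { this.data = data; }

        void seek(long target) { seeks.add(target); pos = target; }

        /** Sequential read from the current position. */
        int read(byte[] buf, int off, int len) {
            if (pos >= data.length) return -1;
            int n = (int) Math.min(len, data.length - pos);
            System.arraycopy(data, (int) pos, buf, off, n);
            pos += n;
            return n;
        }

        /** Positioned read: seek to position, read, then restore the old position. */
        int read(long position, byte[] buf, int off, int len) {
            long oldPos = pos;
            try {
                seek(position);
                return read(buf, off, len);
            } finally {
                seek(oldPos);
            }
        }

        /** readFully built on the positioned read, looping until len bytes arrive. */
        void readFully(long position, byte[] buf, int off, int len) {
            int done = 0;
            while (done < len) {
                int n = read(position + done, buf, off + done, len - done);
                if (n < 0) throw new IllegalStateException("EOF before buffer was filled");
                done += n;
            }
        }
    }

    /** Reads size bytes in chunk-sized readFully calls and returns the seek targets. */
    static List<Long> seeksFor(int size, int chunk) {
        RecordingStream in = new RecordingStream(new byte[size]);
        byte[] buf = new byte[chunk];
        for (long p = 0; p < size; p += chunk) {
            in.readFully(p, buf, 0, (int) Math.min(chunk, size - p));
        }
        return in.seeks;
    }

    public static void main(String[] args) {
        // prints [0, 0, 8, 0, 16, 0, 24, 0]
        System.out.println(seeksFor(32, 8));
    }
}
```

Every second seek goes back to 0 because the stream's own position never advances, so a block-based prefetcher would keep being pulled back to the first block.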

In 
[ITestS3PrefetchingInputStream|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3PrefetchingInputStream.java#L118]
, we were using java.io.DataInputStream's readFully, which doesn't seek, so we 
didn't see this issue. Should we implement 
readFully(long position, byte[] buffer, int offset, int length) in 
prefetchingInputStream to fix this?
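
As a rough illustration of what such an override could look like, the sketch below fills the buffer from absolute offsets without calling seek at all, so the stream position is left untouched. This is only one possible shape, not the actual patch: ToyStream, readAt and DirectReadFullyDemo are hypothetical names, and the toy reads from a byte array rather than from prefetched blocks.

```java
/** Toy sketch of a seek-free readFully; these are NOT the Hadoop classes. */
public class DirectReadFullyDemo {

    static class ToyStream {
        private final byte[] data;
        private long pos = 0;
        int seekCount = 0;

        ToyStream(byte[] data) { this.data = data; }

        long getPos() { return pos; }

        void seek(long target) { seekCount++; pos = target; }

        /** Hypothetical helper: copy bytes at an absolute offset without touching pos. */
        private int readAt(long position, byte[] buf, int off, int len) {
            if (position >= data.length) return -1;
            int n = (int) Math.min(len, data.length - position);
            System.arraycopy(data, (int) position, buf, off, n);
            return n;
        }

        /** Fills buf[off..off+len) from the given position; never calls seek. */
        void readFully(long position, byte[] buf, int off, int len) {
            int done = 0;
            while (done < len) {
                int n = readAt(position + done, buf, off + done, len - done);
                if (n < 0) throw new IllegalStateException("EOF before buffer was filled");
                done += n;
            }
        }
    }

    /** Reads size bytes in chunk-sized calls; returns {seekCount, finalPos}. */
    static long[] run(int size, int chunk) {
        byte[] src = new byte[size];
        for (int i = 0; i < size; i++) src[i] = (byte) i;
        ToyStream in = new ToyStream(src);
        byte[] buf = new byte[chunk];
        for (long p = 0; p < size; p += chunk) {
            in.readFully(p, buf, 0, (int) Math.min(chunk, size - p));
            // First byte of each chunk should equal its offset in the source.
            if (buf[0] != (byte) p) throw new IllegalStateException("unexpected data at " + p);
        }
        return new long[] { in.seekCount, in.getPos() };
    }

    public static void main(String[] args) {
        long[] r = run(32, 8);
        System.out.println("seeks=" + r[0] + " pos=" + r[1]);
    }
}
```

In a real prefetching stream the readAt step would have to locate or fetch the right block, which is where the actual design work lies.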

> Collect IOStatistics during S3A prefetching 
> --------------------------------------------
>
>                 Key: HADOOP-18190
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18190
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Ahmar Suhail
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> There is a lot more happening in reads, so there's a lot more data to collect 
> and publish in IO stats for us to view in a summary at the end of processes 
> as well as get from the stream while it is active.
> Some useful ones would seem to be:
> counters
>  * is in memory. Using 0 or 1 here lets aggregation reports count the total 
> number of memory-cached files.
>  * prefetching operations executed
>  * errors during prefetching
> gauges
>  * number of blocks in cache
>  * total size of blocks
>  * active prefetches
>  * active memory used
> duration tracking count/min/max/ave
>  * time to fetch a block
>  * time queued before the actual fetch begins
>  * time a reader is blocked waiting for a block fetch to complete
> and some info on cache use itself
>  * number of blocks discarded unread
>  * number of prefetched blocks later used
>  * number of backward seeks to a prefetched block
>  * number of forward seeks to a prefetched block
> the key ones I care about are
>  # memory consumption
>  # can we determine if cache is working (reads with cache hit) and when it is 
> not (misses, wasted prefetches)
>  # time blocked on executors
> The stats need to be accessible on a stream even when closed, and aggregated 
> into the FS. Once we get per-thread stats contexts we can publish there too 
> and collect in worker threads for reporting in task commits.


