[ 
https://issues.apache.org/jira/browse/HADOOP-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216619#comment-17216619
 ] 

Mukund Thakur commented on HADOOP-17296:
----------------------------------------

Hi [~snvijaya]

We would understand the runtimes better if you could answer a few questions 
about the experiment:

1) How many worker threads per Spark process?

2) What is the data format: Parquet or ORC?

3) What is the size of the TPC-DS datasets?

4) Could you share more information about Job1 to Job4? Also, if the query 
planning time can be extracted separately, it would be easier to compare the 
read times.

 

Thanks. 

 

> ABFS: Allow Random Reads to be of Buffer Size
> ---------------------------------------------
>
>                 Key: HADOOP-17296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17296
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.3.0
>            Reporter: Sneha Vijayarajan
>            Assignee: Sneha Vijayarajan
>            Priority: Major
>              Labels: abfsactive
>
> The ADLS Gen2/ABFS driver is optimized to read only the requested bytes 
> when the read pattern is random. 
> It was observed in some Spark jobs that, although the reads are random, the 
> next read does not skip far ahead and could be served by the earlier read 
> if that read had fetched a full buffer. As a result, the jobs triggered a 
> higher count of read calls (higher IOPS), which led to more IOPS throttling 
> and hence longer job runtimes.
> When these jobs were run against Gen1, which always reads a full buffer, 
> they fared well. 
> This Jira aims to give a Gen1 customer migrating to Gen2 the same overall 
> I/O pattern as Gen1 and the same performance characteristics.
> *+Stats from Customer Job:+*
>  
> |*Customer Job*|*Gen1 timing*|*Gen2 Without patch*|*Gen2 With patch and RAH=0*|
> |Job1|2 h 47 m|3 h 45 m|2 h 27 m|
> |Job2|2 h 17 m|3 h 24 m|2 h 39 m|
> |Job3|3 h 16 m|4 h 29 m|3 h 21 m|
> |Job4|1 h 59 m|3 h 12 m|2 h 28 m|
>  
> *+Stats from Internal TPC-DS runs+* 
> [Total number of TPC-DS queries per suite run = 80  
> Full suite repeat run count per config = 3]
> | |*Gen1*|*Gen2 Without patch*|*Gen2 With patch and RAH=0 (Gen2 in Gen1 config)*|*Gen2 With patch and RAH=2*|
> |%Run Duration|100|140|213|70-90|
> |%Read IOPS|100|106|98|110-115|
>  
> * Without patch = default JAR with random-read logic
> * With patch = modified JAR with the change to always read buffer size
> * RAH = ReadAheadQueueDepth
>  
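
To make the access pattern in the description above concrete, here is a 
minimal Java sketch that counts simulated remote calls for a short run of 
nearby reads followed by one distant read. It is not the ABFS driver code: 
the class, the method names, the request offsets, and the 4 MB buffer size 
are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation only; none of these names come from the ABFS driver.
class BufferedRandomReadSketch {

    /** One read request: file offset and number of bytes wanted. */
    static final class ReadRequest {
        final long offset;
        final int length;

        ReadRequest(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Counts simulated remote calls for a sequence of reads.
     * On a buffer miss, one remote call fetches max(request length, fetchSize)
     * bytes starting at the request offset. A later request that lies entirely
     * inside the fetched range is served from the buffer with no remote call.
     * fetchSize = 0 models "read only the requested bytes".
     */
    static int countRemoteCalls(List<ReadRequest> requests, int fetchSize) {
        long bufStart = 0;
        long bufEnd = 0; // exclusive; the buffer starts out empty
        int remoteCalls = 0;
        for (ReadRequest r : requests) {
            long reqEnd = r.offset + r.length;
            boolean servedFromBuffer = r.offset >= bufStart && reqEnd <= bufEnd;
            if (!servedFromBuffer) {
                remoteCalls++;
                bufStart = r.offset;
                bufEnd = r.offset + Math.max(r.length, fetchSize);
            }
        }
        return remoteCalls;
    }

    /** Assumed workload: four nearby reads, then one far jump. */
    static List<ReadRequest> sampleRequests() {
        List<ReadRequest> requests = new ArrayList<>();
        requests.add(new ReadRequest(0L, 500));
        requests.add(new ReadRequest(1_000L, 500));
        requests.add(new ReadRequest(2_500L, 500));
        requests.add(new ReadRequest(4_000L, 500));
        requests.add(new ReadRequest(10_000_000L, 500));
        return requests;
    }

    public static void main(String[] args) {
        List<ReadRequest> requests = sampleRequests();
        int assumedBufferSize = 4 * 1024 * 1024; // 4 MB, an assumption
        System.out.println("exact-length reads: " + countRemoteCalls(requests, 0));
        System.out.println("buffer-sized reads: " + countRemoteCalls(requests, assumedBufferSize));
    }
}
```

For this assumed workload the exact-length strategy issues 5 remote calls 
and the buffer-sized strategy issues 2, since the first buffer-sized fetch 
also covers the next three nearby reads. Real jobs would differ with the 
actual offsets, read lengths, and read-ahead settings.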



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
