[GitHub] [spark] steveloughran commented on pull request #30135: [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1

GitBox Fri, 02 Jul 2021 05:53:29 -0700


steveloughran commented on pull request #30135:
URL: https://github.com/apache/spark/pull/30135#issuecomment-872974884



   > To my surprise the read is slower(with same resource and same config) in 
Hadoop 3.3.1 than Hadoop 3.2.0 without the mentioned issue. It is possible I am 
missing something.
   
   Shouldn't happen. Really shouldn't happen. We do not see that on our TPC-DS benchmarks.
   
   The main way I could see this happening is if the seek policy hasn't 
switched to random on the first backwards seek. Explicitly set it.
   
   ```
   spark.hadoop.fs.s3a.experimental.input.fadvise random
   ```
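
If the option is passed on the command line rather than in `spark-defaults.conf`, the same setting can be sketched as below. This is only an illustration: Spark copies any `spark.hadoop.*` property into the Hadoop configuration with the prefix stripped, and the S3A documentation names the full key as `fs.s3a.experimental.input.fadvise`. The helper `spark_submit_conf_args` is a hypothetical name, not part of Spark or Hadoop.

```python
from typing import Dict, List

# Spark forwards "spark.hadoop.<key>" to the Hadoop configuration as "<key>".
SPARK_HADOOP_PREFIX = "spark.hadoop."


def spark_submit_conf_args(hadoop_options: Dict[str, str]) -> List[str]:
    """Build spark-submit --conf arguments from Hadoop options (hypothetical helper)."""
    args: List[str] = []
    for key, value in sorted(hadoop_options.items()):
        args += ["--conf", f"{SPARK_HADOOP_PREFIX}{key}={value}"]
    return args


# Force the S3A input stream into random-IO mode from the start.
s3a_options = {"fs.s3a.experimental.input.fadvise": "random"}
print(" ".join(spark_submit_conf_args(s3a_options)))
```

The printed arguments can be appended to a `spark-submit` invocation as-is.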
   
   
   Hadoop 3.3.1 has a stats collection API (IOStatistics) for filesystems, streams, etc.
   * Call toString() on a stream to get its stats, including the number of bytes discarded in seeks and the number of streams aborted.
   * Do the same for the FS to get the aggregate stats.
   
   
   High counts of bytes discarded and aborts are signs of a bad seek policy.
   
   Set these two logs to debug and see what they say.
   ```
   org.apache.hadoop.fs.s3a.S3AInputStream
   org.apache.hadoop.fs.s3a.S3AStorageStatistics
   ```
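
As a rough sketch of what that looks like in practice, assuming the Spark build still reads a log4j 1.x `conf/log4j.properties` file (log4j 2 uses a different syntax), the lines to add can be generated like this. The helper `log4j1_debug_lines` is a hypothetical name used only for this example.

```python
from typing import Iterable, List

# The two S3A loggers named above.
S3A_DEBUG_LOGGERS = (
    "org.apache.hadoop.fs.s3a.S3AInputStream",
    "org.apache.hadoop.fs.s3a.S3AStorageStatistics",
)


def log4j1_debug_lines(loggers: Iterable[str]) -> List[str]:
    """Return log4j 1.x properties lines that set each logger to DEBUG (hypothetical helper)."""
    return [f"log4j.logger.{name}=DEBUG" for name in loggers]


# Print the lines to paste into conf/log4j.properties (assumed location).
print("\n".join(log4j1_debug_lines(S3A_DEBUG_LOGGERS)))
```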
   
   
   


