steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963
 
 
   >  it seems like the sweet spot is somewhere between 20-30 threads (for my 
environment anyways, 2015 macbook pro, i7/w 8 cores).
   
   Interesting. You may get different numbers running in EC2, so it's always 
best to benchmark perf there. Remote dev amplifies some performance issues (the 
cost of reopening an HTTP connection, general latency) while hiding others (how 
easily Spark jobs can overload S3 shards and get throttled, which causes delays, 
triggers speculative task execution, leads to more throttling, and so on).
   
   Try changing "fs.s3a.connection.maximum" from the default of 48 to something 
bigger. That's the limit on the HTTP connection pool size. It's kept small to 
stop a single s3a instance from overloading the system, but you may want to 
consider raising it. There's also "fs.s3a.max.total.tasks", which controls the 
thread pool size used for background writing of blocks of large files and, in 
hadoop trunk, for parallel delete/rename operations, plus stuff in the AWS SDK 
itself.
   
   * "fs.s3a.connection.maximum" should be greater than "fs.s3a.max.total.tasks"
   * Raise "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those 
connections around for longer (avoids that HTTPS overhead)
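
As a rough sketch of the two bullets above, the following Python helper assembles the matching `spark.hadoop.*` entries and rejects a combination where the connection pool is not larger than the task limit. The function names and the values 96 and 32 are illustrative assumptions, not recommendations from this thread; only the three property names and the 60 to 300 keepalive change come from the comment.

```python
def s3a_spark_conf(connection_maximum=96, max_total_tasks=32, keepalive_time=300):
    """Build spark.hadoop.* entries for the S3A options discussed above.

    The defaults here are placeholder values for illustration only;
    benchmark in your own environment before settling on numbers.
    """
    # The comment's first bullet: the HTTP pool should exceed the task limit.
    if connection_maximum <= max_total_tasks:
        raise ValueError(
            "fs.s3a.connection.maximum should be greater than "
            "fs.s3a.max.total.tasks"
        )
    return {
        "spark.hadoop.fs.s3a.connection.maximum": str(connection_maximum),
        "spark.hadoop.fs.s3a.max.total.tasks": str(max_total_tasks),
        "spark.hadoop.fs.s3a.threads.keepalivetime": str(keepalive_time),
    }


def to_spark_defaults(conf):
    """Render the entries as whitespace-separated spark-defaults.conf lines."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(to_spark_defaults(s3a_spark_conf()))
```

The same key/value pairs could instead be passed as `--conf` options to `spark-submit`.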
   
   Try with some bigger numbers and see if you get the same results. Your 
scanning threads may just be blocking on the HTTP connection pool.
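
To illustrate why adding scanning threads can stop helping, here is a toy Python model (not S3A code) in which a semaphore stands in for the HTTP connection pool. The function name, the sleep-based fake request, and all numbers are assumptions made for the sketch; it only shows that in-flight requests are capped by the smaller of the thread count and the pool size.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrent_requests(scan_threads, pool_size, requests=60, latency=0.001):
    """Return the highest number of simulated requests in flight at once."""
    pool = threading.BoundedSemaphore(pool_size)  # stand-in for the HTTP pool
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def fake_request(_):
        nonlocal in_flight, peak
        with pool:  # threads queue here once the pool is exhausted
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(latency)  # pretend to wait on the remote store
            with lock:
                in_flight -= 1

    with ThreadPoolExecutor(max_workers=scan_threads) as executor:
        list(executor.map(fake_request, range(requests)))
    return peak


if __name__ == "__main__":
    # 30 scanning threads against a pool of 8: never more than 8 in flight.
    print(peak_concurrent_requests(scan_threads=30, pool_size=8))
```

In this model, going from 20 to 30 threads changes nothing once the pool is the bottleneck, which is one possible reading of a plateau in the benchmark numbers.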
   
   For bonus fun, force random IO for ORC/Parquet perf; with remote reads, 
though, set the minimum block to read to 256K or bigger:
   
   ```
   spark.hadoop.fs.s3a.readahead.range 256K
   spark.hadoop.fs.s3a.experimental.input.fadvise random
   ```
   
   Note: Java 8's default SSL encryption is underperformant. We've been doing 
work there, but it's too early to think about backporting it. I'm planning to do 
a refresh of the s3a connector for hadoop 3.2.2, which should include it 
(https://github.com/apache/hadoop/pull/970).
   For now, look at [Stack 
Overflow](https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20)
