steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963

> it seems like the sweet spot is somewhere between 20-30 threads (for my environment anyways, 2015 macbook pro, i7/w 8 cores).

Interesting. You may get different numbers running in EC2; it's always best to benchmark perf there. Remote dev amplifies some performance issues (cost of reopening an http connection, general latency) while hiding others (how easy it is for spark jobs to overload s3 shards and so get throttled, cause delays, trigger speculative task execution, more throttling, etc, etc).

Try changing "fs.s3a.connection.maximum" from the default of 48 to something bigger. That's the limit on the http pool size. It's a small number to stop a single s3a instance from overloading the system, but you may want to consider raising it.

There's also "fs.s3a.max.total.tasks", which controls the size of the thread pool used for background writing of blocks of large files; in hadoop trunk that pool also handles parallel delete/rename operations, plus stuff in the AWS SDK itself.

* "fs.s3a.connection.maximum" should be greater than "fs.s3a.max.total.tasks"
* raise "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections around for longer (avoids that https overhead)

Try with some bigger numbers and see if you get the same results. Your scanning threads may just be blocking on the http connection pool.

For bonus fun, force random IO for ORC/parquet perf, but with remote reads, set the min block to read to be 256K or bigger:

```
spark.hadoop.fs.s3a.readahead.range 256K
spark.hadoop.fs.s3a.input.fadvise random
```

Note: Java 8's default SSL encryption is underperformant. We've been doing work there but it's too early to think about backporting it. I'm planning to do a refresh of the s3a connector for hadoop 3.2.2 which should include it (https://github.com/apache/hadoop/pull/970).

For now: look at [stack overflow](https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20)
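
For benchmarking, the settings mentioned above could be collected in one place and rendered as `spark-submit` flags. This is only a rough sketch: the helper names `s3a_tuning_conf` and `as_submit_flags` are hypothetical, and the values 96 and 64 are placeholder assumptions to experiment with, not recommendations. Only the 300 keepalive, the 256K readahead and the `random` fadvise come from the comment itself. The `spark.hadoop.` prefix is how Spark passes options through to the Hadoop configuration.

```python
# Hypothetical helper for experimenting with the s3a options discussed above.
# The defaults for connection_maximum and max_total_tasks are placeholders
# chosen for illustration; benchmark in your own environment (ideally EC2).

def s3a_tuning_conf(connection_maximum=96, max_total_tasks=64,
                    keepalive_seconds=300):
    """Return the s3a settings as a dict of Spark configuration entries."""
    # The http pool must be larger than the background thread pool,
    # otherwise worker threads may block waiting for a connection.
    if connection_maximum <= max_total_tasks:
        raise ValueError(
            "fs.s3a.connection.maximum must be greater than "
            "fs.s3a.max.total.tasks"
        )
    return {
        "spark.hadoop.fs.s3a.connection.maximum": str(connection_maximum),
        "spark.hadoop.fs.s3a.max.total.tasks": str(max_total_tasks),
        "spark.hadoop.fs.s3a.threads.keepalivetime": str(keepalive_seconds),
        # Random IO for ORC/parquet, with a larger minimum read for remote reads.
        "spark.hadoop.fs.s3a.readahead.range": "256K",
        "spark.hadoop.fs.s3a.input.fadvise": "random",
    }


def as_submit_flags(conf):
    """Render a conf dict as a list of spark-submit --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    for flag in as_submit_flags(s3a_tuning_conf()):
        print(flag)
```

Rejecting a pool size that does not exceed the task count up front makes a misconfigured benchmark run fail immediately rather than quietly stall on the connection pool.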