Github user markhamstra commented on a diff in the pull request:
https://github.com/apache/spark/pull/12243#discussion_r59025794
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -46,37 +50,80 @@ case class PartitionedFile(
*/
case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends Partition
+object FileScanRDD {
+  private val ioExecutionContext = ExecutionContext.fromExecutorService(
+    ThreadUtils.newDaemonCachedThreadPool("FileScanRDD", 16))
--- End diff ---
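For illustration only, a minimal sketch of deriving that bound from configuration rather than hard-coding 16; the `spark.sql.files.io.threads` key below is hypothetical, not an existing Spark setting:

    import scala.concurrent.ExecutionContext

    import org.apache.spark.SparkEnv
    import org.apache.spark.util.ThreadUtils

    object FileScanRDD {
      // Hypothetical cap on IO threads; falls back to the current default of 16.
      private val ioThreads: Int =
        SparkEnv.get.conf.getInt("spark.sql.files.io.threads", 16)

      private val ioExecutionContext = ExecutionContext.fromExecutorService(
        ThreadUtils.newDaemonCachedThreadPool("FileScanRDD", ioThreads))
    }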
Shouldn't it be the total number of cores the user is willing to dedicate
to a single Job? This looks similar to an issue in ParquetRelation where
a `parallelize` call can end up tying up all of the cores (defaultParallelism)
on a single Job. While this PR should allow better progress to be made during
that kind of blocking, I think what we really need is to implement what was
suggested a while ago for the scheduling pools: a max-cores limit in addition
to the current min cores. With that in place, and with the max-cores value
exposed to these large IO operations, users who care about not blocking
concurrent Jobs could use pools that neither consume all of the available cores
nor oversubscribe the cores that the pool does have.
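Purely as a sketch of that pool idea, assuming a per-pool core budget were exposed: the "io" pool name, the maxCores parameter, and the listing placeholder below are all assumptions, since the fair scheduler today only supports minShare/weight, not a max-cores cap.

    import org.apache.spark.SparkContext

    object BoundedIoJob {
      // Run a large IO job in a dedicated fair-scheduler pool and cap its slice
      // count to a core budget instead of sc.defaultParallelism.
      def listInPool(sc: SparkContext, paths: Seq[String], maxCores: Int): Seq[String] = {
        sc.setLocalProperty("spark.scheduler.pool", "io")
        try {
          val slices = math.max(1, math.min(maxCores, paths.size))
          sc.parallelize(paths, numSlices = slices)
            .map(path => path)       // placeholder for the real per-path listing work
            .collect().toSeq
        } finally {
          sc.setLocalProperty("spark.scheduler.pool", null)   // restore the default pool
        }
      }
    }

With a real per-pool max-cores bound, the same value could feed both the slice count here and the IO thread pool size above.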