Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11242#discussion_r56968395
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala ---
    @@ -62,7 +64,23 @@ class UnionRDD[T: ClassTag](
         var rdds: Seq[RDD[T]])
       extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies
     
    +  // Evaluate partitions in parallel. Partitions of each rdd will be cached by the `partitions`
    +  // val in `RDD`.
    +  private[spark] lazy val parallelPartitionEval: Boolean = {
    --- End diff --
    
    Reuse wouldn't cause concurrent access to an `InputFormat` instance by 
itself. I thought the issue here was use for reading/writing data rather than 
`getSplits`. I could be wrong that MR would have two map tasks reading/writing 
in one JVM at the same time, but in any event, it's actually Spark that 
matters here. And RDDs' `InputFormat` can definitely be accessed concurrently 
in Spark, and map tasks from the same RDD can run in the same JVM.
    
    I guess the point is: _if_ this is a problem, this doesn't resolve it. More 
immediately, there are cases where the same _instance_ of `InputFormat` is 
accessed concurrently. I don't think we've heard of any problems to date though?
    
    I can't see working around this kind of implementation problem. It's not 
just that the implementation assumes no concurrent access to an instance, but 
it would have to assume no concurrent access to any two instances at the same 
time. There are a hundred lesser sins I could see working around before this.
    
    Is there a specific known issue this is working around? I'm not following 
why this would ever be the right thing to do. Turning off this flag still 
leaves you with more serious silent errors elsewhere. I think this can be as 
simple as parallelizing the outer loop with no additional problems.
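
    For illustration, a minimal sketch of parallelizing only the outer loop over 
child RDDs. It uses plain Scala futures and a hypothetical `FakeRdd` stand-in 
rather than Spark's own classes; `FakeRdd`, `ParallelOuterLoop`, and 
`allPartitions` are assumed names that do not appear in the PR, and the PR's 
actual implementation may differ.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for an RDD whose partition listing is expensive
// (for example, a getSplits-style call against a remote file system).
final class FakeRdd(val name: String, numPartitions: Int) {
  // Like the `partitions` val in `RDD`: computed once, then cached per instance.
  lazy val partitions: Vector[String] =
    Vector.tabulate(numPartitions)(i => s"$name-part-$i")
}

object ParallelOuterLoop {
  // Outer loop only: one task per child RDD. Each task touches only its own
  // FakeRdd, so this code never uses a single instance from two threads.
  def allPartitions(rdds: Seq[FakeRdd]): Seq[String] = {
    val perRdd: Seq[Future[Vector[String]]] =
      rdds.map(r => Future(r.partitions))
    // Future.sequence preserves input order, so the result is deterministic.
    Await.result(Future.sequence(perRdd), 30.seconds).flatten
  }
}
```

Calling `ParallelOuterLoop.allPartitions(Seq(new FakeRdd("a", 2), new FakeRdd("b", 1)))` 
evaluates each stand-in's partitions on the global execution context and 
concatenates them in input order. Note that the sketch does nothing to protect 
against two different instances being used at the same time, which is the 
assumption questioned above.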

