Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3795#issuecomment-68090285
  
    There are a few alternative approaches that we could consider for adding 
warnings / errors, but they run up against various roadblocks:
    
    - Ideally, we could move the 
`Partitioner.assertPartitionerSupportsKeyClass` call to the `ShuffledRDD` 
constructor.  This would ensure that new transformations benefit from these 
checks, too, and minimizes the risk that someone forgets to add the call.  
Unfortunately, ShuffledRDD does not have access to the key and value 
ClassManifests.  We can't change ShuffledRDD to capture ClassManifests because 
that would break binary compatibility for a class that's marked 
`@DeveloperApi`.  TypeTags would also fix this issue, but we can't use those 
because they would introduce binary compatibilities in public APIs.
    - We could move the checks for array/enum keys to the `PairRDDFunctions` 
constructor, but at that point we don't know which `Partitioner` class is being 
used.  If we threw an error, this would break any user code which relied on a 
custom `Partitioner` subclass to properly partition array / enum keys.
    - We could move these checks to runtime by reflecting on the first element 
of each partition's iterator.  This works around some erasure issues, but it 
may be difficult to implement: there's not a great way to peek at the first 
element of a Scala iterator, so we'd have to either wrap the iterator to let us 
push the first element back, or move the checking logic into the low-level code 
that consumes the iterator.  These iterators are consumed in the hash- and 
sort-shuffle writers, both of which are already fairly complicated, so I'm 
concerned that this would be an invasive change.
    
    There might be an argument in favor of never throwing warnings and only 
logging ERROR-level messages when using these key classes.  I think that we 
should definitely throw an exception whenever we see `HashPartitioner` being 
used with classes for which we know it has the wrong behavior, since this is 
probably the case that users are most likely to run into (I don't think many 
people use custom partitioners) and they might ignore log messages (e.g. if 
they're running in an interactive notebook).
    
    I think a reasonable compromise is to log a WARNING-level message in the 
`PairRDDFunctions` constructor whenever we see an array / enum key class, and 
keep the current exception-throwing logic for the `HashPartitioner` case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to