Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3795#issuecomment-68090285
There are a few alternative approaches that we could consider for adding
warnings / errors, but they run up against various roadblocks:
- Ideally, we could move the
`Partitioner.assertPartitionerSupportsKeyClass` call to the `ShuffledRDD`
constructor. This would ensure that new transformations benefit from these
checks, too, and minimizes the risk that someone forgets to add the call.
Unfortunately, ShuffledRDD does not have access to the key and value
ClassManifests. We can't change ShuffledRDD to capture ClassManifests because
that would break binary compatibility for a class that's marked
`@DeveloperApi`. TypeTags would also fix this issue, but we can't use those
because they would introduce binary compatibilities in public APIs.
- We could move the checks for array/enum keys to the `PairRDDFunctions`
constructor, but at that point we don't know which `Partitioner` class is being
used. If we threw an error, this would break any user code which relied on a
custom `Partitioner` subclass to properly partition array / enum keys.
- We could move these checks to runtime by reflecting on the first element
of each partition's iterator. This works around some erasure issues, but it
may be difficult to implement: there's not a great way to peek at the first
element of a Scala iterator, so we'd have to either wrap the iterator to let us
push the first element back, or move the checking logic into the low-level code
that consumes the iterator. These iterators are consumed in the hash- and
sort-shuffle writers, both of which are already fairly complicated, so I'm
concerned that this would be an invasive change.
There might be an argument in favor of never throwing warnings and only
logging ERROR-level messages when using these key classes. I think that we
should definitely throw an exception whenever we see `HashPartitioner` being
used with classes for which we know it has the wrong behavior, since this is
probably the case that users are most likely to run into (I don't think many
people use custom partitioners) and they might ignore log messages (e.g. if
they're running in an interactive notebook).
I think a reasonable compromise is to log a WARNING-level message in the
`PairRDDFunctions` constructor whenever we see an array / enum key class, and
keep the current exception-throwing logic for the `HashPartitioner` case.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]