Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21698
OK, the proposal @squito made, to sort on the binary/serialized data,
seems like at least a good short-term solution. Any sorting is definitely
going to add overhead, but at least it avoids data loss/corruption. Did
anyone see issues with that?
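To make the sorting idea concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes records are distributed round-robin and uses canonical JSON bytes as a stand-in for Spark's serialized row format; the names `serialize` and `round_robin` are hypothetical and not taken from the PR. It only illustrates why sorting on the serialized bytes makes the assignment independent of the order in which rows arrive.

```python
import json


def serialize(record):
    # Assumption: canonical JSON bytes stand in for Spark's serialized rows.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def round_robin(records, num_partitions, sort_first):
    """Deal records out round-robin, optionally sorting on serialized bytes first."""
    blobs = [serialize(r) for r in records]
    if sort_first:
        # Bytes compare lexicographically, so the resulting order depends
        # only on the data, not on the order in which rows arrived.
        blobs.sort()
    partitions = [[] for _ in range(num_partitions)]
    for i, blob in enumerate(blobs):
        partitions[i % num_partitions].append(json.loads(blob))
    return partitions


records = [{"id": i, "value": f"row-{i}"} for i in range(10)]
# Simulate a retried task that sees the same rows in a different order.
retry = list(reversed(records))

stable = round_robin(records, 3, sort_first=True) == round_robin(retry, 3, sort_first=True)
unstable = round_robin(records, 3, sort_first=False) != round_robin(retry, 3, sort_first=False)
print(stable, unstable)
```

The extra cost is the sort itself, which is the overhead mentioned above. In return, a retried task assigns every row to the same partition as the original attempt.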
Another solution would be to have another partitioner that somehow deals
with the skew. I'm not sure of the details, though, as it might need
sampling or something else to work.