Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/266#issuecomment-40273037
I did not notice this earlier.
The toByteArray method copies the entire buffer on every call, which is
insanely expensive for anything nontrivial.
A better solution would be to replace the use of ByteArrayOutputStream with
an in-house variant that gives us direct access to the byte[], if we don't
want to use fastutil.
We are already hitting cases where ByteArrayOutputStream fails because of
the 2G limit.
This PR will make us create two copies of the same data: the performance
implication of this is terrible.
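
To illustrate the in-house variant suggested above, here is a minimal sketch,
not code from this PR or from Spark. It relies on the fact that
java.io.ByteArrayOutputStream keeps its data in the protected fields buf and
count, so a subclass can hand out the backing array without copying. The
class name ExposedByteArrayOutputStream and the methods rawBuffer and
toByteBuffer are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical class for illustration; not part of Spark or fastutil.
class ExposedByteArrayOutputStream extends ByteArrayOutputStream {

    ExposedByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    // Returns the live backing array without copying.
    // Only the first size() bytes hold written data; the rest is spare capacity.
    synchronized byte[] rawBuffer() {
        return buf;
    }

    // Wraps the written region [0, count) in a ByteBuffer without copying.
    synchronized ByteBuffer toByteBuffer() {
        return ByteBuffer.wrap(buf, 0, count);
    }
}
```

A caller would read rawBuffer() together with size() instead of calling
toByteArray(). The returned array is shared with the stream, so it may be
replaced by a later write that grows the buffer and should be treated as
read-only. This avoids the extra copy but does not lift the 2G limit, since
a single byte[] is still bounded by Integer.MAX_VALUE.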