Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2287#issuecomment-54636076
Hi @wardviaene,
Do you have an example program that reproduces this bug? We should
probably add it as a regression test (see `python/pyspark/tests.py` for
examples of how to do this).
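
For reference, a minimal regression test along these lines might look something like the sketch below. The class name, the 2 MB file size, and the assumption that reading an open handle inside a mapped function is what triggers the file-pickling path are all mine, not taken from this thread; I'm also assuming `PySparkTestCase` (the base class used elsewhere in `python/pyspark/tests.py`) is the right harness here.

```python
# Hypothetical regression test sketch -- names and sizes are assumptions,
# not taken from the reported bug.
import os
import tempfile
import unittest

from pyspark.tests import PySparkTestCase


class FileObjectPicklingTests(PySparkTestCase):
    def test_pickling_large_file_handle_in_closure(self):
        # Create a file assumed to be larger than cloudpickle's transmit limit.
        payload = b"x" * (2 * 1024 * 1024)  # 2 MB
        fd, path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
            fh = open(path, "rb")
            # Referencing the open handle inside the closure should force the
            # file-pickling code path when the task is serialized.
            result = self.sc.parallelize([1]).map(lambda _: len(fh.read())).collect()
            self.assertEqual(result, [len(payload)])
            fh.close()
        finally:
            os.remove(path)


if __name__ == "__main__":
    unittest.main()
```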
(For other reviewers: you can browse SerializingAdapter's code at
http://pydoc.net/Python/cloud/2.7.0/cloud.transport.adapter/.) It looks like
this code is designed to handle the pickling of `file()` objects. The Dill
developers have recently been discussing how to pickle file handles:
https://github.com/uqfoundation/dill/issues/57
It looks like `SerializingAdapter.max_transmit_data` acts as an upper limit
on the size of the closures that PiCloud would send to their service. Unlike
PiCloud, we don't impose limits on closure sizes (there are warnings for large
closures, but those are detected and enforced inside the JVM). Therefore, I
wonder whether we should just remove this limit and allow the whole file to be
read, rather than adding an obscure configuration option.
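
To make that suggestion concrete: removing the cap would essentially mean the pickler captures the handle's name, position, and full contents, and rebuilds an equivalent handle on the worker. The sketch below is a plain-Python paraphrase of that idea, not the actual cloudpickle source; the function and class names are hypothetical, and it assumes a binary-mode handle.

```python
# Illustrative sketch only: serializing a file handle without a size cap by
# capturing (name, position, full contents). Names here are hypothetical and
# this is not cloudpickle's actual implementation.
import io
import pickle


class _RestoredFile(io.BytesIO):
    """In-memory stand-in for a file handle rebuilt on the worker side."""
    def __init__(self, name, contents):
        io.BytesIO.__init__(self, contents)
        self.name = name


def serialize_file_handle(fh):
    pos = fh.tell()
    fh.seek(0)
    contents = fh.read()  # read everything; no max_transmit_data-style cap
    fh.seek(pos)
    return pickle.dumps((fh.name, pos, contents))


def deserialize_file_handle(payload):
    name, pos, contents = pickle.loads(payload)
    restored = _RestoredFile(name, contents)
    restored.seek(pos)
    return restored
```

The real change would of course live in PySpark's bundled cloudpickle code, but the behavior is the same: read the whole file rather than failing past a threshold.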