GitHub user mateiz commented on the pull request:
https://github.com/apache/spark/pull/455#issuecomment-40882224
Cool, thanks for porting this over! A few notes:
- I looked at msgpack in the past and one problem with it was that users
need to install it separately through "pip" to use PySpark. Before this, we had
no external package dependencies except NumPy for ML. For this reason it would
be good to investigate Pyrolite instead (which just uses pickling on the Python
side). If that doesn't work, we should structure the code so that msgpack is
imported only when one of these methods is actually called.
- There are a bunch of binary test files included. Would it be possible to
generate those programmatically instead (e.g. through saveAsSequenceFile, or
through a Java-side static method)?
- The build is failing due to the scalastyle checker; you can run "sbt
scalastyle" locally to reproduce the same checks. (See
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14259/console
for the current errors.)