GitHub user mateiz commented on the pull request:
https://github.com/apache/spark/pull/455#issuecomment-44450061
Hey Nick, I finally had time to do a first pass through this. This looks
very good -- it's great how broad a set of classes it supports out of the box.
I made some notes on the PR, but also a few high-level comments:
* It would be good to move the SequenceFile tests to `tests.py` instead of
doctests, since these are fairly involved (first sketch below).
* `hadoopRDD`, `hadoopFile`, etc. also need tests. You can pass
`SequenceFileInputFormat` or `TextInputFormat` to those and just check that
they work (second sketch below).
* It would be good to add a few tests where the key or value class passed
in is wrong, just to check that we get a sensible error message and don't
completely crash (third sketch below).
* What are your thoughts on letting people specify their own wrappers?
Maybe we can make a Java / Scala interface called `PythonConverter` or
something that users can implement, and then they'd pass the class name of
their custom `PythonConverter` (fourth sketch below). We should just make
sure this is a stable API we can support in the future.
* I like automatically inferring the data types for SequenceFiles, as it's
very Python-like, but maybe we shouldn't do it for other kinds of Hadoop
datasets at first, just in case those InputFormats need the key and value
classes set in the Configuration. I'm not 100% sure how this works (whether
some InputFormats actually look at those), but a SequenceFile carries that
info in the file header itself and ignores what you pass in (last sketch
below).
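Here's roughly what I'm picturing for the first point. This is just a
sketch — `PySparkTestCase` provides `self.sc` in the existing `tests.py`,
but the fixture path and expected contents are placeholders for whatever
SequenceFiles we end up generating on the Scala side:

```python
# Sketch of a tests.py case for SequenceFiles. Assumes a pre-generated
# fixture directory of (IntWritable, Text) pairs; the path and expected
# contents below are placeholders.
import os

from pyspark.tests import PySparkTestCase  # not needed inside tests.py itself

SEQ_FIXTURE = os.path.join(os.path.dirname(__file__), "test_support", "seqfiles")


class SequenceFileTests(PySparkTestCase):

    def test_sequence_file_round_trip(self):
        # No key/value classes passed: the types should be inferred
        # from the SequenceFile header.
        pairs = sorted(self.sc.sequenceFile(SEQ_FIXTURE).collect())
        self.assertEqual(pairs, [(1, u"aa"), (2, u"bb"), (3, u"cc")])
```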
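For `hadoopFile` / `hadoopRDD`, `TextInputFormat` is convenient because it
ships with Hadoop and only needs a plain text file. I'm assuming a signature
along the lines of `hadoopFile(path, inputFormatClass, keyClass, valueClass)`
— adjust to whatever the PR actually exposes:

```python
# Sketch of a hadoopFile test using the old-API TextInputFormat.
# The hadoopFile parameter order is an assumption about this PR's API.
import os
import tempfile

from pyspark.tests import PySparkTestCase


class InputFormatTests(PySparkTestCase):

    def test_text_input_format(self):
        # Write a small text file to read back through TextInputFormat.
        path = os.path.join(tempfile.mkdtemp(), "input.txt")
        with open(path, "w") as f:
            f.write("aa\nbb\ncc\n")
        rdd = self.sc.hadoopFile(
            path,
            "org.apache.hadoop.mapred.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text")
        # Keys are byte offsets into the file; the values are the lines.
        self.assertEqual(rdd.values().collect(), [u"aa", u"bb", u"cc"])
```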
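For the error-message tests, I'd settle for asserting that a bogus class
name surfaces as a catchable Python exception rather than killing the
gateway — something like this (again assuming `sequenceFile` takes optional
key/value class arguments, and reusing `SEQ_FIXTURE` from the first sketch;
the `keyClass` keyword is a guess at the parameter name):

```python
# Sketch: a nonexistent key class should raise a Python exception with a
# readable message, not crash the JVM gateway.
class ErrorHandlingTests(PySparkTestCase):

    def test_bad_key_class(self):
        self.assertRaises(
            Exception,
            lambda: self.sc.sequenceFile(
                SEQ_FIXTURE,
                keyClass="org.apache.hadoop.io.NoSuchWritable").collect())
```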
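On the converter question, the Python side could stay as simple as naming
the class. To be clear, both the parameter name and the class below are
hypothetical — the user would implement the proposed `PythonConverter`
interface in Java or Scala, put it on the classpath, and reference it here:

```python
# Hypothetical usage of a custom converter; keyConverter and the class
# name are made-up illustrations, and sc is an existing SparkContext.
rdd = sc.sequenceFile(
    "hdfs:///data/custom-writables",
    keyConverter="com.example.CustomWritableToPython")
```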
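And to illustrate the asymmetry I'm suggesting on the last point:
SequenceFiles would keep the inferred-type call, while arbitrary
InputFormats would state the classes explicitly, so formats that read them
out of the Configuration still get them (same assumed signatures as above):

```python
# SequenceFile: key/value types come from the file header, so the call
# can stay minimal and the types are inferred.
pairs = sc.sequenceFile("hdfs:///data/pairs.seq")

# Arbitrary InputFormat: key and value classes are stated explicitly, in
# case the format expects them in the job Configuration.
lines = sc.hadoopFile(
    "hdfs:///data/logs",
    "org.apache.hadoop.mapred.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
```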