GitHub user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/455#issuecomment-44450061
  
    Hey Nick, I finally had time to do a first pass through this. This looks 
very good -- it's great how broad a set of classes it supports out of the box. 
I made some notes on the PR, but also a few high-level comments:
    * It would be good to move the SequenceFile tests to `tests.py` instead of doctests, since they are fairly involved.
    * `hadoopRDD`, `hadoopFile`, etc. also need tests. You can pass `SequenceFileInputFormat` or `TextInputFormat` to those and just check that they work.
    * It would be good to add a few tests where the key or value class passed in is wrong, just to check that we get a sensible error message and don't completely crash (there's a rough sketch of these tests after this list).
    * What are your thoughts on letting people specify their own wrappers? Maybe we can make a Java / Scala interface called `PythonConverter` or something that users can implement, and then they'd pass the class name of their custom `PythonConverter` (see the second sketch below). We should just make sure this is a stable API we can support in the future.
    * I like automatically inferring the data types for SequenceFiles, as it's very Python-like, but maybe we shouldn't do it for other types of Hadoop datasets at first, just in case those InputFormats need to have the key and value classes set in the Configuration (see the last snippet below). I'm not 100% sure how this works, though (whether some InputFormats look at those); SequenceFiles carry that info in the file itself and ignore what you pass in.
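
    For concreteness, here is a rough sketch of what such tests might look
    like in `tests.py`. It assumes the `sequenceFile` / `hadoopFile`
    signatures from this PR (a path plus InputFormat, key, and value class
    names); the fixture path and data are invented for illustration:

    ```python
    import unittest

    from pyspark import SparkContext


    class InputFormatTests(unittest.TestCase):
        """Illustrative sketch only; the fixture path and data are made up."""

        @classmethod
        def setUpClass(cls):
            cls.sc = SparkContext("local", "InputFormatTests")
            # Assumes a SequenceFile of (IntWritable, Text) records
            # [(1, "aa"), (2, "bb"), (3, "cc")] was written here beforehand.
            cls.path = "/tmp/test_sequence_file"

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_sequence_file(self):
            # The key/value types should be inferred from the file itself.
            rdd = self.sc.sequenceFile(self.path)
            self.assertEqual(rdd.collect(), [(1, u"aa"), (2, u"bb"), (3, u"cc")])

        def test_hadoop_file(self):
            # The same data through the generic hadoopFile path, with the
            # InputFormat and key/value classes passed explicitly.
            rdd = self.sc.hadoopFile(
                self.path,
                "org.apache.hadoop.mapred.SequenceFileInputFormat",
                "org.apache.hadoop.io.IntWritable",
                "org.apache.hadoop.io.Text")
            self.assertEqual(rdd.count(), 3)

        def test_bad_value_class(self):
            # A deliberately wrong value class: we only want a sensible
            # Python exception here, not a hard crash.
            with self.assertRaises(Exception):
                self.sc.hadoopFile(
                    self.path,
                    "org.apache.hadoop.mapred.SequenceFileInputFormat",
                    "org.apache.hadoop.io.IntWritable",
                    "org.apache.hadoop.io.LongWritable").collect()


    if __name__ == "__main__":
        unittest.main()
    ```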
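
    On the converter idea, the Python call site might then look roughly like
    this; the `valueConverter` keyword and the example class name are purely
    hypothetical:

    ```python
    # Hypothetical usage: "com.example.VectorConverter" would be a user class
    # implementing the proposed PythonConverter interface on the JVM side and
    # shipped on the executor classpath; the keyword argument name is a guess.
    rdd = sc.sequenceFile(
        "hdfs:///data/vectors",
        valueConverter="com.example.VectorConverter")
    ```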
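
    And on the last point, keeping the generic endpoints explicit would mean
    calls like the one below, where the key and value classes are always
    passed and any extra Configuration entries an InputFormat might need go
    in a `conf` dict; again just a sketch, assuming `hadoopRDD` takes the
    class names plus an optional `conf` argument:

    ```python
    # Explicit key/value classes plus a job conf, for InputFormats that may
    # read their settings from the Configuration rather than the file itself.
    conf = {"mapred.input.dir": "/tmp/test_sequence_file"}
    rdd = sc.hadoopRDD(
        "org.apache.hadoop.mapred.SequenceFileInputFormat",
        "org.apache.hadoop.io.IntWritable",
        "org.apache.hadoop.io.Text",
        conf=conf)
    ```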

