Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Nick Pentreath
Hi Matei, I'm afraid I haven't had enough time to focus on this, as work has just been crazy. It's still something I want to get to a mergeable state. Actually, it was working fine; it was just a bit rough and needs to be updated against HEAD. I'll absolutely try my utmost to get something

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Matei Zaharia
Hey Nick, no worries if this can’t be done in time. It’s probably better to test it thoroughly. If you do have something partially working though, the main concern will be the API, i.e. whether it’s an API we want to support indefinitely. It would be bad to add this and then make major changes
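For context on the API being weighed here, a minimal sketch of how reading a SequenceFile or an arbitrary new-API Hadoop InputFormat might look from PySpark; the signatures, paths and class names below are illustrative assumptions, not the finalised interface:

    # A minimal sketch, assuming a PySpark API that mirrors the Scala
    # newAPIHadoopFile/sequenceFile methods; paths are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="InputFormatSketch")

    # Read a SequenceFile, letting the JVM side work out the Writable
    # key/value types and hand back standard Python objects.
    seq_rdd = sc.sequenceFile("hdfs:///data/events.seq")

    # Read an arbitrary new-API Hadoop InputFormat by fully qualified
    # class names for the format, key and value.
    text_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text")

    print(text_rdd.take(2))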

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Nick Pentreath
OK, I'll work something up and reopen a PR against the new Spark mirror. The API itself mirrors the newHadoopFile etc. methods, so that should be quite stable once finalised. It's the wrapper layer, i.e. how to serialize custom classes and read them in Python, that is the potentially tricky
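A hedged sketch of that wrapper concern: custom Writable key/value classes are not directly picklable, so a JVM-side converter would translate them into plain Java types before they reach Python. All class names below are hypothetical placeholders for user-supplied code:

    # Hypothetical custom classes; the keyConverter/valueConverter strings
    # name JVM-side converter implementations that turn the custom Writables
    # into simple Java types PySpark can pickle.
    from pyspark import SparkContext

    sc = SparkContext(appName="ConverterSketch")

    custom_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/custom",
        "com.example.CustomInputFormat",      # hypothetical InputFormat
        "com.example.CustomKey",              # hypothetical key Writable
        "com.example.CustomValue",            # hypothetical value Writable
        keyConverter="com.example.CustomKeyToJavaConverter",
        valueConverter="com.example.CustomValueToJavaConverter")

    print(custom_rdd.first())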

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-18 Thread Matei Zaharia
Hey Nick, I’m curious: have you been doing any further development on this? It would be good to get expanded InputFormat support into Spark 1.0. To start with, we don’t have to do SequenceFiles in particular; we can do stuff like Avro (if it’s easy to read in Python) or some kind of
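As a sketch of the Avro route, assuming Avro's new-API AvroKeyInputFormat plus an illustrative converter class that unwraps AvroKey records into plain types on the JVM side:

    # Illustrative sketch: AvroKeyInputFormat yields (AvroKey, NullWritable)
    # pairs; a JVM-side converter (hypothetical class name below) unwraps
    # each AvroKey into plain types that can be pickled across to Python.
    from pyspark import SparkContext

    sc = SparkContext(appName="AvroSketch")

    avro_rdd = sc.newAPIHadoopFile(
        "hdfs:///data/records.avro",
        "org.apache.avro.mapreduce.AvroKeyInputFormat",
        "org.apache.avro.mapred.AvroKey",
        "org.apache.hadoop.io.NullWritable",
        keyConverter="com.example.AvroWrapperToJavaConverter")

    records = avro_rdd.map(lambda kv: kv[0]).take(5)
    print(records)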