Hi Spark Devs I was wondering what appetite there may be to add the ability for PySpark users to create RDDs from (somewhat) arbitrary Hadoop InputFormats.
In my data pipeline for example, I'm currently just using Scala (partly because I love it but also because I am heavily reliant on quite custom Hadoop InputFormats for reading data). However, many users may prefer to use PySpark as much as possible (if not for everything). Reasons might include the need to use some Python library. While I don't do it yet, I can certainly see an attractive use case for using say scikit-learn / numpy to do data analysis & machine learning in Python. Added to this my cofounder knows Python well but not Scala so it can be very beneficial to do a lot of stuff in Python. For text-based data this is fine, but reading data in from more complex Hadoop formats is an issue. The current approach would of course be to write an ETL-style Java/Scala job and then process in Python. Nothing wrong with this, but I was thinking about ways to allow Python to access arbitrary Hadoop InputFormats. Here is a quick proof of concept: https://gist.github.com/MLnick/7150058 This works for simple stuff like SequenceFile with simple Writable key/values. To work with more complex files, perhaps an approach is to manipulate Hadoop JobConf via Python and pass that in. The one downside is of course that the InputFormat (well actually the Key/Value classes) must have a toString that makes sense so very custom stuff might not work. I wonder if it would be possible to take the objects that are yielded via the InputFormat and convert them into some representation like ProtoBuf, MsgPack, Avro, JSON, that can be read relatively more easily from Python? Another approach could be to allow a simple "wrapper API" such that one can write a wrapper function T => String and pass that into an InputFormatWrapper that takes an arbitrary InputFormat and yields Strings for the keys and values. Then all that is required is to compile that function and add it to the SPARK_CLASSPATH and away you go! Thoughts? Nick