Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/455#issuecomment-45432064
@pwendell thanks for the comments. Will make those amendments.
I get that Spark SQL will be the preferred way of working with
structured data in general, so I'm happy to mark this feature as advanced and
experimental, and have it superseded by the relevant Spark SQL functionality
later.
Having said this, I see this as simply adding missing API functionality to
PySpark. All it does is bring PySpark on a par with Scala/Java in terms of
being able to read from any ```InputFormat```. The vast majority of Spark users
end up using ```textFile```, with others probably using ```SequenceFile``` and
other common formats (such as HBase, Cassandra, Parquet etc.). I expect the same
to be the case here - most users won't use this functionality, but some with
more specialized or advanced cases will (and some users do have weird, custom
```InputFormat```s and unstructured data).
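To make the scope concrete, here is a minimal sketch of the kind of call this adds (method names mirror the Scala API as proposed in this PR; the paths and classes below are placeholders):
```python
# read a SequenceFile of (Text, IntWritable) pairs; the Writables are
# deserialized to Python primitives by the default converter
pairs = sc.sequenceFile("hdfs:///path/to/seqfile")
pairs.first()  # e.g. (u'some-key', 123)

# read an arbitrary new-API InputFormat by class name, exactly as in Scala/Java
lines = sc.newAPIHadoopFile(
    "hdfs:///path/to/data",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
```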
The ```Converter``` interface is only really used when the data being read
is of some special format (such as HBase and Cassandra binary data). For
example, reading Elasticsearch "just works" and uses the standard converter to
convert the Writables to primitives. So using ES with SchemaRDD is as easy as
below (and this is exactly how it would work in Scala/Java too):
```python
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

# assume Elasticsearch is running on localhost with default settings
conf = {"es.resource": "index/type"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
                         "org.apache.hadoop.io.NullWritable",
                         "org.elasticsearch.hadoop.mr.LinkedMapWritable",
                         conf=conf)

# each value is a MapWritable that is converted to a Python dict
rdd.first()
# (u'Elasticsearch ID',
#  {u'field1': True,
#   u'field2': u'Some Text',
#   u'field3': 12345})

dicts = rdd.map(lambda x: x[1])
table = sqlCtx.inferSchema(dicts)
table.registerAsTable("table")
```
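From there it's just a regular SQL query over the registered table; a minimal sketch, using the field names from the sample record above:
```python
# query the inferred schema; field names come from the sample record above
sqlCtx.sql("SELECT field2, field3 FROM table").collect()
```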
@mateiz I agree that the ```Converter``` interface is only an extension
point for cases where the ```InputFormat``` produces data that is not
picklable (or is, e.g., binary data that needs to be extracted), and that we
shouldn't add any converters to Spark itself - the Cassandra and HBase ones are
just examples of what can be done.
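For reference, a custom converter only enters the picture when it is passed in explicitly; a minimal sketch, assuming the converter parameters exposed by this PR and a hypothetical user-supplied converter class (which would be written against the ```Converter``` trait in Scala/Java and shipped on the classpath):
```python
conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapreduce.inputtable": "mytable"}

# the converter class names below are illustrative only; they stand in for
# user-provided Converter implementations on the JVM side
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="com.example.converters.BytesWritableToStringConverter",
    valueConverter="com.example.converters.HBaseResultToStringConverter",
    conf=conf)
```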