Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/455#issuecomment-45432064
@pwendell thanks for the comments. Will make those amendments.
I get that Spark SQL will be the preferred way of working with
structured data in general, so I'm happy to mark this feature as advanced and
experimental, and have it superseded by the relevant Spark SQL functionality
later.
Having said this, I see this as simply adding missing API functionality to
PySpark. All it does is bring PySpark on a par with Scala/Java in terms of
being able to read from any ```InputFormat```. The vast majority of Spark users
end up using ```textFile```, with others probably using ```SequenceFile``` and
other common formats (such as HBase, Cassandra, Parquet etc.). I expect the same
to be the case here - most users won't use this functionality, but some with
more specialized or advanced cases will (and some users do have weird, custom
```InputFormat```s and unstructured data).
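To make the scope concrete, here is a minimal sketch of the kind of call this adds (method names mirror the Scala API as proposed in this PR; the paths and classes below are placeholders):
```python
# read a SequenceFile of (Text, IntWritable) pairs; the Writables are
# deserialized to Python primitives by the default converter
pairs = sc.sequenceFile("hdfs:///path/to/seqfile")
pairs.first()  # e.g. (u'some-key', 123)

# read an arbitrary new-API InputFormat by class name, exactly as in Scala/Java
lines = sc.newAPIHadoopFile(
    "hdfs:///path/to/data",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
```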
The ```Converter``` interface is only really used when the data being read
is of some special format (such as HBase and Cassandra binary data). For
example, reading Elasticsearch "just works" and uses the standard converter to
convert the Writables to primitives. So using ES with SchemaRDD is as easy as
below (and this is exactly how it would work in Scala/Java too):
```python
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

# assume Elasticsearch is running on localhost with default settings
conf = {"es.resource": "index/type"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
                         "org.apache.hadoop.io.NullWritable",
                         "org.elasticsearch.hadoop.mr.LinkedMapWritable",
                         conf=conf)

# each value is a MapWritable that is converted to a Python dict
rdd.first()
# (u'Elasticsearch ID',
#  {u'field1': True,
#   u'field2': u'Some Text',
#   u'field3': 12345})

dicts = rdd.map(lambda x: x[1])
table = sqlCtx.inferSchema(dicts)
table.registerAsTable("table")
```
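From there it's just a regular SQL query over the registered table; a minimal sketch, using the field names from the sample record above:
```python
# query the inferred schema; field names come from the sample record above
sqlCtx.sql("SELECT field2, field3 FROM table").collect()
```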
@mateiz I agree that the ```Converter``` interface is only an extension
point for cases where the ```InputFormat``` produces data that is not
picklable (or is, e.g., binary data that needs to be extracted), and that we
shouldn't add any converters to Spark itself - the Cassandra and HBase ones are
just examples of what can be done.
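For reference, a custom converter only enters the picture when it is passed in explicitly; a minimal sketch, assuming the converter parameters exposed by this PR and a hypothetical user-supplied converter class (which would be written against the ```Converter``` trait in Scala/Java and shipped on the classpath):
```python
conf = {"hbase.zookeeper.quorum": "localhost",
        "hbase.mapreduce.inputtable": "mytable"}

# the converter class names below are illustrative only; they stand in for
# user-provided Converter implementations on the JVM side
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="com.example.converters.BytesWritableToStringConverter",
    valueConverter="com.example.converters.HBaseResultToStringConverter",
    conf=conf)
```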