GitHub user mulby opened a pull request:

    https://github.com/apache/spark/pull/3978

    [SPARK-5138][SQL] Ensure schema can be inferred from a namedtuple

    When attempting to infer the schema of an RDD that contains namedtuples,
    pyspark fails to identify the records as namedtuples and raises an error.
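
For context, here is a minimal sketch of one way to recognise a namedtuple record by duck typing, since `collections.namedtuple` classes share no common base class other than `tuple`. The helpers `is_namedtuple` and `field_names` are hypothetical names used for illustration; this is not necessarily the check that pyspark or this patch uses.

```python
from collections import namedtuple


def is_namedtuple(obj):
    """Hypothetical helper: duck-type check for namedtuple instances."""
    # namedtuple classes subclass tuple and expose a _fields tuple of names.
    fields = getattr(obj, '_fields', None)
    return (isinstance(obj, tuple)
            and isinstance(fields, tuple)
            and all(isinstance(name, str) for name in fields))


def field_names(obj):
    """Hypothetical helper: column names for a namedtuple record, else None."""
    return list(obj._fields) if is_namedtuple(obj) else None


TextLine = namedtuple('TextLine', 'line length')
record = TextLine(line='hello', length=5)
```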
    
    Example:
    
    ```python
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from collections import namedtuple
    import os
    
    sc = SparkContext()
    rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
    TextLine = namedtuple('TextLine', 'line length')
    tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
    tuple_rdd.take(5)  # This works
    
    sqlc = SQLContext(sc)
    
    # The following line raises an error
    schema_rdd = sqlc.inferSchema(tuple_rdd)
    ```
    
    The error raised is:
    ```
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
        process()
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
        yield next(iterator)
      File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
        raise ValueError("unexpected tuple: %s" % obj)
    TypeError: not all arguments converted during string formatting
    ```
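
The traceback ends in a `TypeError` rather than the intended `ValueError` because of how Python's `%` operator treats a tuple on its right-hand side. The sketch below reproduces that behaviour without Spark; the sample record is made up for illustration.

```python
from collections import namedtuple

TextLine = namedtuple('TextLine', 'line length')
obj = TextLine(line='# Apache Spark', length=14)

# A tuple on the right of % is unpacked into separate format arguments,
# so a single %s with a two-field tuple fails before ValueError is built.
try:
    "unexpected tuple: %s" % obj
except TypeError as exc:
    masked_error = str(exc)

# Wrapping the value in a one-element tuple formats it as a single argument.
safe_message = "unexpected tuple: %s" % (obj,)
```

Here `masked_error` holds the same message as the last line of the traceback, while `safe_message` contains the record's repr.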

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mulby/spark inferschema-namedtuple

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3978
    
----
commit 375d96b3c6a7c8035f464f7c5f72bef1951f564b
Author: Gabe Mulley <[email protected]>
Date:   2015-01-09T13:48:15Z

    Ensure schema can be inferred from a namedtuple

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
