GitHub user mulby opened a pull request:
https://github.com/apache/spark/pull/3978
[SPARK-5138][SQL] Ensure schema can be inferred from a namedtuple
When inferring the schema of an RDD that contains namedtuples, pyspark
fails to recognize the records as namedtuples and raises an error.
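For reference, here is a minimal sketch of how a namedtuple can be told apart from a plain tuple. It assumes detection through the `_fields` attribute that `collections.namedtuple` classes expose. `is_namedtuple` and `row_to_dict` are hypothetical helpers for illustration, not pyspark functions, and this is not necessarily how the patch does it.

```python
from collections import namedtuple


def is_namedtuple(obj):
    """Heuristic check: a namedtuple is a tuple subclass that exposes `_fields`."""
    return isinstance(obj, tuple) and hasattr(obj, "_fields")


def row_to_dict(obj):
    # Hypothetical helper (not part of pyspark): map field names to values.
    if is_namedtuple(obj):
        return dict(zip(obj._fields, obj))
    # Wrap obj in a 1-tuple so a tuple argument is formatted as a single value.
    raise ValueError("unexpected tuple: %r" % (obj,))


TextLine = namedtuple("TextLine", "line length")
record = TextLine(line="# Apache Spark", length=14)
as_dict = row_to_dict(record)
```

A plain tuple such as `("a", 1)` has no `_fields` attribute, so `is_namedtuple` returns `False` for it.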
Example:
```python
import os
from collections import namedtuple

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))

TextLine = namedtuple('TextLine', 'line length')
tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
tuple_rdd.take(5)  # This works

sqlc = SQLContext(sc)
# The following line raises an error
schema_rdd = sqlc.inferSchema(tuple_rdd)
```
The error raised is:
```
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
    yield next(iterator)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
    raise ValueError("unexpected tuple: %s" % obj)
TypeError: not all arguments converted during string formatting
```
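Note that the traceback ends in a `TypeError`, not the intended `ValueError`. When the right-hand side of `%` is a tuple, Python unpacks it as the list of format arguments, so a two-field record does not fit the single `%s`. The sketch below reproduces this with a plain tuple standing in for the record (an assumption for illustration; a namedtuple is a tuple subclass, which is consistent with the traceback above) and shows the usual workaround of wrapping the value in a 1-tuple.

```python
obj = ("some text", 9)  # stands in for a two-field record

try:
    # The tuple is unpacked into two arguments for a single %s placeholder.
    message = "unexpected tuple: %s" % obj
except TypeError as exc:
    formatting_error = str(exc)

# Wrapping in a 1-tuple formats the whole tuple as one value.
safe_message = "unexpected tuple: %s" % (obj,)
```

Here `formatting_error` holds the same message seen in the traceback, and `safe_message` contains the full tuple.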
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mulby/spark inferschema-namedtuple
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3978.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3978
----
commit 375d96b3c6a7c8035f464f7c5f72bef1951f564b
Author: Gabe Mulley <[email protected]>
Date: 2015-01-09T13:48:15Z
Ensure schema can be inferred from a namedtuple
----