[ 
https://issues.apache.org/jira/browse/SPARK-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455238#comment-15455238
 ] 

Apache Spark commented on SPARK-17360:
--------------------------------------

User 'Stibbons' has created a pull request for this issue:
https://github.com/apache/spark/pull/14918

> PySpark can create dataframe from a Python generator
> ----------------------------------------------------
>
>                 Key: SPARK-17360
>                 URL: https://issues.apache.org/jira/browse/SPARK-17360
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Semet
>            Priority: Trivial
>
> It looks like one can create a DataFrame from a Python generator, which might 
> be more efficient than building a list of rows and passing it to createDataFrame:
> {code}
> >>> # On Python 3, you want to use "range" on the following line
> >>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 10000000))
> >>> d  # Please note that 'd' is a generator and not a structure with the 10000000 elements.
> <generator object <genexpr> at 0x7f1234b92af0>
> >>> sqlContext.createDataFrame(d).take(5)
> [Row(age=0, name=u'Alice-0'), Row(age=1, name=u'Alice-1'),
>  Row(age=2, name=u'Alice-2'), Row(age=3, name=u'Alice-3'),
>  Row(age=4, name=u'Alice-4')]
> {code}
> Looking at the code, nothing significant needs to change in the 
> implementation; only the documentation and unit tests need updating.
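
The laziness the description relies on can be sketched in plain Python, without a Spark session. This is only an illustration: the helper name make_rows is hypothetical, and the report does not establish whether createDataFrame consumes the generator lazily or materializes it internally, so the efficiency gain remains an assumption.

```python
import itertools
import types


def make_rows(n):
    """Return a generator of row dicts; no row is built until it is consumed."""
    # Hypothetical helper, mirroring the generator expression in the report.
    return ({'name': 'Alice-{}'.format(i), 'age': i} for i in range(n))


rows = make_rows(10000000)

# 'rows' is a generator object, not a list holding 10000000 dicts.
is_generator = isinstance(rows, types.GeneratorType)

# Pull only the first five rows; the remaining rows are never created.
first_five = list(itertools.islice(rows, 5))

# With a real Spark session (not created here), the generator would be
# passed as in the report: sqlContext.createDataFrame(rows)
```

Because range starts at 0, the first row produced is Alice-0 with age 0.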



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
