[jira] [Created] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB

Hossein Falaki (JIRA) Wed, 05 Oct 2016 14:00:31 -0700

Hossein Falaki created SPARK-17790:
--------------------------------------

             Summary: Support for parallelizing/creating DataFrame on data larger than 2GB
                 Key: SPARK-17790
                 URL: https://issues.apache.org/jira/browse/SPARK-17790
             Project: Spark
          Issue Type: Story
          Components: SparkR
    Affects Versions: 2.0.1
            Reporter: Hossein Falaki



This issue is a more specific version of SPARK-17762. 
Supporting arguments larger than 2GB is more general and arguably harder to do, 
because the limit exists in both R and the JVM (we receive the data as a 
ByteArray). However, to support parallelizing R data.frames that are larger 
than 2GB, we can do what PySpark does.

PySpark uses files to transfer bulk data between Python and the JVM. This 
approach has worked well for the large community of Spark Python users. 
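
To illustrate the general idea of file-based transfer, here is a minimal 
sketch in Python. It is not PySpark's actual implementation or wire format; 
the function names (write_batches, read_batches, demo), the 4-byte length 
prefix, and the use of pickle are all assumptions made for illustration. The 
point is only that the producer writes many small frames to a temporary file 
and the consumer streams them back, so neither side has to hold the whole 
dataset in one byte array.

```python
import os
import pickle
import tempfile
from typing import Iterable, Iterator, List


def _write_frame(f, batch: List[tuple]) -> None:
    # Hypothetical framing: 4-byte big-endian length, then the pickled batch.
    # Each frame must stay under 4GB; the file as a whole has no such limit.
    payload = pickle.dumps(batch)
    f.write(len(payload).to_bytes(4, "big"))
    f.write(payload)


def write_batches(rows: Iterable[tuple], path: str, batch_size: int = 1000) -> int:
    """Write rows to `path` in batches; return the number of frames written."""
    frames = 0
    batch: List[tuple] = []
    with open(path, "wb") as f:
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                _write_frame(f, batch)
                frames += 1
                batch = []
        if batch:  # flush the final, possibly partial batch
            _write_frame(f, batch)
            frames += 1
    return frames


def read_batches(path: str) -> Iterator[List[tuple]]:
    """Stream batches back from `path`, one frame at a time."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:  # end of file
                return
            size = int.from_bytes(header, "big")
            yield pickle.loads(f.read(size))


def demo():
    # Small stand-in for a large data.frame: 2500 rows, 1000 rows per frame.
    rows = [(i, str(i)) for i in range(2500)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        frames = write_batches(rows, path, batch_size=1000)
        restored = [row for batch in read_batches(path) for row in batch]
    finally:
        os.remove(path)
    return frames, restored
```

In this sketch the reading side would play the role of the JVM: it consumes 
one frame at a time, so its peak memory use depends on the batch size rather 
than on the total size of the data.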




