[jira] [Created] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches

Fabien (Jira) Fri, 04 Feb 2022 09:16:07 -0800

Fabien created SPARK-38111:
------------------------------

             Summary: Retrieve a Spark dataframe as Arrow batches
                 Key: SPARK-38111
                 URL: https://issues.apache.org/jira/browse/SPARK-38111
             Project: Spark
          Issue Type: Question
          Components: Java API
    Affects Versions: 3.2.0
         Environment: Java 11, Spark 3
            Reporter: Fabien


Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow 
batches?

I have a fairly large dataset on my cluster, so I cannot collect it using 
[collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
which downloads everything at once and saturates my JVM memory.

Seeing that Arrow is becoming a standard for transferring large datasets and 
that Spark already uses Arrow a lot, is there a way to transfer my Spark 
dataframe as Arrow batches?

This would be ideal for processing the data batch by batch without saturating 
the memory.
 

I am looking for an API like this (in Java):

 
{code:java}
// Wished-for API: collectAsArrowStream(), hasNextBatch() and getNextBatch() do not exist today
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // do some stuff with the arrow batch
}
{code}

It would be even better if I could split the dataframe into several streams so 
that I can download and process them in parallel.
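
For comparison, the closest thing that seems reachable today is sketched below. It assumes the rows come from Dataset.toLocalIterator(), which fetches one partition at a time rather than the whole dataset, and it groups them into fixed-size batches. BatchingIterator is a hypothetical helper, not a Spark or Arrow class, and the batches are plain Java lists, not Arrow record batches, so this only covers the "batch by batch" part of the request. A plain list of integers stands in for the Spark rows so that the snippet runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of Spark or Arrow): groups a row iterator into fixed-size batches. */
final class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> rows;
    private final int batchSize;

    BatchingIterator(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        // A batch is available as long as at least one row is left.
        return rows.hasNext();
    }

    @Override
    public List<T> next() {
        if (!rows.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most batchSize rows; the last batch may be smaller.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // Stand-in for dataframe.toLocalIterator(); integers replace Spark rows here.
        Iterator<Integer> rows = List.of(1, 2, 3, 4, 5).iterator();
        BatchingIterator<Integer> batches = new BatchingIterator<>(rows, 2);
        while (batches.hasNext()) {
            List<Integer> batch = batches.next();
            // do some stuff with the batch
            System.out.println(batch);
        }
    }
}
```

Only one batch is held by the helper at a time, but according to the Dataset documentation the underlying iterator can still consume as much memory as the largest partition. The rows also arrive as deserialized objects, so the columnar Arrow representation asked for above is not preserved.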



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
