[
https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fabien updated SPARK-38111:
---------------------------
Description:
Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow
batches?
I have a pretty large dataset on my cluster, so I cannot collect it using
[collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--]
which downloads everything at once and saturates my JVM memory.
Seeing that Arrow is becoming a standard for transferring large datasets, and
that Spark already uses Arrow extensively, is there a way to transfer my Spark
dataframe as Arrow batches?
This would be ideal for processing the data batch by batch without saturating
memory.
I am looking for an API like this (in Java):
{code:java}
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // do some stuff with the Arrow batch
}
{code}
It would be even better if I could split the dataframe into several streams so
that the batches can be downloaded and processed in parallel.
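Until such an API exists, a partial workaround is
[toLocalIterator|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#toLocalIterator--],
which streams rows to the driver one partition at a time instead of
materializing the whole result like {{collectAsList}}. It yields {{Row}}
objects rather than Arrow batches, but driver memory stays bounded by the
largest partition. A minimal sketch (it assumes an existing {{SparkSession}}
and a {{Dataset<Row> dataframe}}; the per-row processing is illustrative):
{code:java}
import org.apache.spark.sql.Row;

// toLocalIterator() fetches one partition at a time, so the driver only
// needs enough memory for the largest partition, not the whole dataset.
java.util.Iterator<Row> it = dataframe.toLocalIterator();
while (it.hasNext()) {
    Row row = it.next();
    // process each row here
}
{code}
For parallel processing without downloading to the driver at all,
{{foreachPartition}} or {{mapPartitions}} run the per-partition logic on the
executors instead; that covers the "several streams in parallel" case when the
processing itself can happen cluster-side.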
> Retrieve a Spark dataframe as Arrow batches
> -------------------------------------------
>
> Key: SPARK-38111
> URL: https://issues.apache.org/jira/browse/SPARK-38111
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 3.2.0
> Environment: Java 11
> Spark 3
> Reporter: Fabien
> Priority: Minor
>
> Using the Java API, is there a way to efficiently retrieve a dataframe as
> Arrow batches?
> I have a pretty large dataset on my cluster, so I cannot collect it using
> [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--]
> which downloads everything at once and saturates my JVM memory.
> Seeing that Arrow is becoming a standard for transferring large datasets, and
> that Spark already uses Arrow extensively, is there a way to transfer my Spark
> dataframe as Arrow batches?
> This would be ideal for processing the data batch by batch without saturating
> memory.
>
> I am looking for an API like this (in Java):
>
> {code:java}
> var stream = dataframe.collectAsArrowStream();
> while (stream.hasNextBatch()) {
>     var batch = stream.getNextBatch();
>     // do some stuff with the Arrow batch
> }
> {code}
> It would be even better if I could split the dataframe into several streams
> so that the batches can be downloaded and processed in parallel.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]