[
https://issues.apache.org/jira/browse/SPARK-53917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Khakhlyuk updated SPARK-53917:
-----------------------------------
Attachment: image-2025-10-15-13-56-46-840.png
> [CONNECT] Supporting large LocalRelations
> -----------------------------------------
>
> Key: SPARK-53917
> URL: https://issues.apache.org/jira/browse/SPARK-53917
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.1.0
> Reporter: Alex Khakhlyuk
> Priority: Major
> Fix For: 4.1.0
>
> Attachments: image-2025-10-15-13-50-04-179.png,
> image-2025-10-15-13-50-44-333.png, image-2025-10-15-13-53-40-306.png,
> image-2025-10-15-13-56-46-840.png
>
>
> LocalRelation is a Catalyst logical operator used to represent a dataset of
> rows inline as part of the LogicalPlan. LocalRelations represent dataframes
> created directly from Python and Scala objects, e.g., Python and Scala lists,
> pandas dataframes, csv files loaded in memory, etc.
> In Spark Connect, local relations are transferred over gRPC using
> LocalRelation (for relations under 64MB) and CachedLocalRelation (larger
> relations over 64MB) messages.
> CachedLocalRelations currently have a hard size limit of 2GB, which means
> that spark users can’t execute queries with local client data, pandas
> dataframes, csv files of over 2GB.
> I propose removing this limit by introducing a new proto message for
> transferring large LocalRelations in chunks and by adding batch processing of
> the data both on the client (python and scala) and on the server.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]