[PR] [spark] Add export parquet procedure [paimon]

via GitHub Thu, 14 May 2026 01:44:24 -0700


MonsterChenzhuo opened a new pull request, #7853:
URL: https://github.com/apache/paimon/pull/7853


   ### Purpose
   
   This PR adds a Spark procedure `sys.export_parquet` to export Paimon table 
data directly to an external Parquet directory.
   
   The target scenario is exporting offline feature tables with extremely wide 
schemas, for example 20k+ columns, into an object storage directory:
   
   ```text
   s3://bucket/export/job_id/
     part-00000.parquet
     part-00001.parquet
     ...
     _SUCCESS
   ```
   
   A Spark SQL/DataFrame write for this kind of export can spend a long 
time in the Spark Catalyst optimizer when building a very wide Project, especially 
in repeated constraint / expression set computation. This procedure avoids 
building a wide Spark SQL projection: it reads Paimon splits directly and 
writes Parquet files from Spark tasks.
   
   This is similar in spirit to a bulkload/export utility: it uses Spark only 
as the distributed execution engine, while the row filtering, column 
projection, split reading, and Parquet writing are handled directly through 
Paimon APIs.
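
The output layout described above can be illustrated with a minimal sketch. It assumes one part file per task and a `_SUCCESS` marker written only after every task has finished, which is the usual Hadoop-style convention and not something this description confirms. The names `write_part` and `export`, the list-of-rows stand-in for a split, and the plain-text payload are all hypothetical; the sketch uses neither Paimon nor Parquet APIs and is not the PR's implementation.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_part(out_dir: Path, task_id: int, rows: list) -> Path:
    """Write one task's rows to its own part file (placeholder payload, not real Parquet)."""
    path = out_dir / f"part-{task_id:05d}.parquet"
    # A real task would stream rows from its split into a Parquet writer here.
    path.write_text("\n".join(map(str, rows)))
    return path


def export(splits: list, out_dir: Path) -> list:
    """Run one task per split, then write _SUCCESS once every task has finished."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(write_part, out_dir, i, rows) for i, rows in enumerate(splits)]
        parts = [f.result() for f in futures]  # re-raises any task failure
    (out_dir / "_SUCCESS").touch()  # reached only if no task raised
    return parts


# Hypothetical input: three "splits", each a small list of rows.
splits = [[1, 2], [3], [4, 5, 6]]
out_dir = Path(tempfile.mkdtemp()) / "job_id"
parts = export(splits, out_dir)
names = sorted(p.name for p in out_dir.iterdir())
```

Writing the marker last means a downstream reader can treat its presence as a sign that all part files are complete, under the assumption stated above.
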
   
   ### Tests
   
paimon-spark/paimon-spark-ut/src/test/scala/org/apache/paimon/spark/procedure/ExportParquetProcedureTest.scala


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
