MonsterChenzhuo opened a new pull request, #7853:
URL: https://github.com/apache/paimon/pull/7853
### Purpose
This PR adds a Spark procedure `sys.export_parquet` to export Paimon table
data directly to an external Parquet directory.
The target scenario is exporting offline feature tables with extremely wide
schemas, for example 20k+ columns, into an object storage directory:
```text
s3://bucket/export/job_id/
  part-00000.parquet
  part-00001.parquet
  ...
  _SUCCESS
```
A Spark SQL/DataFrame write for this kind of export can spend a long time in
the Spark Catalyst optimizer while it builds a very wide `Project`, especially
in repeated constraint / expression-set computation. This procedure avoids
building a wide Spark SQL projection: it reads Paimon splits directly and
writes Parquet files from Spark tasks.
This is similar in spirit to a bulkload/export utility: it uses Spark only
as the distributed execution engine, while the row filtering, column
projection, split reading, and Parquet writing are handled directly through
Paimon APIs.
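
As a rough illustration of the output layout described above, here is a minimal, standard-library-only Java sketch. It is not the code in this PR and does not use Paimon or Spark: the class `ExportLayoutSketch` and its methods are hypothetical, in-memory lists of rows stand in for Paimon splits, plain text lines stand in for Parquet encoding, and a sequential loop stands in for Spark tasks. It also assumes one part file per split and that `_SUCCESS` is written only after every part file, following the usual Hadoop-style marker convention; the PR may behave differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from Paimon or from the PR. */
final class ExportLayoutSketch {

    /** Formats the part file name for a task index, zero-padded to five digits. */
    static String partFileName(int taskIndex) {
        return String.format("part-%05d.parquet", taskIndex);
    }

    /** Keeps only the requested columns of a row, in the requested order. */
    static List<String> project(List<String> row, int[] columns) {
        List<String> out = new ArrayList<>(columns.length);
        for (int c : columns) {
            out.add(row.get(c));
        }
        return out;
    }

    /**
     * Writes one part file per split, then creates the _SUCCESS marker last.
     * Assumption: each loop iteration corresponds to one distributed task.
     */
    static List<Path> export(List<List<List<String>>> splits, int[] columns, Path targetDir)
            throws IOException {
        Files.createDirectories(targetDir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < splits.size(); i++) {
            List<String> lines = new ArrayList<>();
            for (List<String> row : splits.get(i)) {
                // Column projection is applied per row, with no SQL plan involved.
                lines.add(String.join(",", project(row, columns)));
            }
            Path part = targetDir.resolve(partFileName(i));
            Files.write(part, lines, StandardCharsets.UTF_8);
            written.add(part);
        }
        // The marker is created only after all part files exist.
        Files.createFile(targetDir.resolve("_SUCCESS"));
        return written;
    }

    public static void main(String[] args) throws IOException {
        List<List<List<String>>> splits = List.of(
                List.of(List.of("1", "a", "x"), List.of("2", "b", "y")),
                List.of(List.of("3", "c", "z")));
        Path dir = Files.createTempDirectory("export-sketch");
        for (Path p : export(splits, new int[] {0, 2}, dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Because the projection is applied row by row inside each task, the cost of selecting columns grows with the data rather than with the size of a query plan, which is the property the procedure relies on for very wide schemas.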
### Tests
paimon-spark/paimon-spark-ut/src/test/scala/org/apache/paimon/spark/procedure/ExportParquetProcedureTest.scala