Maksim Konstantinov created PARQUET-2430:
--------------------------------------------

             Summary: Add ParquetJoiner feature
                 Key: PARQUET-2430
                 URL: https://issues.apache.org/jira/browse/PARQUET-2430
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-hadoop
            Reporter: Maksim Konstantinov


*Overview*
The ParquetJoiner feature is similar to the ParquetRewriter class. 
ParquetRewriter can stitch files with the same schema into a single file, while 
ParquetJoiner should enable stitching files with different schemas into a 
single file. That is possible when: 1) the number of rows in the main and extra 
files is the same, and 2) the ordering of rows in the main and extra files is 
the same. The main benefit of ParquetJoiner is performance: when you 
join/stitch terabytes or petabytes of data, this seemingly simple low-level API 
can be up to 10x more resource efficient. 
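
As an illustration of the two preconditions above, here is a minimal sketch in 
plain Python. It is not the ParquetJoiner API: the function name stitch_rows 
and the in-memory list-of-dicts row model are hypothetical. It also assumes 
that when a column exists in both inputs, the value from the extra input wins 
(the "modify a column" case); the issue text does not specify that behaviour.

```python
from typing import Dict, List

Row = Dict[str, object]


def stitch_rows(main: List[Row], extra: List[Row]) -> List[Row]:
    """Combine row i of `main` with row i of `extra` (a positional join).

    Hypothetical helper: no key lookup or shuffle is needed, which is only
    valid because both inputs have the same row count and the same ordering.
    """
    if len(main) != len(extra):
        raise ValueError(
            f"row count mismatch: main={len(main)}, extra={len(extra)}"
        )
    # Assumption: on a column name clash the extra input overrides the main one.
    return [{**m, **e} for m, e in zip(main, extra)]
```

The row-count condition can be checked mechanically, but the ordering condition 
cannot: if the extra rows are in a different order, the result is silently 
wrong, so the ordering has to be guaranteed by whatever produces the inputs.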

*Implementation details*
ParquetJoiner lets you specify the main input Parquet file and the extra input 
Parquet files. ParquetJoiner will copy the main input as binary data and write 
the extra input files with their row groups adjusted to match the main input. 
If the main input is much larger than the extra inputs, a lot of resources will 
be saved by handling the main input as binary data.
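
One way to picture "row groups adjusted to the main input" is as a re-chunking 
plan: the main row-group boundaries stay fixed, and rows from the extra input 
are sliced to fit them. The sketch below is an assumption about how such a plan 
could be computed, not the proposed implementation; align_row_groups and its 
(group index, start row, end row) tuples are hypothetical names.

```python
from typing import List, Tuple

# (extra row-group index, first row inside that group, end row exclusive)
Slice = Tuple[int, int, int]


def align_row_groups(
    main_sizes: List[int], extra_sizes: List[int]
) -> List[List[Slice]]:
    """For each main row group, list the extra-input slices that fill it."""
    if sum(main_sizes) != sum(extra_sizes):
        raise ValueError("main and extra inputs must have the same row count")

    plan: List[List[Slice]] = []
    group = 0   # extra row group currently being consumed
    offset = 0  # rows already consumed from that group
    for size in main_sizes:
        needed = size
        slices: List[Slice] = []
        while needed > 0:
            available = extra_sizes[group] - offset
            if available == 0:
                # Current extra row group is exhausted, move to the next one.
                group += 1
                offset = 0
                continue
            take = min(needed, available)
            slices.append((group, offset, offset + take))
            offset += take
            needed -= take
        plan.append(slices)
    return plan
```

For example, with main row groups of 3 and 2 rows and extra row groups of 4 and 
1 rows, the second main row group is filled from the last row of the first 
extra group plus the only row of the second extra group.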

*Use-case examples*
Consider a very large Parquet-based dataset (dozens or hundreds of fields, 
terabytes of data daily, petabytes of historical partitions). The task is to 
modify a column or add a new column for all the historical data. This is 
trivial using Spark, but given the sheer scale of the dataset it will take a 
lot of resources to do that. 

*Side notes*
Note that this class of problems could in theory be solved by storing the main 
input and extra inputs in HMS/Iceberg bucketed tables and using a view that 
joins those tables on the fly into the final version. In practice, however, 
there is often a requirement to merge the Parquet files and have a single 
Parquet source in the file system.

*Use-case implementation details using Apache Spark*
You can use Apache Spark to prepare the inputs for ParquetJoiner. Read the 
large main input (the left side) and prepare the right side of the join so that 
each file on the left has a corresponding file on the right, and so that the 
records on the right are in the same order as on the left. As a result, the 
left and right inputs have the same number of files, the same number of records 
in corresponding files, and the same ordering of records in each file pair. 
Then run ParquetJoiner in parallel for each file pair to perform the join. 
[Code sample to be provided separately?]
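
Independently of Spark, the pairing and validation step described above could 
be sketched as follows. This is plain Python with hypothetical names 
(pair_files, join_all, join_pair), and it assumes that corresponding files on 
the left and right share the same file name and that per-file record counts are 
already known, for example from the Parquet footers. The join_pair callable 
stands in for whatever actually invokes ParquetJoiner on one file pair.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def pair_files(
    main_counts: Dict[str, int], extra_counts: Dict[str, int]
) -> List[str]:
    """Check that both sides have the same files and per-file record counts.

    Both arguments map a file name to its record count. Returns the sorted
    file names that are safe to join pair by pair.
    """
    if main_counts.keys() != extra_counts.keys():
        unmatched = sorted(main_counts.keys() ^ extra_counts.keys())
        raise ValueError(f"files without a counterpart: {unmatched}")
    for name in sorted(main_counts):
        if main_counts[name] != extra_counts[name]:
            raise ValueError(
                f"{name}: main has {main_counts[name]} records, "
                f"extra has {extra_counts[name]}"
            )
    return sorted(main_counts)


def join_all(
    names: List[str], join_pair: Callable[[str], T], workers: int = 4
) -> List[T]:
    """Run `join_pair` once per file pair in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(join_pair, names))
```

Validating every pair before starting any join keeps a mismatch from being 
discovered halfway through a long-running batch. Record ordering inside each 
pair still has to be guaranteed by the job that prepared the right side.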

 



