Maksim Konstantinov created PARQUET-2430:
--------------------------------------------
Summary: Add ParquetJoiner feature
Key: PARQUET-2430
URL: https://issues.apache.org/jira/browse/PARQUET-2430
Project: Parquet
Issue Type: New Feature
Components: parquet-hadoop
Reporter: Maksim Konstantinov
*Overview*
The ParquetJoiner feature is similar to the ParquetRewriter class. ParquetRewriter
can stitch files with the same schema into a single file, while ParquetJoiner
should enable stitching files with different schemas into a single file. That
is possible when: 1) the number of rows in the main and extra files is the
same, and 2) the ordering of rows in the main and extra files is the same. The
main benefit of ParquetJoiner is performance: when you join/stitch terabytes or
petabytes of data, this seemingly simple low-level API can be up to 10x more
resource efficient.
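
To make the two preconditions concrete, here is a minimal sketch in plain Python. It is not the proposed Java API: `Table`, `row_count` and `stitch_columns` are hypothetical names, and a dict of column lists stands in for a Parquet file. It also assumes that an extra column with the same name as a main column replaces it, which the issue does not specify.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a Parquet file: column name -> values.
Table = Dict[str, List[object]]


def row_count(table: Table) -> int:
    """Return the number of rows, rejecting tables with ragged columns."""
    lengths = {len(values) for values in table.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    return lengths.pop() if lengths else 0


def stitch_columns(main: Table, extra: Table) -> Table:
    """Stitch columns positionally: row i of main pairs with row i of extra."""
    if row_count(main) != row_count(extra):
        raise ValueError("main and extra inputs must have the same number of rows")
    # Assumption: an extra column with an existing name overrides the main one.
    result = dict(main)
    result.update(extra)
    return result
```

Only the row-count condition can be checked this cheaply. The ordering condition has to be guaranteed by whatever produced the extra input, because no key is compared.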
*Implementation details*
ParquetJoiner lets you specify the main input Parquet file and the extra input
Parquet files. ParquetJoiner will copy the main input as binary data and write
the extra input files with their row groups adjusted to match those of the main
input. If the main input is much larger than the extra inputs, a lot of
resources will be saved by handling the main input as binary data.
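
The row-group adjustment can be sketched as a re-chunking step. The snippet below is an illustration in plain Python, not the actual implementation: `align_row_groups` is a hypothetical name, lists of rows stand in for row groups, and `main_sizes` is assumed to hold the row count of each row group in the main input.

```python
import itertools
from typing import Iterable, Iterator, List, Sequence

_END = object()  # sentinel, so that a row equal to None is not mistaken for "no more rows"


def align_row_groups(
    main_sizes: Sequence[int],
    extra_groups: Iterable[Sequence[object]],
) -> Iterator[List[object]]:
    """Re-chunk the extra input so that chunk i has main_sizes[i] rows."""
    # Stream the extra rows in file order, ignoring their original boundaries.
    rows = (row for group in extra_groups for row in group)
    for size in main_sizes:
        chunk = list(itertools.islice(rows, size))
        if len(chunk) != size:
            raise ValueError("extra input has fewer rows than the main input")
        yield chunk
    if next(rows, _END) is not _END:
        raise ValueError("extra input has more rows than the main input")
```

For example, `list(align_row_groups([2, 3], [[1], [2, 3, 4], [5]]))` gives `[[1, 2], [3, 4, 5]]`. Only the extra rows are touched here, which matches the idea that the main input's row groups stay untouched.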
*Use-case examples*
Consider a very large Parquet-based dataset (dozens or hundreds of fields,
terabytes of data daily, petabytes of historical partitions). The task is to
modify a column or add a new column for all the historical data. This is
trivial with Spark, but given the sheer scale of the dataset it would take a
lot of resources.
*Side notes*
Note that this class of problems could in theory be solved by storing the main
input and extra inputs in HMS/Iceberg bucketed tables and using a view that
joins those tables on the fly into the final version. In practice, however,
there is often a requirement to merge the Parquet files and have a single
Parquet source in the file system.
*Use-case implementation details using Apache Spark*
You can use Apache Spark to perform the join with ParquetJoiner. Read the large
main input and prepare the right side of the join so that each file on the left
has a corresponding file on the right, and the records on the right are in the
same order as on the left. The left and right inputs then have the same number
of files, the same number of records in corresponding files, and the same
ordering of records in each file pair. Then run ParquetJoiner in parallel for
each file pair to perform the join. [Code sample to be provided separately?]
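
Independently of any Spark code, the pairing and fan-out step can be sketched as follows. Everything here is an assumption made for illustration: `pair_files` and `join_all` are hypothetical helpers, file names are assumed to sort into corresponding order on both sides (for example `part-00000`, `part-00001`), `join_pair` is a placeholder for whatever invokes ParquetJoiner on one file pair, and a local thread pool stands in for the cluster's parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def pair_files(left: Sequence[str], right: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the i-th left file with the i-th right file after sorting both lists."""
    if len(left) != len(right):
        raise ValueError("left and right sides must have the same number of files")
    # Assumption: sorted names line up, e.g. part-00000 on both sides.
    return list(zip(sorted(left), sorted(right)))


def join_all(
    pairs: Sequence[Tuple[str, str]],
    join_pair: Callable[[str, str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run join_pair on every file pair in parallel and return results in pair order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with pairs.
        return list(pool.map(lambda pair: join_pair(*pair), pairs))
```

Each pair is independent of the others, so a failed pair can be retried on its own without redoing the rest.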