MaxNevermind opened a new pull request, #1273: URL: https://github.com/apache/parquet-mr/pull/1273
**_This PR is a work in progress. More code changes will be added and the description will be extended._**

## Overview

The ParquetJoiner feature is similar to the ParquetRewriter class. ParquetRewriter stitches files with the same schema into a single file, while ParquetJoiner should enable stitching files with different schemas into a single file. That is possible when:

1) the number of rows in the main and extra files is the same,
2) the ordering of rows in the main and extra files is the same.

The main benefit of ParquetJoiner is performance. When you join/stitch terabytes or petabytes of data, this seemingly simple low-level API can be up to 10x more resource efficient.

## Implementation details

ParquetJoiner lets you specify the main input Parquet file and the extra input Parquet files. It copies the main input as binary data and writes the extra input files with their row groups adjusted to match the main input. If the main input is much larger than the extra inputs, a lot of resources are saved by handling the main input as binary.

## Use-case examples

Consider a very large Parquet-based dataset (dozens or hundreds of fields, terabytes of data daily, petabytes of historical partitions). The task is to modify a column, or add a new column, for all of the historical data. This is trivial to express in Spark, but given the sheer scale of the dataset it takes a lot of resources to do.

### Side notes

This class of problems could in theory be solved by storing the main input and the extra inputs in HMS/Iceberg bucketed tables and using a view that joins those tables on the fly into the final version. In practice, however, there is often a requirement to merge the Parquet files and keep a single Parquet source in the file system.
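To make the "row groups adjusted to the main input" idea from the implementation details concrete, here is a minimal sketch in plain Java. It does not use the Parquet API and is not the code in this PR: the class `RowGroupAlignmentSketch`, the `Slice` type, and the `align` method are hypothetical names invented for illustration. It only assumes the two preconditions listed in the overview (same row count, same row ordering) and shows how the rows of an extra file could be re-chunked to follow the row-group boundaries of the main file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of parquet-mr.
final class RowGroupAlignmentSketch {

    /** A contiguous run of rows taken from one row group of the extra file. */
    static final class Slice {
        final int extraRowGroup; // index of the source row group in the extra file
        final long startRow;     // first row to read, relative to that row group
        final long rowCount;     // number of rows to read

        Slice(int extraRowGroup, long startRow, long rowCount) {
            this.extraRowGroup = extraRowGroup;
            this.startRow = startRow;
            this.rowCount = rowCount;
        }

        @Override
        public String toString() {
            return "extraRowGroup=" + extraRowGroup + " rows[" + startRow + ", " + (startRow + rowCount) + ")";
        }
    }

    /**
     * For each row group of the main file, lists the slices of the extra file
     * that must be written so the output row group has the same row count.
     */
    static List<List<Slice>> align(long[] mainRowGroupSizes, long[] extraRowGroupSizes) {
        long mainTotal = 0;
        for (long size : mainRowGroupSizes) mainTotal += size;
        long extraTotal = 0;
        for (long size : extraRowGroupSizes) extraTotal += size;
        // Precondition 1 from the overview: both files hold the same number of rows.
        if (mainTotal != extraTotal) {
            throw new IllegalArgumentException(
                "Row count mismatch: main=" + mainTotal + ", extra=" + extraTotal);
        }

        List<List<Slice>> plan = new ArrayList<>();
        int extraIdx = 0;      // current row group in the extra file
        long extraOffset = 0;  // rows already consumed from that row group
        for (long mainSize : mainRowGroupSizes) {
            List<Slice> slices = new ArrayList<>();
            long remaining = mainSize;
            while (remaining > 0) {
                long available = extraRowGroupSizes[extraIdx] - extraOffset;
                if (available == 0) { // current extra row group is exhausted
                    extraIdx++;
                    extraOffset = 0;
                    continue;
                }
                long take = Math.min(remaining, available);
                slices.add(new Slice(extraIdx, extraOffset, take));
                extraOffset += take;
                remaining -= take;
            }
            plan.add(slices);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<List<Slice>> plan = align(new long[] {100, 200, 50}, new long[] {150, 200});
        for (int i = 0; i < plan.size(); i++) {
            System.out.println("main row group " + i + " <- " + plan.get(i));
        }
    }
}
```

With main row groups of 100, 200 and 50 rows and extra row groups of 150 and 200 rows, the second main row group is assembled from the last 50 rows of extra row group 0 and the first 150 rows of extra row group 1. The main file's row groups are never re-chunked in this scheme, which is what would allow them to be copied as binary data.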
## Use-case implementation details using Apache Spark

You can use Apache Spark to drive the join with ParquetJoiner. Read the large main input (the left side) and prepare the right side of the join so that each file on the left has a corresponding file on the right, with the records on the right in the same order as on the left. As a result, the left and right inputs have the same number of files, the same number of records in corresponding files, and the same ordering of records in each file pair. Then run ParquetJoiner in parallel, once per file pair, to perform the join.

[Code sample to be provided separately?]

Make sure you have checked _all_ steps below.

### Jira

- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests

- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits

- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Style

- [ ] My contribution adheres to the code style guidelines and Spotless passes.
  - To apply the necessary changes, run `mvn spotless:apply -Pvector-plugins`

### Documentation

- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explains what they do
