[PR] [WIP][Proposal] PARQUET-2430: Add parquet joiner [parquet-mr]

via GitHub Mon, 12 Feb 2024 17:18:22 -0800


MaxNevermind opened a new pull request, #1273:
URL: https://github.com/apache/parquet-mr/pull/1273


   **_This PR is Work In Progress. More code changes will be added and 
description will be extended._**
   
   
   ## Overview
   The ParquetJoiner feature is similar to the ParquetRewriter class. 
ParquetRewriter stitches files with the same schema into a single file, while 
ParquetJoiner should enable stitching files with different schemas into a 
single file. That is possible when: 1) the number of rows in the main and extra 
files is the same, 2) the ordering of rows in the main and extra files is the 
same. The main benefit of ParquetJoiner is performance: when you join/stitch 
terabytes or petabytes of data, this seemingly simple low-level API can be up 
to 10x more resource efficient. 
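
   To make the two preconditions concrete, here is a minimal in-memory sketch 
of positional stitching. It is a toy model, not the parquet-mr API: the class 
name `PositionalStitchSketch`, the `stitch` method, and the rule that an extra 
column overwrites a main column of the same name are all assumptions made for 
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of positional stitching; hypothetical, not the parquet-mr API. */
final class PositionalStitchSketch {

    /** Row i of the result is row i of main plus the columns of row i of extra. */
    static List<Map<String, Object>> stitch(
            List<Map<String, Object>> main, List<Map<String, Object>> extra) {
        // Precondition 1: both inputs hold the same number of rows.
        if (main.size() != extra.size()) {
            throw new IllegalArgumentException(
                    "Row count mismatch: main=" + main.size() + ", extra=" + extra.size());
        }
        // Precondition 2 (same row ordering) cannot be verified without a key,
        // so the caller has to guarantee it.
        List<Map<String, Object>> out = new ArrayList<>(main.size());
        for (int i = 0; i < main.size(); i++) {
            Map<String, Object> row = new LinkedHashMap<>(main.get(i));
            // Assumption: an extra column with the same name replaces the main one.
            row.putAll(extra.get(i));
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> mainRows = List.of(Map.of("id", 1), Map.of("id", 2));
        List<Map<String, Object>> extraRows = List.of(Map.of("score", 0.5), Map.of("score", 0.9));
        System.out.println(stitch(mainRows, extraRows));
    }
}
```

   No join key is compared anywhere, which is where the savings over a regular 
join would come from, and also why a violated ordering precondition would 
silently produce wrong rows.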
   
   ## Implementation details
   ParquetJoiner lets you specify the main input Parquet file and extra input 
Parquet files. ParquetJoiner will copy the main input as binary data and write 
the extra input files with row groups adjusted to match the main input. If the 
main input is much larger than the extra inputs, a lot of resources will be 
saved by working with the main input as binary.
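
   In the Parquet format a row group holds one column chunk per column for the 
same set of rows, so if the main row groups are copied as binary, the extra 
columns have to be re-cut to identical row counts. A minimal sketch of that 
boundary computation follows. The class `RowGroupAlignmentSketch` and the 
method `alignToMain` are hypothetical names, and the row counts are assumed to 
have been read from the file footers beforehand.

```java
/** Hypothetical helper: computes where the extra input must be cut to mirror the main input. */
final class RowGroupAlignmentSketch {

    /**
     * Returns one {startRow, endRowExclusive} pair per main row group,
     * expressed as row offsets into the extra input.
     */
    static long[][] alignToMain(long[] mainRowGroupRowCounts, long extraTotalRows) {
        long mainTotalRows = 0;
        for (long count : mainRowGroupRowCounts) {
            mainTotalRows += count;
        }
        // The joiner precondition: both inputs hold the same number of rows.
        if (mainTotalRows != extraTotalRows) {
            throw new IllegalArgumentException(
                    "Row count mismatch: main=" + mainTotalRows + ", extra=" + extraTotalRows);
        }
        long[][] ranges = new long[mainRowGroupRowCounts.length][2];
        long start = 0;
        for (int i = 0; i < mainRowGroupRowCounts.length; i++) {
            ranges[i][0] = start;
            start += mainRowGroupRowCounts[i];
            ranges[i][1] = start;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Main input has row groups of 3 and 2 rows; extra input has 5 rows in total.
        for (long[] range : alignToMain(new long[] {3, 2}, 5)) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

   Only the extra input is decoded and re-encoded along these boundaries. The 
main input's row groups stay untouched, which is the source of the savings 
described above.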
   
   ## Use-case examples
   A very large Parquet-based dataset (dozens or hundreds of fields, terabytes 
of data daily, petabytes of historical partitions). The task is to modify a 
column or add a new column to it for all the historical data. That is trivial 
using Spark, but given the sheer scale of the dataset it will take a lot of 
resources. 
   
   ### Side notes
   Note that this class of problems could in theory be solved by storing the 
main input and extra inputs in HMS/Iceberg bucketed tables and using a view 
that joins those tables on the fly into the final version, but in practice 
there is often a requirement to merge the Parquet files and have a single 
Parquet source in the file system.
   
   ## Use-case implementation details using Apache Spark
   You can use Apache Spark to perform the join with ParquetJoiner. Read the 
large main input and prepare the right side of the join so that each file on 
the left has a corresponding file on the right, with records on the right in 
the same order as on the left. The left and right inputs then have the same 
number of files, the same number of records in corresponding files, and the 
same ordering of records in each file pair. Then run ParquetJoiner in parallel 
for each file pair to perform the join. [Code sample to be provided separately?]
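
   The pairing step can be sketched in plain Java. This is only an 
illustration under a loud assumption: that the Spark job writes each 
right-side file under the same file name as its left-side counterpart. The 
class `FilePairingSketch`, the method `pairByFileName`, and the example paths 
are hypothetical, and the actual joiner call is left out because its API is 
still work in progress.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: pairs each main file with the extra file of the same name. */
final class FilePairingSketch {

    private static String fileName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Returns main path -> extra path, failing fast if the two sides do not line up. */
    static Map<String, String> pairByFileName(List<String> mainPaths, List<String> extraPaths) {
        if (mainPaths.size() != extraPaths.size()) {
            throw new IllegalArgumentException(
                    "File count mismatch: main=" + mainPaths.size() + ", extra=" + extraPaths.size());
        }
        Map<String, String> extraByName = new HashMap<>();
        for (String extra : extraPaths) {
            if (extraByName.put(fileName(extra), extra) != null) {
                throw new IllegalArgumentException("Duplicate extra file name: " + fileName(extra));
            }
        }
        Map<String, String> pairs = new TreeMap<>();
        for (String main : mainPaths) {
            // remove() ensures every extra file is matched by at most one main file.
            String extra = extraByName.remove(fileName(main));
            if (extra == null) {
                throw new IllegalArgumentException("No unmatched extra file for " + main);
            }
            pairs.put(main, extra);
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = pairByFileName(
                List.of("main/part-00001.parquet", "main/part-00000.parquet"),
                List.of("extra/part-00000.parquet", "extra/part-00001.parquet"));
        // Each pair is independent, so the joiner could be run on them in parallel.
        pairs.forEach((main, extra) -> System.out.println(main + " + " + extra));
    }
}
```

   A per-pair row count check against the file footers would be a natural 
addition before joining, since matching file names alone do not guarantee 
matching record counts or ordering.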
   
    
   
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references
     them in the PR title. For example, "PARQUET-1234: My Parquet PR"
       - https://issues.apache.org/jira/browse/PARQUET-XXX
       - In case you are adding a dependency, check if the license complies with
         the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines
     from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       1. Subject is limited to 50 characters (not including Jira issue 
reference)
       1. Subject does not end with a period
       1. Subject uses the imperative mood ("add", not "adding")
       1. Body wraps at 72 characters
       1. Body explains "what" and "why", not "how"
   
   ### Style
   - [ ] My contribution adheres to the code style guidelines and Spotless 
passes.
       - To apply the necessary changes, run `mvn spotless:apply 
-Pvector-plugins`
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
       - All the public functions and classes in the PR contain Javadoc that 
explains what they do
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

