Hi Gang,
I actually started working on this feature by trying to extend ParquetRewriter with that new capability, but I quickly ran into issues: - Many state holder variables in ParquetRewriter will have to be duplicated for the left / right side of the join - Some methods(ex: nullifyColumn) are very close but not the same for joiner and will require more branching in the existing codebase of ParquetRewriter - Tests for a Join part will have to pay a special attention to the right side of the join as it is a new thing in a joiner so it will blow out test class quite a bit To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while possible will potentially make codebase too complex and hard to reason about. I thought about it, maybe we can utilize Factory method & Builder patterns for this? For example we can: - Unify Options(RewriteOptions/JoinOptions) into a single class, if one of the final implementation is not supporting a certain feature it should throw exception during construction - Use Factory pattern approach and pick the actual final implementation of the class based on provided options - Both ParquetRewriter & ParquetJoiner will implement a new Interface that has processBlocks() & close() public method - Use Builder pattern approach and make all methods including constructors private besides those that need to be exposed to users By using this approach we can simplify internal implementation by dividing it into separate dedicated smaller sub-modules while still providing a single feature rich external API. Let me know what you think. Also I might be wrong but I’ve noticed a few potential issues with ParquetRewriter: - not sure if that is a bug in but is not we supposed to consume() on the reader, here is the place in ParquetJoiner <https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345> you can find the same place in ParquetRewriter <https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771> ? - newSchema() & extractField() methods seem to have a small issue with complex nested schemas, here is the line that is different in ParquetRewriter <https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815> and here is ParquetJoiner <https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390>. ParquetRewriter’s version originally failed in my tests, it originally had a different schema, I will try to reproduce it later, right now it works, I need to go back and check the commit history. btw I wanted to start covering ParquetJoiner with extensive tests the next week. Also we built POC for one of our projects based on the current version of ParquetJoiner and it showed up to x10 improvement in performance in comparison with a default join implementation using Spark. Max. On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <ust...@gmail.com> wrote: > Hi Max, > > Thanks for proposing the joiner! I simply took a glimpse of the PR and > it looks promising to me. My general question is on the possibility of > consolidating the work with ParquetRewriter, which shares a lot of > common rewriting logic. > > Best, > Gang > > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov < > konstantinov.ma...@gmail.com> wrote: > > > Hi Parquet dev team! > > > > > > I wanted to ask your opinion on the proposal I came up with. > > PR: https://github.com/apache/parquet-mr/pull/1273 > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430 > > PR's description and JIRA ticket contains all the details, please check > it > > out. The feature is not yet ready to merge, it is just a proposal for > now. > > I wanted to ask a PARQUET community opinion if you see any obstacles for > > adding it? We find it very useful and plan to use it and if PARQUET > > community finds no issues with it I can add tests, javadocs and polish it > > so we can add this new feature to PARQUET. > > > > > > Max. > > >