HI Max, This is a very interesting feature with a great idea!
Xinli On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov < konstantinov.ma...@gmail.com> wrote: > Hi Gang, > > > I actually started working on this feature by trying to extend > ParquetRewriter with that new capability, but I quickly ran into issues: > - Many state holder variables in ParquetRewriter will have to be duplicated > for the left / right side of the join > - Some methods(ex: nullifyColumn) are very close but not the same for > joiner and will require more branching in the existing codebase of > ParquetRewriter > - Tests for a Join part will have to pay a special attention to the right > side of the join as it is a new thing in a joiner so it will blow out test > class quite a bit > To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while > possible will potentially make codebase too complex and hard to reason > about. > > I thought about it, maybe we can utilize Factory method & Builder patterns > for this? For example we can: > - Unify Options(RewriteOptions/JoinOptions) into a single class, if one of > the final implementation is not supporting a certain feature it should > throw exception during construction > - Use Factory pattern approach and pick the actual final implementation of > the class based on provided options > - Both ParquetRewriter & ParquetJoiner will implement a new Interface that > has processBlocks() & close() public method > - Use Builder pattern approach and make all methods including constructors > private besides those that need to be exposed to users > By using this approach we can simplify internal implementation by dividing > it into separate dedicated smaller sub-modules while still providing a > single feature rich external API. Let me know what you think. > > Also I might be wrong but I’ve noticed a few potential issues with > ParquetRewriter: > - not sure if that is a bug in but is not we supposed to consume() on the > reader, here is the place in ParquetJoiner > < > https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345 > > > you can find the same place in ParquetRewriter > < > https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771 > > > ? > - newSchema() & extractField() methods seem to have a small issue with > complex nested schemas, here is the line that is different in > ParquetRewriter > < > https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815 > > > and here is ParquetJoiner > < > https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390 > >. > ParquetRewriter’s version originally failed in my tests, it originally had > a different schema, I will try to reproduce it later, right now it works, I > need to go back and check the commit history. > > btw > I wanted to start covering ParquetJoiner with extensive tests the next > week. > Also we built POC for one of our projects based on the current version of > ParquetJoiner and it showed up to x10 improvement in performance in > comparison with a default join implementation using Spark. > > > Max. > > On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <ust...@gmail.com> wrote: > > > Hi Max, > > > > Thanks for proposing the joiner! I simply took a glimpse of the PR and > > it looks promising to me. My general question is on the possibility of > > consolidating the work with ParquetRewriter, which shares a lot of > > common rewriting logic. > > > > Best, > > Gang > > > > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov < > > konstantinov.ma...@gmail.com> wrote: > > > > > Hi Parquet dev team! > > > > > > > > > I wanted to ask your opinion on the proposal I came up with. > > > PR: https://github.com/apache/parquet-mr/pull/1273 > > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430 > > > PR's description and JIRA ticket contains all the details, please check > > it > > > out. The feature is not yet ready to merge, it is just a proposal for > > now. > > > I wanted to ask a PARQUET community opinion if you see any obstacles > for > > > adding it? We find it very useful and plan to use it and if PARQUET > > > community finds no issues with it I can add tests, javadocs and polish > it > > > so we can add this new feature to PARQUET. > > > > > > > > > Max. > > > > > > -- Xinli Shang