Got it,

Let me start working on this. I will modify the PR, It will probably look
clunky at first iteration, but we can converge it, eliminate code
duplication and try to make it simple at the end.

Max.

On Mon, Feb 19, 2024 at 1:49 AM Gang Wu <ust...@gmail.com> wrote:

> I agree it is challenging to extend ParquetRewriter. However, it would also
> add maintenance overhead if two classes share a lot of similar features.
> For example, we have to implement a new feature twice to make sure both
> rewriter and joiner support it. The birth of class ParquetRewriter itself
> is the
> result of a refactoring work to merge different classes with
> similar features.
> IMO, it is OK not to support all existing rewrite features (e.g. nullify
> column)
> while joining columns from different files. WDYT?
>
> For the potential bugs, we can file JIRA issues to discuss them separately.
>
> Best,
> Gang
>
> On Mon, Feb 19, 2024 at 10:36 AM Xinli shang <sha...@uber.com.invalid>
> wrote:
>
> > HI Max,
> >
> > This is a very interesting feature with a great idea!
> >
> > Xinli
> >
> > On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
> > konstantinov.ma...@gmail.com> wrote:
> >
> > > Hi Gang,
> > >
> > >
> > > I actually started working on this feature by trying to extend
> > > ParquetRewriter with that new capability, but I quickly ran into
> issues:
> > > - Many state holder variables in ParquetRewriter will have to be
> > duplicated
> > > for the left / right side of the join
> > > - Some methods(ex: nullifyColumn) are very close but not the same for
> > > joiner and will require more branching in the existing codebase of
> > > ParquetRewriter
> > > - Tests for a Join part will have to pay a special attention to the
> right
> > > side of the join as it is a new thing in a joiner so it will blow out
> > test
> > > class quite a bit
> > > To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter
> > while
> > > possible will potentially make codebase too complex and hard to reason
> > > about.
> > >
> > > I thought about it, maybe we can utilize Factory method & Builder
> > patterns
> > > for this? For example we can:
> > > - Unify Options(RewriteOptions/JoinOptions) into a single class, if one
> > of
> > > the final implementation is not supporting a certain feature it should
> > > throw exception during construction
> > > - Use Factory pattern approach and pick the actual final implementation
> > of
> > > the class based on provided options
> > > - Both ParquetRewriter & ParquetJoiner will implement a new Interface
> > that
> > > has processBlocks() & close() public method
> > > - Use Builder pattern approach and make all methods including
> > constructors
> > > private besides those that need to be exposed to users
> > > By using this approach we can simplify internal implementation by
> > dividing
> > > it into separate dedicated smaller sub-modules while still providing a
> > > single feature rich external API. Let me know what you think.
> > >
> > > Also I might be wrong but I’ve noticed a few potential issues with
> > > ParquetRewriter:
> > > - not sure if that is a bug in but is not we supposed to consume() on
> the
> > > reader, here is the place in ParquetJoiner
> > > <
> > >
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> > > >
> > > you can find the same place in ParquetRewriter
> > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> > > >
> > > ?
> > > - newSchema() & extractField() methods seem to have a small issue with
> > > complex nested schemas, here is the line that is different in
> > > ParquetRewriter
> > > <
> > >
> >
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> > > >
> > > and here is ParquetJoiner
> > > <
> > >
> >
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> > > >.
> > > ParquetRewriter’s version originally failed in my tests, it originally
> > had
> > > a different schema, I will try to reproduce it later, right now it
> > works, I
> > > need to go back and check the commit history.
> > >
> > > btw
> > > I wanted to start covering ParquetJoiner with extensive tests the next
> > > week.
> > > Also we built POC for one of our projects based on the current version
> of
> > > ParquetJoiner and it showed up to x10 improvement in performance in
> > > comparison with a default join implementation using Spark.
> > >
> > >
> > > Max.
> > >
> > > On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <ust...@gmail.com> wrote:
> > >
> > > > Hi Max,
> > > >
> > > > Thanks for proposing the joiner! I simply took a glimpse of the PR
> and
> > > > it looks promising to me. My general question is on the possibility
> of
> > > > consolidating the work with ParquetRewriter, which shares a lot of
> > > > common rewriting logic.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > > > konstantinov.ma...@gmail.com> wrote:
> > > >
> > > > > Hi Parquet dev team!
> > > > >
> > > > >
> > > > > I wanted to ask your opinion on the proposal I came up with.
> > > > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > > > PR's description and JIRA ticket contains all the details, please
> > check
> > > > it
> > > > > out. The feature is not yet ready to merge, it is just a proposal
> for
> > > > now.
> > > > > I wanted to ask a PARQUET community opinion if you see any
> obstacles
> > > for
> > > > > adding it? We find it very useful and plan to use it and if PARQUET
> > > > > community finds no issues with it I can add tests, javadocs and
> > polish
> > > it
> > > > > so we can add this new feature to PARQUET.
> > > > >
> > > > >
> > > > > Max.
> > > > >
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >
>

Reply via email to