HI Max,

This is a very interesting feature with a great idea!

Xinli

On Sun, Feb 18, 2024 at 1:19 PM Max Konstantinov <
konstantinov.ma...@gmail.com> wrote:

> Hi Gang,
>
>
> I actually started working on this feature by trying to extend
> ParquetRewriter with that new capability, but I quickly ran into issues:
> - Many state holder variables in ParquetRewriter will have to be duplicated
> for the left / right side of the join
> - Some methods(ex: nullifyColumn) are very close but not the same for
> joiner and will require more branching in the existing codebase of
> ParquetRewriter
> - Tests for a Join part will have to pay a special attention to the right
> side of the join as it is a new thing in a joiner so it will blow out test
> class quite a bit
> To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while
> possible will potentially make codebase too complex and hard to reason
> about.
>
> I thought about it, maybe we can utilize Factory method & Builder patterns
> for this? For example we can:
> - Unify Options(RewriteOptions/JoinOptions) into a single class, if one of
> the final implementation is not supporting a certain feature it should
> throw exception during construction
> - Use Factory pattern approach and pick the actual final implementation of
> the class based on provided options
> - Both ParquetRewriter & ParquetJoiner will implement a new Interface that
> has processBlocks() & close() public method
> - Use Builder pattern approach and make all methods including constructors
> private besides those that need to be exposed to users
> By using this approach we can simplify internal implementation by dividing
> it into separate dedicated smaller sub-modules while still providing a
> single feature rich external API. Let me know what you think.
>
> Also I might be wrong but I’ve noticed a few potential issues with
> ParquetRewriter:
> - not sure if that is a bug in but is not we supposed to consume() on the
> reader, here is the place in ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345
> >
> you can find the same place in ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771
> >
> ?
> - newSchema() & extractField() methods seem to have a small issue with
> complex nested schemas, here is the line that is different in
> ParquetRewriter
> <
> https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815
> >
> and here is ParquetJoiner
> <
> https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390
> >.
> ParquetRewriter’s version originally failed in my tests, it originally had
> a different schema, I will try to reproduce it later, right now it works, I
> need to go back and check the commit history.
>
> btw
> I wanted to start covering ParquetJoiner with extensive tests the next
> week.
> Also we built POC for one of our projects based on the current version of
> ParquetJoiner and it showed up to x10 improvement in performance in
> comparison with a default join implementation using Spark.
>
>
> Max.
>
> On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <ust...@gmail.com> wrote:
>
> > Hi Max,
> >
> > Thanks for proposing the joiner! I simply took a glimpse of the PR and
> > it looks promising to me. My general question is on the possibility of
> > consolidating the work with ParquetRewriter, which shares a lot of
> > common rewriting logic.
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> > konstantinov.ma...@gmail.com> wrote:
> >
> > > Hi Parquet dev team!
> > >
> > >
> > > I wanted to ask your opinion on the proposal I came up with.
> > > PR: https://github.com/apache/parquet-mr/pull/1273
> > > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > > PR's description and JIRA ticket contains all the details, please check
> > it
> > > out. The feature is not yet ready to merge, it is just a proposal for
> > now.
> > > I wanted to ask a PARQUET community opinion if you see any obstacles
> for
> > > adding it? We find it very useful and plan to use it and if PARQUET
> > > community finds no issues with it I can add tests, javadocs and polish
> it
> > > so we can add this new feature to PARQUET.
> > >
> > >
> > > Max.
> > >
> >
>


-- 
Xinli Shang

Reply via email to