Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Max Konstantinov Sun, 18 Feb 2024 13:19:06 -0800

Hi Gang,


I actually started working on this feature by trying to extend
ParquetRewriter with that new capability, but I quickly ran into issues:
- Many state holder variables in ParquetRewriter will have to be duplicated
for the left / right side of the join
- Some methods(ex: nullifyColumn) are very close but not the same for
joiner and will require more branching in the existing codebase of
ParquetRewriter
- Tests for a Join part will have to pay a special attention to the right
side of the join as it is a new thing in a joiner so it will blow out test
class quite a bit
To summarize, in my opinion: adding ParquetJoiner to ParquetRewriter while
possible will potentially make codebase too complex and hard to reason
about.

I thought about it, maybe we can utilize Factory method & Builder patterns
for this? For example we can:
- Unify Options(RewriteOptions/JoinOptions) into a single class, if one of
the final implementation is not supporting a certain feature it should
throw exception during construction
- Use Factory pattern approach and pick the actual final implementation of
the class based on provided options
- Both ParquetRewriter & ParquetJoiner will implement a new Interface that
has processBlocks() & close() public method
- Use Builder pattern approach and make all methods including constructors
private besides those that need to be exposed to users
By using this approach we can simplify internal implementation by dividing
it into separate dedicated smaller sub-modules while still providing a
single feature rich external API. Let me know what you think.

Also I might be wrong but I’ve noticed a few potential issues with
ParquetRewriter:
- not sure if that is a bug in but is not we supposed to consume() on the
reader, here is the place in ParquetJoiner
<https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L345>
you can find the same place in ParquetRewriter
<https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L771>
?
- newSchema() & extractField() methods seem to have a small issue with
complex nested schemas, here is the line that is different in
ParquetRewriter
<https://github.com/apache/parquet-mr/blob/b2080aa5735a97e5896260e82bde9b2b8455432a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L815>
and here is ParquetJoiner
<https://github.com/MaxNevermind/parquet-mr/blob/7ae35059ae9801e0ffb7f9a0dc825621dbc37ecc/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/join/ParquetJoiner.java#L390>.
ParquetRewriter’s version originally failed in my tests, it originally had
a different schema, I will try to reproduce it later, right now it works, I
need to go back and check the commit history.

btw
I wanted to start covering ParquetJoiner with extensive tests the next
week.
Also we built POC for one of our projects based on the current version of
ParquetJoiner and it showed up to x10 improvement in performance in
comparison with a default join implementation using Spark.


Max.

On Sat, Feb 17, 2024 at 10:37 PM Gang Wu <[email protected]> wrote:

> Hi Max,
>
> Thanks for proposing the joiner! I simply took a glimpse of the PR and
> it looks promising to me. My general question is on the possibility of
> consolidating the work with ParquetRewriter, which shares a lot of
> common rewriting logic.
>
> Best,
> Gang
>
> On Tue, Feb 13, 2024 at 9:27 AM Max Konstantinov <
> [email protected]> wrote:
>
> > Hi Parquet dev team!
> >
> >
> > I wanted to ask your opinion on the proposal I came up with.
> > PR: https://github.com/apache/parquet-mr/pull/1273
> > JIRA: https://issues.apache.org/jira/browse/PARQUET-2430
> > PR's description and JIRA ticket contains all the details, please check
> it
> > out. The feature is not yet ready to merge, it is just a proposal for
> now.
> > I wanted to ask a PARQUET community opinion if you see any obstacles for
> > adding it? We find it very useful and plan to use it and if PARQUET
> > community finds no issues with it I can add tests, javadocs and polish it
> > so we can add this new feature to PARQUET.
> >
> >
> > Max.
> >
>

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

Reply via email to