Re: GSoC 2019

Stephen Ermshar Fri, 23 Aug 2019 17:08:40 -0700

This is the Final project report for adding a Merge Joiner to AsterixDB for 
Google Summer of Code 2019. There is also a copy in a GitHub Gist 
[here](https://gist.github.com/stephenermshar/bfa28dad52d9ba5075c06702aab3a8b8)

Thanks to GSoC, Apache, and my mentors for helping make this summer such a 
great and valuable experience for me; I feel like I’ve grown a lot through this 
project and plan to continue working on it.

---

# Adding Merge Join to AsterixDB (GSoC 2019)

## Project Description

This project adds the merge join algorithm to AsterixDB. It includes spilling to
disk when memory runs out, and a hint for SQL++ that is currently the only way
to make AsterixDB use the algorithm.

### Core Function

The main function `processJoin` in the `MergeJoiner` class works by reading from
the two inputs and incrementing the one with the smaller value until their keys
match. It saves all the tuples of the matching key from the right branch into a
buffer and then joins them with each tuple from the left branch until it reaches
one with a key that doesn't match. It starts over and repeats until there are no
more matching tuples.

If the right buffer fills, then all left tuples with the matching key are saved
to the disk. All the left tuples from the disk are joined with the right buffer.
The right buffer is repeatedly cleared, refilled, and joined with the left
tuples from the disk until there are no more matching tuples from the right
branch. When all the tuples with the matching key have been joined, it goes back
to the main loop and continues as normal.

### Producer Consumer Frame

The new `ProducerConsumerFrameState` class allows the merge joiner to request
the next frame from its input when it's ready to begin joining it. The
`nextFrame` function from the input frame writer passes its buffer to the
`ProducerConsumerFrameState` which then locks until the merge joiner requests
the next frame from the `ProducerConsumerFrameState`. The merge joiner also
locks if it requests the next frame before the input frame writer has provided
it. Locks are released when the merge joiner closes.

### Hint

The merge joiner is only used when its hint is used, and the hint is only
implemented for the SQL++ parser. The `JoinUtils` class checks for the hint and
sets the physical operator to the merge joiner if the hint is found.

### Testing

There are three tests for the new operator. There is an Optimizer test to make
sure the operator is called by the hint and the data is sorted on the same key
for both inputs. There are two Runtime tests, one runs without spilling, and the
other reduces join memory, includes a column of filler data, and adds extra rows
on the right to make the joiner fill the buffer and use the disk.

## Current State

There is an issue with the test that uses the disk. In the SQL++ Execution tests
there are a couple missing tuples in the results, though the same query produces
the correct result when run in the `AsterixHyracksIntegrationUtil`.

We're working on doing code reviews, and once the tests pass we should be almost
ready to merge the gerrit change into master.

## Future Work

The Merge Joiner doesn't have a failure case for when the disk is full, it only
handles the case where memory is full.

There may be ways to improve cache locality by handling frames differently.

Currently the joiner doesn't support conjunctive predicates or multi-joins.

## Project Challenges

A few of the challenges I had at the beginning of the summer had to do with
learning to use IntelliJ Idea and its debugger. Over time My mentors pointed out
various shortcuts and features that significantly improved my productivity. I
also got fairly comfortable using the debugger. Working with issues related to
different threads has still been a challenge towards the end of the summer, one
way I handled that was with conditional breakpoints that stopped the program
when a specific problematic tuple was found. Other challenges related to
IntelliJ had to do with building small changes faster, and figuring out how to
run the Integration Util from IntelliJ.

Understanding how to think about buffers and frames was difficult for me at
first, though once I had a handle on the interfaces used for frames I was able
to fix problems in my code much more quickly.

Communicating ideas about AsterixDB had a steep learning curve; while I had some
experience in AsterixDB from before the project I hadn't been working on it full
time, so I still had to get used to the terminology. I handled this at the
beginning of the summer by writing out parts of the algorithm in python so I
could more easily understand what was happening before implementing it in
AsterixDB.

## Related Links

- [My fork of AsterixDB](https://github.com/stephenermshar/asterixdb)
- [Gerrit change](https://asterix-gerrit.ics.uci.edu/#/c/3478/12/)
- [Copy of project 
proposal](https://gist.github.com/stephenermshar/deb2bb7031ca162ba604dac976923496)

# Adding Range Multi Partitioning Operator

In addition to the merge joiner we worked on a project that we'd started before
the summer. The Range Multi Partitioning Operator is for determining which
partitions to send ordered data to based on the range partitioning types
outlined in
[Processing Interval Joins On 
Map-Reduce](http://www.openproceedings.org/EDBT/2014/paper_176.pdf).

The main change we've been working on is
[[ASTERIXDB-2605][HYR][\*DB] add new range multi partitioning 
operator](https://asterix-gerrit.ics.uci.edu/#/c/3223/30/),
though recently we split off a smaller change
[[NO ISSUE] [HYR] Introduce connector for partial 
broadcasts](https://asterix-gerrit.ics.uci.edu/#/c/3524/3/).

The gerrit change commit messages contain more information about this project.

# Thanks

Thanks to Preston Carman and Ali Alsuliman who mentored me on this project, as
well as Dmitry Lychagin who joined in on our meetings and provided lots of
useful feedback and review.

Thanks to Preston Carman for also meeting with me in person during this project
and taking his time to explain concepts I tended struggled with.

Thanks to Apache and the AsterixDB community for this great opportunity to grow
and experience working on an exciting project.

---

- Stephen Ermshar
On Aug 8, 2019, 10:40 PM -0700, Stephen Ermshar <[email protected]>, 
wrote:
> Merge Join Status:
>
> • New changes are on GitHub 
> (https://github.com/stephenermshar/asterixdb/tree/stephenermshar/merge-join/master)
>  and gerrit (https://asterix-gerrit.ics.uci.edu/#/c/3478/).
>     • The main function of the merge-join has been reorganized to be cleaner 
> and easier to follow.
>     • Spilling to disk is implemented and works in a test query.
>     • There is a working hint for SQLPP queries.
> • The next step is to clean up and organize the code, there are a few places 
> in the main class of the joiner that could be much clearer.
> • We should be doing code reviews and wrapping the change up in the next 
> couple weeks.
>
>
> - Stephen Ermshar
> On Jul 23, 2019, 4:25 PM -0700, Stephen Ermshar <[email protected]>, 
> wrote:
> > GSoC 2019: Implementing Merge Join:
> >
> > • We have a working simple merge joiner implementation in a gerrit change 
> > here (https://asterix-gerrit.ics.uci.edu/#/c/3478/).
> > • This next week we’ll be working on handling limited memory by spilling 
> > with a run file.
> > • The next step will be to setup the hint and optimizer to use the merge 
> > joiner when appropriate; currently we have added a condition in JoinUtils 
> > that forces it to always use the merge joiner.
> >
> >
> > Multi Partitioning / Partial Broadcast Operator:
> >
> > • We’ve also been working on a separate lower priority change that is 
> > mainly focused on supporting interval join partitioning of Project, Split, 
> > and Replicate partitioning schemes.
> > • That change is also on gerrit here 
> > (https://asterix-gerrit.ics.uci.edu/#/c/3223/) with more details in the 
> > commit message.
> >
> >
> > - Stephen Ermshar
> > On Jul 2, 2019, 1:05 PM -0700, Stephen Ermshar <[email protected]>, 
> > wrote:
> > > Here the first update for GSoC 2019: Implementing Merge Join. I plan on 
> > > writing updates like this every two weeks. Let me know if there are any 
> > > thoughts on the project or these updates. Thanks!
> > >
> > > Our current status:
> > >
> > > • My fork of AsterixDB for GSoC is here 
> > > (https://github.com/stephenermshar/asterixdb).
> > > • When we started we worked on pulling in old merge-join code to start 
> > > off with, but then decided to put end to end tests for the query and the 
> > > optimizer in place first.
> > >     • The optimizer test is currently blank. The runtime test in 
> > > `merge-join/equi-join` is based on the `btree-index-nested-loop-join` 
> > > test and has a merge-join hint.
> > >     • Those tests are in my fork on the branch 
> > > [stephenermshar/merge-join/Import-original-hyracks-merge-join](https://github.com/stephenermshar/asterixdb/tree/stephenermshar/merge-join/import-original-hyracks-merge-join).
> > >
> > > The general plan at this point is to add a Merge Join Operator. It will 
> > > have an activity to take two data input streams, and a joiner activity 
> > > that can request data from the input activity when it’s ready. Our first 
> > > implementation won’t include spilling to disk.
> > >
> > > - Stephen Ermshar
> > > On May 28, 2019, 2:35 PM -0700, Mike Carey <[email protected]>, wrote:
> > > > Cool!  I would be happy to be looped in occasionally as well - so I will
> > > > watch this space for the public progress updates.  In addition to adding
> > > > the algorithm, we'll need to add an optimizer rule (actually update an
> > > > existing rule) so that it gets picked when appropriate - and also a hint
> > > > so that you can manually suggest that it be picked regardless.  DB2
> > > > chooses this join method when the incoming data arguments are already
> > > > sorted, so that would likely be a good heuristic for us too.  (It is
> > > > sure to be cheaper than anything else in that case.)  (So think about
> > > > this as adding a Merge Join, not a Sort Merge Join, actually, as is not
> > > > likely to be the case that sorting and then doing this will be a cost 
> > > > win.)
> > > >
> > > > On 5/26/19 9:22 PM, Stephen Ermshar (gmail) wrote:
> > > > > Hi Ali,
> > > > >
> > > > > Since the coding period starts tomorrow I wanted to get in touch with 
> > > > > you again.
> > > > >
> > > > > Preston and I were thinking it would be good to post weekly reports 
> > > > > here on the dev mailing list to keep a public record of our progress. 
> > > > > We’d also like to have weekly or twice-weekly meetings to track our 
> > > > > progress and direction. If you have any thoughts on how we can 
> > > > > organize our efforts more this summer let me know.
> > > > >
> > > > > This week Preston and I decided we’d focus on refreshing my knowledge 
> > > > > of the Sort Merge Join algorithm. I’d like to also get a fresh 
> > > > > development environment setup for this summer and if possible start 
> > > > > moving old code in from the outdated repository.
> > > > >
> > > > > I’m excited to start working on this!
> > > > >
> > > > > - Stephen Ermshar
> > > > >

Re: GSoC 2019

Reply via email to