Re: Apache Accumulo integrated with Presto

Adam J. Shook Mon, 13 Jun 2016 08:25:31 -0700

A few clarifications:

- Presto supports hash-based distributed joins as well as broadcast joins


- Presto metadata is stored in ZooKeeper, but metadata storage is pluggable
and could be stored in Accumulo instead

- The connector does use tablet locality when scanning Accumulo, but our
testing has shown you get better performance by giving Accumulo and Presto
their own dedicated machines, making locality a moot point.  This will
certainly change based on types of queries, data sizes, network quality,
etc.

- You can insert the results of a query into a Presto table using INSERT
INTO foo SELECT ..., as well as create a table from the results of a query
(CTAS).  Though, for large inserts, it is typically best to bypass the
Presto layer and insert directly into the Accumulo tables using the
PrestoBatchWriter API

Cheers,
--Adam

On Mon, Jun 13, 2016 at 7:20 AM, Christopher <[email protected]> wrote:

> Thanks for that summary, Dylan! Very helpful.
>
> On Mon, Jun 13, 2016, 01:36 Dylan Hutchison <[email protected]>
> wrote:
>
> > Thanks for sharing Sean.  Here are some notes I wrote after reading the
> > article on Presto-Accumulo design.  I have a research interest in the
> > relationship between relational (SQL) and non-relational (Accumulo)
> > systems, so I couldn't resist reading the post in detail.
> >
> >    - Places the primary key in the Accumulo row.
> >    - Performs row-at-a-time processing (each tuple is one row in
> >    Accumulo) using WholeRowIterator behavior.
> >    - Relational table metadata is stored in the Presto infrastructure (as
> >    opposed to an Accumulo table).
> >    - Supports the creation of index tables for any attributes. These
> >    index tables speed up queries that filter on indexed attributes.  It
> is
> >    standard secondary indexing, which provides speedups when the
> selectivity
> >    of the query is roughly <10% of the original table.
> >    - Only database->client querying is supported.  You cannot run "select
> >    ... into result_table".
> >    - As far as I can see, Presto only has one join strategy: *broadcast
> >    join*.  The right table of every join is scanned into one of the
> >    Presto worker's memory.  Subsequently the size of the right table is
> >    limited by worker memory.
> >    - There is one Presto worker for each Accumulo tablet, which enables
> >    good scaling.
> >    - The Presto bridge classes track internal Accumulo information such
> >    as the assignment of tablets to tablet servers by reading Accumulo's
> >    Metadata table. Presto uses tablet locations to provide better
> locality.
> >    - The Presto bridge comes with several Accumulo server-side iterators
> >    for filtering and aggregating.
> >    - The code is quite nice and clean.
> >
> > This image below gives Presto's architecture.  Accumulo takes the role of
> > the DB icon in the bottom-right corner.
> >
> > [image: Inline image 2]
> >
> > Bloomberg ran 13 out of the 22 TPC-H queries.  There is no fundamental
> > reason why they cannot run all the queries; they just have not
> implemented
> > everything required ('exists' clauses, non-equi join, etc.).
> >
> > The interface looks like this, though they use a compiled java jar to
> > insert entries from a csv file (it wraps around a BatchWriter).
> >
> > [image: Inline image 3]
> >
> > Here are performance results.  They don't say what hardware or data sizes
> > they use.  Whatever it is, they must have the ability to fit the smaller
> > table of any join into memory as a result of Presto's broadcast join
> > strategy.  The strong scaling looks very nice.
> >
> > [image: Inline image 4]
> >
> > They have one other plot that shows how secondary indexing speeds up some
> > queries with low selectivity.
> >
> > Cheers, Dylan
> >
> >
> >
> > On Sun, Jun 12, 2016 at 7:06 PM, Sean Busbey <[email protected]>
> wrote:
> >
> >> Bloomberg have a post about a connector they made to query Accumulo from
> >> Presto:
> >>
> >>
> >>
> http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
> >>
> >> --
> >> Sean Busbey
> >>
> >
> >
>

Re: Apache Accumulo integrated with Presto

Reply via email to