Thanks for that summary, Dylan! Very helpful.
On Mon, Jun 13, 2016, 01:36 Dylan Hutchison <[email protected]>
wrote:
> Thanks for sharing Sean. Here are some notes I wrote after reading the
> article on Presto-Accumulo design. I have a research interest in the
> relationship between relational (SQL) and non-relational (Accumulo)
> systems, so I couldn't resist reading the post in detail.
>
> - Places the primary key in the Accumulo row.
> - Performs row-at-a-time processing (each tuple is one row in
> Accumulo) using WholeRowIterator behavior.
> - Relational table metadata is stored in the Presto infrastructure (as
> opposed to an Accumulo table).
> - Supports the creation of index tables for any attributes. These
> index tables speed up queries that filter on indexed attributes. It is
> standard secondary indexing, which provides speedups when the selectivity
> of the query is roughly <10% of the original table.
> - Only database->client querying is supported. You cannot run "select
> ... into result_table".
> - As far as I can see, Presto only has one join strategy: *broadcast
> join*. The right table of every join is scanned into one of the
> Presto worker's memory. Subsequently the size of the right table is
> limited by worker memory.
> - There is one Presto worker for each Accumulo tablet, which enables
> good scaling.
> - The Presto bridge classes track internal Accumulo information such
> as the assignment of tablets to tablet servers by reading Accumulo's
> Metadata table. Presto uses tablet locations to provide better locality.
> - The Presto bridge comes with several Accumulo server-side iterators
> for filtering and aggregating.
> - The code is quite nice and clean.
>
> This image below gives Presto's architecture. Accumulo takes the role of
> the DB icon in the bottom-right corner.
>
> [image: Inline image 2]
>
> Bloomberg ran 13 out of the 22 TPC-H queries. There is no fundamental
> reason why they cannot run all the queries; they just have not implemented
> everything required ('exists' clauses, non-equi join, etc.).
>
> The interface looks like this, though they use a compiled java jar to
> insert entries from a csv file (it wraps around a BatchWriter).
>
> [image: Inline image 3]
>
> Here are performance results. They don't say what hardware or data sizes
> they use. Whatever it is, they must have the ability to fit the smaller
> table of any join into memory as a result of Presto's broadcast join
> strategy. The strong scaling looks very nice.
>
> [image: Inline image 4]
>
> They have one other plot that shows how secondary indexing speeds up some
> queries with low selectivity.
>
> Cheers, Dylan
>
>
>
> On Sun, Jun 12, 2016 at 7:06 PM, Sean Busbey <[email protected]> wrote:
>
>> Bloomberg have a post about a connector they made to query Accumulo from
>> Presto:
>>
>>
>> http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
>>
>> --
>> Sean Busbey
>>
>
>