Re: (Re)Introducing Culvert - A secondary indexing framework for BigTable like systems

Jesse Yates Fri, 23 Dec 2011 22:03:08 -0800

On Fri, Dec 23, 2011 at 9:28 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:


> I briefly looked at the presentation. May I ask how is it much
> different than using elasticsearch or solr? As I understand terms are
> being indexed which is also done by search engines. Just trying to
> understand the main benefit. We currently use Cassandra.
>
> Thanks
>

Culvert is designed not just to do search over documents, but also to do
general indexing over all your key-values. Chances are the things you are
storing are more than just unstructured text with some special key. If
that's the case, then some general, text-based indexing is really all you
need. Right now, Culvert only supports a built-in text-based index, but it
is pretty easy to write new ones. The power in Culvert comes from the fact
that it can integrate really easily with existing indexes (legacy systems)
and do indexing with some of its built-in indexes. If you want to look up
by something that is not the row key (primary key), then you will need to
have an index on that value; this is usually taken care of for you in
'traditional' SQL systems.
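
To make the "index on a non-row-key value" idea concrete, here is a minimal
in-memory sketch. It is not Culvert's API: the class and method names
(SecondaryIndexSketch, put, lookup) are made up for illustration, and two
sorted maps stand in for the data table and the index table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Culvert.
class SecondaryIndexSketch {
    // Data table: row key -> (column -> value), kept sorted by row key.
    private final NavigableMap<String, Map<String, String>> table = new TreeMap<>();
    // Index table: indexed value -> row keys that currently hold that value.
    private final NavigableMap<String, NavigableSet<String>> index = new TreeMap<>();
    private final String indexedColumn;

    SecondaryIndexSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Write a cell and keep the index in step with the data table.
    void put(String rowKey, String column, String value) {
        Map<String, String> row = table.computeIfAbsent(rowKey, k -> new HashMap<>());
        String old = row.put(column, value);
        if (!column.equals(indexedColumn)) {
            return;
        }
        if (old != null) {
            // Drop the stale index entry for the overwritten value.
            NavigableSet<String> stale = index.get(old);
            if (stale != null) {
                stale.remove(rowKey);
                if (stale.isEmpty()) {
                    index.remove(old);
                }
            }
        }
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(rowKey);
    }

    // Look up row keys by value through the index, with no full table scan.
    List<String> lookup(String value) {
        NavigableSet<String> keys = index.get(value);
        return keys == null ? new ArrayList<String>() : new ArrayList<String>(keys);
    }

    public static void main(String[] args) {
        SecondaryIndexSketch people = new SecondaryIndexSketch("city");
        people.put("row1", "city", "Boston");
        people.put("row2", "city", "Austin");
        people.put("row3", "city", "Boston");
        System.out.println(people.lookup("Boston")); // prints [row1, row3]
    }
}
```

Without the second map, finding every row with city = Boston means reading
the whole table; with it, the lookup touches only the matching entries.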

On top of just doing the indexing for you, Culvert does a lot of complex
query execution with a subset of SQL combined with a decorator design
pattern to make it really natural to build up queries. Because this
execution is built into the core of Culvert, it leverages all the
information you have indexed, which means potentially orders of magnitude
faster queries. There is also a lot of potential work here, under the hood,
doing query optimization (Culvert is pretty young).
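
As a rough picture of what "decorator-style" query building can look like,
here is a small sketch. The names (RowQuery, IndexRangeScan, Intersect,
Limit) are hypothetical and are not Culvert's classes; the indexes are plain
sorted maps from an indexed value to the row keys holding it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical names throughout; this is not Culvert's query API.
interface RowQuery {
    List<String> execute(); // returns matching row keys
}

// Leaf query: a range scan over one index (value -> row keys).
final class IndexRangeScan implements RowQuery {
    private final NavigableMap<String, List<String>> index;
    private final String from;
    private final String to;

    IndexRangeScan(NavigableMap<String, List<String>> index, String from, String to) {
        this.index = index;
        this.from = from;
        this.to = to;
    }

    @Override
    public List<String> execute() {
        List<String> rows = new ArrayList<>();
        // Only index entries inside [from, to] are visited.
        for (List<String> keys : index.subMap(from, true, to, true).values()) {
            rows.addAll(keys);
        }
        return rows;
    }
}

// Wraps two queries and keeps only row keys returned by both.
final class Intersect implements RowQuery {
    private final RowQuery left;
    private final RowQuery right;

    Intersect(RowQuery left, RowQuery right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public List<String> execute() {
        Set<String> keep = new LinkedHashSet<>(left.execute());
        keep.retainAll(right.execute());
        return new ArrayList<>(keep);
    }
}

// Wraps any query and truncates its result.
final class Limit implements RowQuery {
    private final RowQuery inner;
    private final int max;

    Limit(RowQuery inner, int max) {
        this.inner = inner;
        this.max = max;
    }

    @Override
    public List<String> execute() {
        List<String> rows = inner.execute();
        return new ArrayList<>(rows.subList(0, Math.min(max, rows.size())));
    }
}

class QuerySketchDemo {
    public static void main(String[] args) {
        NavigableMap<String, List<String>> byAge = new TreeMap<>();
        byAge.put("25", Arrays.asList("r1"));
        byAge.put("30", Arrays.asList("r2", "r3"));
        byAge.put("41", Arrays.asList("r4"));
        NavigableMap<String, List<String>> byCity = new TreeMap<>();
        byCity.put("Austin", Arrays.asList("r2", "r4"));
        byCity.put("Boston", Arrays.asList("r1", "r3"));

        // Roughly: WHERE age BETWEEN 30 AND 45 AND city = 'Austin' LIMIT 10
        RowQuery q = new Limit(
            new Intersect(new IndexRangeScan(byAge, "30", "45"),
                          new IndexRangeScan(byCity, "Austin", "Austin")),
            10);
        System.out.println(q.execute()); // prints [r2, r4]
    }
}
```

Each wrapper implements the same interface as the thing it wraps, so queries
nest naturally, and every leaf is answered from an index rather than a scan
of the data table.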

We can also potentially do server-side joins. I don't know what Cassandra
supports in this field, but it would need to be something equivalent to
coprocessors in HBase (or a modified iterator for Accumulo). Even without
the server-side joins, we can still leverage the indexes in doing the
joins, making for much more efficient joins.

The Hive adapter is about 90% of the way there as well, which would give
you full index support on top of the ease with which Hive lets you write
HQL for your tables.

Finally, Culvert allows you to be entirely cross-platform with other
BigTable-style databases. All the queries and indexes are developed
entirely agnostically to the underlying datastore. So, if you wanted to
switch to HBase tomorrow, all you would need to do is copy your data over
to the database (through the Culvert client, though we've discussed adding
batch indexing) and then point Culvert at the new install. All your queries
stay the same, leveraging the same indexes. The only work you would need to
redo is any indexes you wrote by hand.
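
A sketch of why the switch is cheap: if the indexing client only ever talks
to a narrow table interface, moving stores means replaying the data through
a client that wraps a different adapter. Everything below is hypothetical
(TableAdapter, InMemoryAdapter, IndexingClient are not Culvert's classes),
and an in-memory map stands in for a real HBase, Accumulo, or Cassandra
adapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical adapter interface; a real one would wrap a datastore client.
interface TableAdapter {
    void put(String rowKey, String value);
    // Entries with start <= key < end, in key order; a null end means unbounded.
    List<Map.Entry<String, String>> scan(String start, String end);
}

// Stand-in backend used for both the "old" and "new" store in this sketch.
final class InMemoryAdapter implements TableAdapter {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    @Override
    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    @Override
    public List<Map.Entry<String, String>> scan(String start, String end) {
        NavigableMap<String, String> view = (end == null)
            ? rows.tailMap(start, true)
            : rows.subMap(start, true, end, false);
        return new ArrayList<>(view.entrySet());
    }
}

// Indexing logic written only against TableAdapter, so it is store-agnostic.
final class IndexingClient {
    private final TableAdapter data;
    private final TableAdapter index;

    IndexingClient(TableAdapter data, TableAdapter index) {
        this.data = data;
        this.index = index;
    }

    // Writes the row and an index row keyed by value + separator + row key.
    // Simplification: overwriting a row does not remove its old index row.
    void put(String rowKey, String value) {
        data.put(rowKey, value);
        index.put(value + "\u0000" + rowKey, rowKey);
    }

    // All row keys whose value equals the argument, found via the index table.
    List<String> rowsWithValue(String value) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : index.scan(value + "\u0000", value + "\u0001")) {
            out.add(e.getValue());
        }
        return out;
    }

    // Replays every row of a source table through a client for the new store.
    static void copy(TableAdapter source, IndexingClient target) {
        for (Map.Entry<String, String> e : source.scan("", null)) {
            target.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        TableAdapter oldData = new InMemoryAdapter();
        IndexingClient oldClient = new IndexingClient(oldData, new InMemoryAdapter());
        oldClient.put("r1", "blue");
        oldClient.put("r2", "red");
        oldClient.put("r3", "blue");

        IndexingClient newClient =
            new IndexingClient(new InMemoryAdapter(), new InMemoryAdapter());
        copy(oldData, newClient);
        System.out.println(newClient.rowsWithValue("blue")); // prints [r1, r3]
    }
}
```

The query side (rowsWithValue here) never changes; only the adapter handed
to the client does.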

The adapter for Cassandra really wouldn't be that hard to write - there are
pretty good examples for how it works with HBase and Accumulo, so I don't
expect the Cassandra part to be that much different.

-Jesse



>
> On Fri, Dec 23, 2011 at 6:23 AM, John W Vines <john.w.vi...@ugov.gov>
> wrote:
> > We have yet to release accumulo-1.4, so that was all you working out of
> your local repo.
> >
> > As for Accumulo-1.3.5, we are currently working on making the
> appropriate changes to get make it kosher for a maven release, but we're
> not there yet.
> >
> > John
> >
> > ----- Original Message -----
> > | From: "Jesse Yates" <jesse.k.ya...@gmail.com>
> > | To: u...@hbase.apache.org
> > | Cc: d...@hbase.apache.org, accumulo-dev@incubator.apache.org,
> accumulo-u...@incubator.apache.org
> > | Sent: Thursday, December 22, 2011 5:22:46 PM
> > | Subject: Re: (Re)Introducing Culvert - A secondary indexing framework
> for BigTable like systems
> > | Wow, that's embarrassing - project not building...
> > |
> > | It's because accumulo's release is no longer deployed into the
> > | standard apache maven repository. Maybe one of the accumulo committers
> > | can shed some light on where to find it?
> > |
> > | I'll make some changes and have it at least compiling from the raw
> > | tonight :)
> > |
> > | The alternative is to download accumulo source (
> > | https://github.com/apache/accumulo ) and "mvn clean install" to get it
> > | working on your local machine.
> > |
> > | Thanks Ted!
> > |
> > | -Jesse
> > |
> > |
> > | On Thu, Dec 22, 2011 at 1:54 PM, Ted Yu < yuzhih...@gmail.com > wrote:
> > |
> > |
> > | Thanks for the update, Jesse.
> > | Let us know of any feature Culvert needs from HBase.
> > |
> > | After cloning Culvert, I got:
> > |
> > | [INFO] Culvert - Accumulo Integration .................... FAILURE
> > | [0.431s]
> > | [INFO]
> > |
> ------------------------------------------------------------------------
> > | [INFO] BUILD FAILURE
> > | [INFO]
> > |
> ------------------------------------------------------------------------
> > | [INFO] Total time: 1:06.638s
> > | [INFO] Finished at: Thu Dec 22 13:51:34 PST 2011
> > | [INFO] Final Memory: 20M/81M
> > | [INFO]
> > |
> ------------------------------------------------------------------------
> > | [ERROR] Failed to execute goal on project culvert-accumulo: Could not
> > | resolve dependencies for project
> > | com.bah.culvert:culvert-accumulo:jar:0.4.0-SNAPSHOT: Could not find
> > | artifact
> > | org.apache.accumulo:accumulo-core:jar:1.4.0-incubating-SNAPSHOT in
> > | apache-snapshots ( http://repository.apache.org/snapshots/ ) -> [Help
> > | 1]
> > |
> > | Can someone provide hint ?
> > |
> > | On Thu, Dec 22, 2011 at 11:44 AM, Jesse Yates <
> > | jesse.k.ya...@gmail.com >wrote:
> > |
> > |
> > | > Culvert was originally introduced at Hadoop Summit 2011, but recent
> > | > updates
> > | > have made it very applicable to current systems. Recently, we added
> > | > support
> > | > for Accumulo as well as upgraded HBase support to 0.92. Since Hadoop
> > | > Summit, there have also been significant code cleanup and added some
> > | > small
> > | > features. However, we found that most people hadn't heard of
> > | > Culvert, so we
> > | > wanted to re-release the framework.
> > | >
> > | > For an introduction to using Culvert, check out the blog post here:
> > | > http://jyates.github.com/2011/11/17/intro-to-culvert.html
> > | >
> > | > Also, the original presentation (where we discuss the internals) is
> > | > available on slideshare<
> > | >
> http://www.slideshare.net/jesse_yates/culvert-a-robust-framework-for-secondary-indexing-of-structured-and-unstructured-data
> > |
> > | > >
> > | > .
> > | >
> > | > There is a Culvert hackathon in the middle of January:
> > | > http://culverthackathon2012.eventbrite.com/
> > | >
> > | > Oh, and you can find the code on
> > | > github< https://github.com/booz-allen-hamilton/culvert >
> > |
> > |
> > | > .
> > | >
> > | > Below is an overview of why we wrote Culvert and what it does.
> > | >
> > | > Secondary indexing is a common design pattern in BigTable-like
> > | > databases
> > | > that allows users to index one or more columns in a table. This
> > | > technique
> > | > enables fast search of records in a database based on a particular
> > | > column
> > | > instead of the row id, thus enabling relational-style semantics in a
> > | > NoSQL
> > | > environment. Frequently, the index is stored either in a reserved
> > | > namespace
> > | > in the table or another index table.
> > | >
> > | > Despite the fact that this is a common design pattern in
> > | > BigTable-based
> > | > applications, most implementations of this practice to date have
> > | > been
> > | > tightly coupled with a particular application. As a result, few
> > | > general-purpose frameworks for secondary indexing on BigTable-like
> > | > databases exist, and those that do are tied to a particular
> > | > implementation
> > | > of the BigTable model.
> > | >
> > | > There are several existing tools (Solr, Lily), but these are focused
> > | > on
> > | > doing text based search and are highly restrictive to indexes
> > | > created
> > | > through their framework. What if you want to use your existing
> > | > indexes? Or
> > | > leverage the indexes to do complex queries?
> > | >
> > | > We developed a solution to this problem called Culvert that supports
> > | > online
> > | > index updates as well as a variation of the HIVE query language. In
> > | > designing Culvert, we sought to make the solution pluggable so that
> > | > it can
> > | > be used on any of the many BigTable-like databases (HBase,
> > | > Cassandra,
> > | > etc.). Furthermore, it is also easily extensible to existing, hand
> > | > rolled
> > | > indexes.
> > | >
> > | > As well as being a secondary indexing framework, it is also a query
> > | > execution mechanism - think pig/hive minus the fancy command line.
> > | > We
> > | > support a subset of SQL, but are able to take full advantage of
> > | > home-rolled
> > | > and built-in indexes, leading to query execution times potentially
> > | > orders
> > | > of magnitude smaller than existing approaches and certainly orders
> > | > of
> > | > magnitude more easily.
> > | >
> > | > -- Jesse
> > | > -------------------
> > | > Jesse Yates
> > | > 240-888-2200
> > | > @jesse_yates
> > | >
> > |
> > |
> > |
> > | --
> > | -------------------
> > | Jesse Yates
> > | 240-888-2200
> > | @jesse_yates
>



-- 
-------------------
Jesse Yates
240-888-2200
@jesse_yates
