Re: Drill & Behavioral Data

Jacques Nadeau Sat, 16 Feb 2013 10:46:40 -0800

It could be.  Since you know more about your problem set than we do, here
are ways to think about Drill:


   - Drill is not a database but rather a query layer that works with a
   number of underlying storage technologies
   - Drill is designed primarily to do full table scans of all relevant
   data as opposed to maintaining things like indexes.  If you need single
   record or point lookups, you're heavily dependent on the capabilities of
   the underlying data storage engine
   - Drill is designed for massive scale.  If your data can fit on one
   computer now and for the foreseeable future, other options will probably
   work better.
   - Drill is designed specifically to be extensible through a well-defined
   set of APIs.  If your problem set extends beyond traditional dataflow
   operators but can work in a dataflow model, Drill may be adapted to your
   problem easily.
   - One of these extensibility points is query language.  Because Drill
   has a concrete Logical Plan, you could develop an alternative DSL that
   better reflects your needs without having to rebuild the entire distributed
   processing system.
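
As a rough illustration of the scan-versus-lookup point above, here is a
minimal sketch in plain Python.  It is not Drill code: `Event`, `scan_total`,
`build_index`, and `point_lookup` are hypothetical names made up for this
example, and the in-memory list stands in for whatever storage sits underneath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: int
    action: str
    amount: float


# Toy data set; a real one would be spread across many machines.
EVENTS = [
    Event(1, "visit", 0.0),
    Event(2, "checkout", 25.0),
    Event(1, "checkout", 40.0),
    Event(3, "visit", 0.0),
]


def scan_total(events, action):
    """Full scan: read every record and aggregate the ones that match."""
    return sum(e.amount for e in events if e.action == action)


def build_index(events):
    """Group events by user_id, the kind of structure a storage engine
    (rather than the query layer) would have to maintain."""
    index = {}
    for e in events:
        index.setdefault(e.user_id, []).append(e)
    return index


def point_lookup(index, user_id):
    """Point lookup: jump straight to one user's records via the index."""
    return index.get(user_id, [])


if __name__ == "__main__":
    print(scan_total(EVENTS, "checkout"))
    print(point_lookup(build_index(EVENTS), 1))
```

The scan touches all records no matter how selective the question is, which
is the access pattern described in the second bullet.  The lookup is only
cheap because something built and maintains the index.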

Does that help at all?

Thanks,
Jacques


On Fri, Feb 15, 2013 at 12:48 PM, Ben Johnson <[email protected]> wrote:

> I recently had lunch with Ted Dunning and we discussed machine learning
> and a database project I've been working on called Sky (http://skydb.io/).
> After talking with him for a while he suggested I look at Drill as being a
> possible platform to build my behavioral analysis tool on top of (instead
> of writing it myself) and post a message to the mailing list.
>
> My database essentially stores actions and state changes for individual
> things. For example, you can store a web page visit or an e-mail open as an
> action and you can store someone's name or salary as a state and track that
> over time. It's basically like Datomic but actions can be tracked and data
> related to actions is not persisted outside the context of the action (e.g.
> a purchase amount only exists for an individual checkout action).
>
> Another reason I wrote the database was that I needed my own way to
> process this data. SQL tends to work with relational data but doesn't let
> me relate a row in an "actions" table with another row that might be two
> days before or might be in the next web site session.
>
> Performance is also a big factor. I'm able to run analysis on my data at
> around 10MM events/core/second and I'm hoping to get up to around 50MM
> events/core/second once I do some profiling and optimization. Will I lose a
> lot of performance having to join across data sources using drill bits?
>
> Ted said that I can accomplish a lot of this with Nested SQL and UDFs. Do
> you think I will get the performance and flexibility I'm looking for? Could
> Drill be a good platform to build this type of system on top of?
>
> Also, here's a demo of interactive analysis using a small subset of the
> GitHub Archive. It shows actions like pushes and repository creates and
> then lets you follow what people do after those actions. It's currently
> action-only but future versions will let you segment by point-in-time state.
>
> http://demo.skydb.io/
>
>
> Ben Johnson
>
>
>
>
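
The "what do people do after an action" question in the quoted message can
be sketched in a few lines of plain Python.  This is only an illustration of
the kind of per-actor, order-dependent query being described, not how Sky or
Drill implements it.  The `(actor, timestamp, action)` tuple layout and the
`next_actions` function are assumptions made for the example.

```python
from collections import Counter, defaultdict


def next_actions(events, target):
    """Count which action immediately follows each `target` action,
    looking only within the same actor's timeline.

    `events` is an iterable of (actor, timestamp, action) tuples in any order.
    """
    # Group events per actor so rows are related by actor, not by table position.
    by_actor = defaultdict(list)
    for actor, ts, action in events:
        by_actor[actor].append((ts, action))

    counts = Counter()
    for timeline in by_actor.values():
        timeline.sort()  # order each actor's events by timestamp
        # Walk consecutive pairs: (current event, the event right after it).
        for (_, current), (_, following) in zip(timeline, timeline[1:]):
            if current == target:
                counts[following] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("a", 1, "push"),
        ("a", 2, "create"),
        ("b", 1, "push"),
        ("b", 3, "fork"),
        ("a", 3, "push"),
        ("a", 5, "create"),
    ]
    print(next_actions(sample, "push"))
```

The grouping and ordering by actor is the step that plain row-at-a-time SQL
makes awkward, since each result depends on a neighbouring row in the same
actor's history.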
