Jacques-

Thank you very much for the explanation. That helps answer a lot of my 
questions. I'm also aiming to do full table scans (although I sometimes need 
to support sampling and per-record lookups) and my data is large enough 
to be spread across multiple machines. I'll have to do some more research 
on using a dataflow model, though.

Thanks again for taking the time to respond!


Ben Johnson
[email protected]



On Feb 16, 2013, at 11:46 AM, Jacques Nadeau wrote:

> It could be.  Since you know more about your problem set than we do, here
> are ways to think about Drill:
> 
>   - Drill is not a database but rather a query layer that works with a
>   number of underlying storage technologies
>   - Drill is designed primarily to do full table scans of all relevant
>   data as opposed to maintaining things like indexes.  If you need single
>   record or point lookups, you're heavily dependent on the capabilities of
>   the underlying data storage engine
>   - Drill is designed for massive scale.  If your data can fit on one
>   computer now and for the foreseeable future, other options will probably
>   work better.
>   - Drill is designed specifically to be extensible through a well-defined
>   set of APIs.  If your problem set extends beyond traditional dataflow
>   operators but can work in a dataflow model, Drill may be adapted to your
>   problem easily.
>   - One of these extensibility points is query language.  Because Drill
>   has a concrete Logical Plan, you could develop an alternative DSL that
>   better reflects your needs without having to rebuild the entire distributed
>   processing system.
> 
> Does that help at all?
> 
> Thanks,
> Jacques
> 
> 
> On Fri, Feb 15, 2013 at 12:48 PM, Ben Johnson <[email protected]> wrote:
> 
>> I recently had lunch with Ted Dunning and we discussed machine learning
>> and a database project I've been working on called Sky (http://skydb.io/).
>> After talking with him for a while he suggested I look at Drill as being a
>> possible platform to build my behavioral analysis tool on top of (instead
>> of writing it myself) and post a message to the mailing list.
>> 
>> My database essentially stores actions and state changes for individual
>> things. For example, you can store a web page visit or an e-mail open as an
>> action and you can store someone's name or salary as a state and track that
>> over time. It's basically like Datomic but actions can be tracked and data
>> related to actions is not persisted outside the context of the action (e.g.
>> a purchase amount only exists for an individual checkout action).
>> 
>> Another reason I wrote the database was that I needed my own way to
>> process this data. SQL tends to work with relational data but doesn't let
>> me relate a row in an "actions" table with another row that might be two
>> days before or might be in the next web site session.
>> 
>> Performance is also a big factor. I'm able to run analysis on my data at
>> around 10MM events/core/second and I'm hoping to get up to around 50MM
>> events/core/second once I do some profiling and optimization. Will I lose a
>> lot of performance having to join across data sources using Drillbits?
>> 
>> Ted said that I can accomplish a lot of this with Nested SQL and UDFs. Do
>> you think I will get the performance and flexibility I'm looking for? Could
>> Drill be a good platform to build this type of system on top of?
>> 
>> Also, here's a demo of interactive analysis using a small subset of the
>> GitHub Archive. It shows actions like pushes and repository creates and
>> then lets you follow what people do after those actions. It's currently
>> action only but future versions will let you segment by point-of-time state.
>> 
>> http://demo.skydb.io/
>> 
>> 
>> Ben Johnson