Jacques - Thank you very much for the explanation. That helps to answer a lot of my questions. I'm also aiming at doing full table scans (although I need to support sampling and per-record lookups sometimes), and my data is large enough to be spread across multiple machines. I'll have to do some more research on using a dataflow model, though.
Thanks again for taking the time to respond!

Ben Johnson
[email protected]

On Feb 16, 2013, at 11:46 AM, Jacques Nadeau wrote:

> It could be. Since you know more about your problem set than we do, here
> are ways to think about Drill:
>
> - Drill is not a database but rather a query layer that works with a
> number of underlying storage technologies.
> - Drill is designed primarily to do full table scans of all relevant
> data, as opposed to maintaining things like indexes. If you need
> single-record or point lookups, you're heavily dependent on the
> capabilities of the underlying data storage engine.
> - Drill is designed for massive scale. If your data can fit on one
> computer now and for the foreseeable future, other options will probably
> work better.
> - Drill is designed specifically to be extensible through a well-defined
> set of APIs. If your problem set extends beyond traditional dataflow
> operators but can work in a dataflow model, Drill may be adapted to your
> problem easily.
> - One of these extensibility points is the query language. Because Drill
> has a concrete Logical Plan, you could develop an alternative DSL that
> better reflects your needs without having to rebuild the entire
> distributed processing system.
>
> Does that help at all?
>
> Thanks,
> Jacques
>
>
> On Fri, Feb 15, 2013 at 12:48 PM, Ben Johnson <[email protected]> wrote:
>
>> I recently had lunch with Ted Dunning and we discussed machine learning
>> and a database project I've been working on called Sky (http://skydb.io/).
>> After talking with him for a while, he suggested I look at Drill as a
>> possible platform to build my behavioral analysis tool on top of (instead
>> of writing it myself) and post a message to the mailing list.
>>
>> My database essentially stores actions and state changes for individual
>> things.
>> For example, you can store a web page visit or an e-mail open as an
>> action, and you can store someone's name or salary as state and track
>> that over time. It's basically like Datomic, but actions can be tracked
>> and data related to actions is not persisted outside the context of the
>> action (e.g. a purchase amount only exists for an individual checkout
>> action).
>>
>> Another reason I wrote the database was that I needed my own way to
>> process this data. SQL tends to work with relational data but doesn't
>> let me relate a row in an "actions" table with another row that might be
>> two days earlier or might be in the next web site session.
>>
>> Performance is also a big factor. I'm able to run analysis on my data at
>> around 10MM events/core/second, and I'm hoping to get up to around 50MM
>> events/core/second once I do some profiling and optimization. Will I
>> lose a lot of performance having to join across data sources using
>> drill bits?
>>
>> Ted said that I can accomplish a lot of this with nested SQL and UDFs.
>> Do you think I will get the performance and flexibility I'm looking for?
>> Could Drill be a good platform to build this type of system on top of?
>>
>> Also, here's a demo of interactive analysis using a small subset of the
>> GitHub Archive. It shows actions like pushes and repository creates and
>> then lets you follow what people do after those actions. It's currently
>> action-only, but future versions will let you segment by point-in-time
>> state.
>>
>> http://demo.skydb.io/
>>
>>
>> Ben Johnson
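[Editor's note: the kind of per-actor sequence query described in the quoted message — relating an action to whatever the same actor does next, something plain relational SQL handles poorly — can be illustrated with a minimal sketch. This is not Sky's or Drill's actual code; the event tuple layout, action names, and 30-minute session window are all illustrative assumptions.]

```python
# Hypothetical sketch of a per-actor sequence scan: for each "signup"
# action, count what the same actor does next within a session window.
# Event format (actor_id, timestamp_seconds, action) is assumed, not Sky's.
from collections import defaultdict

SESSION_WINDOW = 30 * 60  # assumed session boundary: 30 minutes, in seconds

def next_actions(events, trigger):
    """Count the immediate follow-up actions after each `trigger` action."""
    # Group each actor's events so sequences can be scanned independently,
    # the same full-scan, per-record style of processing discussed above.
    by_actor = defaultdict(list)
    for actor, ts, action in events:
        by_actor[actor].append((ts, action))

    counts = defaultdict(int)
    for seq in by_actor.values():
        seq.sort()  # order each actor's events by timestamp
        for (ts, action), (next_ts, next_action) in zip(seq, seq[1:]):
            if action == trigger and next_ts - ts <= SESSION_WINDOW:
                counts[next_action] += 1
    return dict(counts)

events = [
    ("a", 0, "signup"), ("a", 60, "push"),
    ("b", 0, "signup"), ("b", 7200, "push"),   # falls outside the window
    ("c", 10, "signup"), ("c", 20, "create_repo"),
]
print(next_actions(events, "signup"))  # {'push': 1, 'create_repo': 1}
```

The grouping step is what a distributed engine would replace with a partition-by-actor shuffle; the inner loop is the sequential, timestamp-ordered scan that row-at-a-time SQL joins don't express naturally.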
