Where Drill could work with your db is by treating it as a data source that allows pushdown. Drill does the dataflow structuring, and in cases where you could have done the whole thing in your database, you will still do it all in your database. Where you can't do it all inside your db, some amount of data will exit into the data flow managed by Drill, and Drill handles the rest.
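
As a rough illustration of the pushdown idea above, here is a minimal Python sketch. It is not Drill's planner or API: the function `split_plan`, the operator names, and the capability set `sky_supports` are all hypothetical. It pushes the longest leading run of operators that the data source supports down to the source, and leaves the remainder to the engine's data flow.

```python
from typing import List, Set, Tuple


def split_plan(ops: List[str], source_supports: Set[str]) -> Tuple[List[str], List[str]]:
    """Split an ordered operator pipeline into (pushed_down, run_in_engine).

    Operators are pushed to the data source until the first one the source
    cannot execute; that operator and everything after it stay in the engine.
    """
    pushed: List[str] = []
    for i, op in enumerate(ops):
        if op not in source_supports:
            return pushed, ops[i:]
        pushed.append(op)
    return pushed, []


# Hypothetical capability set advertised by a storage engine.
sky_supports = {"scan", "filter", "aggregate"}

# Every operator is supported, so the whole query runs in the database.
fully_pushed = split_plan(["scan", "filter", "aggregate"], sky_supports)

# The join is not supported, so data exits the database after the filter
# and the engine handles the join and the aggregate.
partly_pushed = split_plan(["scan", "filter", "join", "aggregate"], sky_supports)
```

In the first call nothing is left for the engine; in the second, the scan and filter run in the database and the join and aggregate run in the engine.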

On Mon, Feb 18, 2013 at 1:26 PM, Ben Johnson <[email protected]> wrote:

> Jacques-
>
> Thank you very much for the explanation. That helps to answer a lot of my
> questions. I'm also aiming at doing full table scans (although I need to
> support sampling and per-record lookups sometimes) and my data is large
> enough to be spread across multiple machines. I'll have to do some more
> research around using a data flow model though.
>
> Thanks again for taking the time to respond!
>
>
> Ben Johnson
> [email protected]
>
>
> On Feb 16, 2013, at 11:46 AM, Jacques Nadeau wrote:
>
> > It could be. Since you know more about your problem set than we do,
> > here are ways to think about Drill:
> >
> >    - Drill is not a database but rather a query layer that works with a
> >    number of underlying storage technologies
> >    - Drill is designed primarily to do full table scans of all relevant
> >    data as opposed to maintaining things like indexes. If you need
> >    single record or point lookups, you're heavily dependent on the
> >    capabilities of the underlying data storage engine
> >    - Drill is designed for massive scale. If your data can fit on one
> >    computer now and for the foreseeable future, other options will
> >    probably work better.
> >    - Drill is designed specifically to be extensible through a
> >    well-defined set of APIs. If your problem set extends beyond
> >    traditional dataflow operators but can work in a dataflow model,
> >    Drill may be adapted to your problem easily.
> >    - One of these extensibility points is query language. Because Drill
> >    has a concrete Logical Plan, you could develop an alternative DSL
> >    that better reflects your needs without having to rebuild the entire
> >    distributed processing system.
> >
> > Does that help at all?
> >
> > Thanks,
> > Jacques
> >
> >
> > On Fri, Feb 15, 2013 at 12:48 PM, Ben Johnson <[email protected]> wrote:
> >
> >> I recently had lunch with Ted Dunning and we discussed machine
> >> learning and a database project I've been working on called Sky
> >> (http://skydb.io/). After talking with him for a while he suggested I
> >> look at Drill as being a possible platform to build my behavioral
> >> analysis tool on top of (instead of writing it myself) and post a
> >> message to the mailing list.
> >>
> >> My database essentially stores actions and state changes for
> >> individual things. For example, you can store a web page visit or an
> >> e-mail open as an action and you can store someone's name or salary as
> >> a state and track that over time. It's basically like Datomic but
> >> actions can be tracked and data related to actions is not persisted
> >> outside the context of the action (e.g. a purchase amount only exists
> >> for an individual checkout action).
> >>
> >> Another reason I wrote the database was that I needed my own way to
> >> process this data. SQL tends to work with relational data but doesn't
> >> let me relate a row in an "actions" table with another row that might
> >> be two days before or might be in the next web site session.
> >>
> >> Performance is also a big factor. I'm able to run analysis on my data
> >> at around 10MM events/core/second and I'm hoping to get up to around
> >> 50MM events/core/second once I do some profiling and optimization.
> >> Will I lose a lot of performance having to join across data sources
> >> using drill bits?
> >>
> >> Ted said that I can accomplish a lot of this with Nested SQL and UDFs.
> >> Do you think I will get the performance and flexibility I'm looking
> >> for? Could Drill be a good platform to build this type of system on
> >> top of?
> >>
> >> Also, here's a demo of interactive analysis using a small subset of
> >> the GitHub Archive. It shows actions like pushes and repository
> >> creates and then lets you follow what people do after those actions.
> >> It's currently action only but future versions will let you segment by
> >> point-of-time state.
> >>
> >> http://demo.skydb.io/
> >>
> >>
> >> Ben Johnson
