I recently had lunch with Ted Dunning and we discussed machine learning and a 
database project I've been working on called Sky (http://skydb.io/). After 
talking for a while, he suggested that I look at Drill as a possible platform 
for my behavioral analysis tool (instead of writing everything myself) and 
that I post a message to the mailing list.

My database essentially stores actions and state changes for individual things. 
For example, you can store a web page visit or an e-mail open as an action and 
you can store someone's name or salary as a state and track that over time. 
It's basically like Datomic but actions can be tracked and data related to 
actions is not persisted outside the context of the action (e.g. a purchase 
amount only exists for an individual checkout action).
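
As a rough illustration of the model described above, here is a small Python 
sketch. It is not Sky's actual API: the Event class, its field names, and the 
state_at helper are hypothetical and exist only to show the difference between 
state (carried forward over time) and action data (scoped to a single action).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """One point on an object's timeline (hypothetical structure)."""
    timestamp: int                      # assumed unit: seconds
    action: Optional[str] = None        # e.g. "checkout"; None for a pure state change
    action_data: Dict[str, Any] = field(default_factory=dict)    # lives only on this action
    state_changes: Dict[str, Any] = field(default_factory=dict)  # persists going forward


def state_at(events: List[Event], timestamp: int) -> Dict[str, Any]:
    """Fold state changes in time order, up to and including `timestamp`."""
    state: Dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > timestamp:
            break
        state.update(event.state_changes)  # action_data is deliberately ignored
    return state


timeline = [
    Event(100, state_changes={"name": "Bob", "salary": 50000}),
    Event(200, action="checkout", action_data={"purchase_amount": 25.0}),
    Event(300, state_changes={"salary": 60000}),
]
```

Here the salary is visible at any later point in time, while the purchase 
amount can only be read from the checkout event itself.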

Another reason I wrote the database was that I needed my own way to process 
this data. SQL works well with relational data but doesn't let me relate a 
row in an "actions" table to another row that might be from two days earlier 
or from the next web site session.
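
To make that kind of query concrete, here is a minimal Python sketch of the 
two operations involved: grouping one object's events into sessions and 
looking at what follows a given action. This is an illustration only, not how 
Sky is implemented; the function names, the (timestamp, action) tuple layout, 
and the 30-minute session timeout are all assumptions.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[int, str]   # (timestamp in seconds, action name); assumed layout
SESSION_GAP = 30 * 60     # assumed idle timeout that ends a session


def sessionize(events: List[Event], gap: int = SESSION_GAP) -> List[List[Event]]:
    """Split one object's time-sorted events into sessions by idle gap."""
    sessions: List[List[Event]] = []
    for ts, action in events:
        # Compare against the timestamp of the last event in the open session.
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, action))
        else:
            sessions.append([(ts, action)])
    return sessions


def next_action_counts(events: List[Event], target: str) -> Counter:
    """Count which action immediately follows each occurrence of `target`."""
    counts: Counter = Counter()
    for (_, current), (_, following) in zip(events, events[1:]):
        if current == target:
            counts[following] += 1
    return counts


events = [(0, "visit"), (60, "signup"), (5000, "visit"), (5100, "purchase")]
sessions = sessionize(events)
followers = next_action_counts(events, "visit")
```

Both functions walk a single object's events in time order, which is the 
access pattern that is awkward to express as a join over an "actions" table.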

Performance is also a big factor. I'm able to run analysis on my data at around 
10MM events/core/second and I'm hoping to get up to around 50MM 
events/core/second once I do some profiling and optimization. Will I lose a lot 
of performance if I have to join across data sources using Drillbits?

Ted said that I can accomplish a lot of this with Nested SQL and UDFs. Do you 
think I will get the performance and flexibility I'm looking for? Could Drill 
be a good platform to build this type of system on top of?

Also, here's a demo of interactive analysis using a small subset of the GitHub 
Archive. It shows actions like pushes and repository creates and then lets you 
follow what people do after those actions. It's currently action-only, but 
future versions will let you segment by point-in-time state.

http://demo.skydb.io/


Ben Johnson


