I recently had lunch with Ted Dunning and we discussed machine learning and a database project I've been working on called Sky (http://skydb.io/). After we talked for a while, he suggested that I look at Drill as a possible platform for my behavioral analysis tool (instead of writing everything myself) and that I post a message to the mailing list.
My database essentially stores actions and state changes for individual things. For example, you can store a web page visit or an e-mail open as an action, and you can store someone's name or salary as state and track it over time. It's basically like Datomic, except that actions can be tracked and data related to an action is not persisted outside the context of that action (e.g. a purchase amount only exists for an individual checkout action).

Another reason I wrote the database was that I needed my own way to process this data. SQL works well with relational data, but it doesn't let me relate a row in an "actions" table to another row that might be two days earlier or might be in the next web site session.

Performance is also a big factor. I'm able to run analysis on my data at around 10 million events/core/second, and I'm hoping to get up to around 50 million events/core/second once I do some profiling and optimization. Will I lose a lot of performance by having to join across data sources using drill bits?

Ted said that I can accomplish a lot of this with nested SQL and UDFs. Do you think I will get the performance and flexibility I'm looking for? Could Drill be a good platform to build this type of system on top of?

Also, here's a demo of interactive analysis using a small subset of the GitHub Archive. It shows actions like pushes and repository creates and then lets you follow what people do after those actions. It's currently action-only, but future versions will let you segment by point-in-time state.

http://demo.skydb.io/

Ben Johnson
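
For illustration, the kind of query described above (relating an action to whatever the same person does next, which is awkward in plain SQL) can be sketched in a few lines of Python. This is only a toy sketch: the event tuples, the `next_actions` name, and the sample data are all hypothetical and are not Sky's actual API or storage format.

```python
from collections import Counter, defaultdict


def next_actions(events, target):
    """Count which action each actor performs immediately after `target`.

    `events` is an iterable of (actor_id, timestamp, action) tuples.
    This layout is an assumption made for the example only.
    """
    # Group events into one timeline per actor.
    timelines = defaultdict(list)
    for actor_id, timestamp, action in events:
        timelines[actor_id].append((timestamp, action))

    counts = Counter()
    for timeline in timelines.values():
        timeline.sort()  # order each actor's events by timestamp
        # Walk consecutive pairs and tally what follows the target action.
        for (_, current), (_, following) in zip(timeline, timeline[1:]):
            if current == target:
                counts[following] += 1
    return counts


# Hypothetical sample data, loosely modeled on GitHub-style activity.
events = [
    ("alice", 1, "create_repo"),
    ("alice", 2, "push"),
    ("alice", 3, "push"),
    ("bob", 1, "create_repo"),
    ("bob", 5, "fork"),
    ("carol", 2, "push"),
    ("carol", 4, "create_repo"),
]

result = next_actions(events, "create_repo")
print(result)
```

The point of the sketch is that the analysis walks each actor's timeline in order rather than joining rows, which is the access pattern any replacement platform would need to support efficiently.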
