Hey Brian, welcome to the list!
Here are some thoughts.

On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <[email protected]> wrote:

> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>
> He gave us an overview of Drill, and I'm curious...
>
> Presently, we heavily use Storm + Cassandra.
>
> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>
> We treat CRUD operations as events. Then within Storm we calculate
> aggregate counts of entities flowing through the system by various
> dimensions. That works well, but we still need an ad hoc reporting
> capability, and a way to report on data in the system that is not
> active (historical).
>
> Seems like a great use case for Drill.
> Would it be possible to use the Drill engine against a Cassandra backend?
> If so, what does that mean? (implementing some API?)

Yes. One of our goals is to have a defined storage engine API, with required and optional features, so that new data sources can be added. In fact, DRILL-16 depends on DRILL-13, which specifically outlines this goal. DRILL-13 is the base API and DRILL-16 is the Cassandra implementation. Depending on your level of interest and time, we would love to have some help on DRILL-13.

> I assume that performance would be terrible unless somehow the data is
> stored using the columnar data format from the Dremel paper. Is that
> accurate? Does anyone know if anyone has attempted a translation of
> that format to Cassandra?

One of the visions behind Dremel and Drill is that full table scans are okay. Part of the reason is the compact format of the data and the fact that you only read the columns the query needs. I'd expect that for many schema designs, in-situ querying of Cassandra could be pretty effective. One of the things we've talked about is supporting caching transformations: e.g., the first time you query a source, it may be automatically reorganized into a more efficient format.
This works really well with HDFS's write-once scheme. It is harder with something like Cassandra, depending on how you're using it.

> Regardless, I'm very interested in getting involved and no stranger to
> getting my hands dirty.
> Let me know if you can provide any direction. (our entities are
> currently stored in JSON in Cassandra)

As mentioned above, if you wanted to start a discussion and work on DRILL-13, that would be very helpful. Since we're still very much in alpha development right now, another helpful item would be to document your rough schema, available secondary indexes, and example queries/needs on the wiki. You could then translate those into Drill logical plan syntax. We could use these as early test cases to ensure the system will support them effectively.

Welcome,
Jacques

> -brian
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://brianoneill.blogspot.com/
> twitter: @boneill42
>
