Hey crew. Thanks for all the useful replies.

With respect to data model/selective queries: understood. I'm open to, and had anticipated, creating wide-row indexes that would cut down on the range queries. With the right number of wide-row indexes supporting the appropriate dimensions, we can probably cut down on the requisite full table scans.
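To make the wide-row index idea concrete, here is a rough in-memory sketch (plain Python, no Cassandra driver). The entity keys, the "state" dimension, and every function name are made up for illustration; this is not our actual schema, only the general shape of one index row per dimension value with entity keys as sorted column names.

```python
from collections import defaultdict

# Hypothetical entity table: row key -> JSON-like document (illustrative data only).
entities = {
    "e1": {"type": "provider", "state": "PA"},
    "e2": {"type": "provider", "state": "NJ"},
    "e3": {"type": "facility", "state": "PA"},
}


def build_wide_row_index(rows, dimension):
    """Build one wide row per dimension value; each column name is an entity key."""
    index = defaultdict(set)
    for key, doc in rows.items():
        if dimension in doc:
            index[doc[dimension]].add(key)
    # Columns within a row are kept sorted by name; mimic that with sorted lists.
    return {value: sorted(keys) for value, keys in index.items()}


def full_scan(rows, dimension, value):
    """Baseline: inspect every row in the table."""
    return sorted(k for k, doc in rows.items() if doc.get(dimension) == value)


def indexed_lookup(index, value):
    """Read a single wide row instead of scanning the whole table."""
    return index.get(value, [])


state_index = build_wide_row_index(entities, "state")
print(indexed_lookup(state_index, "PA"))
```

The indexed lookup returns the same keys as the full scan but touches only one row, which is the trade we'd be making: one extra write per indexed dimension in exchange for avoiding the scan.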
I'm even open to creating a CF/table specifically to support the Dremel data model. (And I'm looking at the recent release of Cassandra's native support for collections to see if it helps with that approach.)
http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html

For cases where wide rows can't be constructed (e.g., we can't fully anticipate the dimensions needed), we might be able to handle full-table scans if we made the Drill API implementation aware of the partitions/token space in Cassandra. I saw that you mention locality on DRILL-13; vnode information from Cassandra might help there. With that, at least you could send the queries to the right host. (thinking out loud)

Regardless, I can certainly come up with a straw-man data model that I believe is common in the Cassandra community, and we can brainstorm to see what makes sense.

I'm certainly game for taking on DRILL-16 and contributing to DRILL-13. Solving this is a priority for us and Drill seems promising.

I didn't see any pointers to the Storage Engine API on the issue. I've got the code down from github, but didn't see much:

bone@zen:~/git/boneill42/incubator-drill/sandbox-> find . -name '*.java' | grep storage
./prototype/contrib/storage-hbase/src/main/java/org/apache/drill/App.java
./prototype/contrib/storage-hbase/src/test/java/org/apache/drill/AppTest.java

Can anybody point me in the right direction?

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive
King of Prussia, PA 19406
M: 215.588.6024
@boneill42 <http://www.twitter.com/boneill42>
healthmarketscience.com
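P.S. On the partition-aware scan idea above, here is a rough sketch of what I have in mind. It is a simplification under stated assumptions: the ring is a hypothetical list of (token, host) pairs, the token values are tiny made-up integers, replication is ignored (primary owner only), and none of these function names come from Drill or Cassandra. The one rule it relies on is that a token owns the range (previous token, token], with the lowest token's range wrapping around the ring.

```python
import bisect
from collections import defaultdict


def ranges_by_host(ring):
    """Group token ranges by owning host.

    ring: list of (token, host) pairs; with vnodes a host appears many times.
    Returns host -> list of (start, end) pairs meaning the range (start, end].
    """
    ordered = sorted(ring)
    plan = defaultdict(list)
    for i, (token, host) in enumerate(ordered):
        # For i == 0 this index is -1, i.e. the last token: the wrap-around range.
        prev_token = ordered[i - 1][0]
        plan[host].append((prev_token, token))
    return dict(plan)


def owner_of(ring, token):
    """Return the host whose range (prev, t] contains the given token."""
    ordered = sorted(ring)
    tokens = [t for t, _ in ordered]
    # First ring token >= the lookup token owns it; past the end wraps to the start.
    i = bisect.bisect_left(tokens, token)
    return ordered[i % len(ordered)][1]


# Hypothetical three-host ring with four vnode tokens.
ring = [(0, "a"), (25, "b"), (50, "a"), (75, "c")]
scan_plan = ranges_by_host(ring)
print(scan_plan)
```

A storage engine could then issue one sub-scan per (start, end] range and send each to the host that owns it, rather than pushing the whole scan through a single coordinator.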
On 1/21/13 2:23 PM, "Jacques Nadeau" <[email protected]> wrote:

>Hey Brian,
>
>Welcome to the list!
>
>Here are some thoughts:
>
>On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <[email protected]> wrote:
>
>> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
>> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>>
>> He gave us an overview of Drill, and I'm curious...
>>
>> Presently, we heavily use Storm + Cassandra.
>>
>> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>>
>> We treat CRUD operations as events. Then within Storm we calculate
>> aggregate counts of entities flowing through the system by various
>> dimensions. That works well, but we still need an ad hoc reporting
>> capability, and a way to report on data in the system that is not
>> active (historical).
>>
>> Seems like a great use case for Drill.
>>
>> Would it be possible to use the Drill engine against a Cassandra
>> backend? If so, what does that mean? (implementing some API?)
>
>Yes. One of our goals is to have a defined storage engine API with
>required and optional features to add new data sources. In fact, we have
>DRILL-16, which is dependent on DRILL-13, which specifically outlines this
>goal. DRILL-13 is the base API and DRILL-16 is the Cassandra
>implementation. Depending on your level of interest and time, we would
>love to have some help on DRILL-13.
>
>> I assume that performance would be terrible unless somehow the data is
>> stored using the columnar data format from the Dremel paper. Is that
>> accurate? Does anyone know if anyone has attempted a translation of
>> that format to Cassandra?
>
>One of the visions behind Dremel and Drill is that full table scans are
>okay. Part of the reason is the compact format of the data and the fact
>that you only read important columns. I'd expect that for many schema
>designs, in-situ querying of Cassandra could be pretty effective.
>
>One of the things we've talked about is supporting caching
>transformations. E.g., the first time you query a source, it may be
>automatically reorganized in a more efficient format. This works really
>well with HDFS's write-once scheme. Harder with something like Cassandra,
>depending on how you're using it.
>
>> Regardless, I'm very interested in getting involved and no stranger to
>> getting my hands dirty.
>> Let me know if you can provide any direction. (our entities are
>> currently stored in JSON in Cassandra)
>
>As mentioned above, if you wanted to start a discussion and work on
>DRILL-13, that would be very helpful. Since we're still very much in
>alpha development right now, another helpful item would be to document
>your rough schema, available secondary indexes and example queries/needs
>on the wiki. You could then translate those into Drill Logical plan
>syntax. We could use these as early test cases to ensure the system will
>support these effectively.
>
>Welcome,
>
>Jacques
>
>> -brian
>>
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile:215.588.6024
>> blog: http://brianoneill.blogspot.com/
>> twitter: @boneill42
