Hey Brian,

Yeah, the storage engine APIs haven't been defined yet. To expound a bit on what we had in the JIRA, the high-level goals include:

The primary interface is the Storage Engine Capabilities API. It should describe everything that the particular storage engine supports. This includes whether the storage engine supports serialization and deserialization, and what types of logical operators it supports internally. It also needs to include a description of statistics capabilities (e.g. supports approximate row keys, average row size, total data size, data distribution statistics, etc.) and metadata capabilities.

Statistics API: Provide the actual statistics information that is utilized during query planning.

Metadata API: Provide information about the available sub data sources (tables, keyspaces, etc.) along with locality information, schema information, type information, primary and secondary index types, partitioning information, etc. Portions of this information are used in query parsing, others in query planning, and others in execution planning.

Deserialization API: Convert a particular data source into one of our two canonical in-memory formats (row-based or column-based). Additionally, support particular types of logical operation pushdown.

Serialization API: Serialize the in-memory format back into the persistent storage format.

If you wanted to take a look at other projects' existing interfaces around each of these things and then try to draw up a design, that would be really helpful.

Jacques

On Mon, Jan 21, 2013 at 8:20 PM, Brian O'Neill <[email protected]> wrote:

> Hey crew. Thanks for all the useful replies.
>
> With respect to data model/selective queries:
> Understood. I am open to and anticipated creating wide-row indexes that
> would cut down on the range queries. With the right number of wide-row
> indexes that support the appropriate dimensions, we can probably cut down
> on the requisite full table scans.
>
> I'm even open to creating a CF/table specifically to support the Dremel
> data model.
> (And I'm looking at the recent release of Cassandra native
> support for collections to see if they help with that approach)
> http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html
>
> For cases where wide-rows can't be constructed (e.g. we can't fully
> anticipate the dimensions needed), we might be able to handle full-table
> scans if we made the Drill API implementation aware of the
> partitions/token-space in Cassandra. I saw that you mention locality on
> DRILL-13; vnode information from Cassandra might help there. With that, at
> least you could send the queries to the right host.
> (thinking out loud)
>
> Regardless, I can certainly come up with a straw-man data model that I
> believe is common in the Cassandra community, and we can brainstorm to see
> what makes sense.
>
> I'm certainly game for taking on DRILL-16 and contributing to DRILL-13.
> Solving this is a priority for us and Drill seems promising.
>
> I didn't see any pointers to the Storage Engine API on the issue. I've
> got the code down from github, but didn't see much:
> bone@zen:~/git/boneill42/incubator-drill/sandbox-> find . -name '*.java' | grep storage
> ./prototype/contrib/storage-hbase/src/main/java/org/apache/drill/App.java
> ./prototype/contrib/storage-hbase/src/test/java/org/apache/drill/AppTest.java
>
> Can anybody point me in the right direction?
>
> -brian
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
> This information transmitted in this email message is for the intended
> recipient only and may contain confidential and/or privileged material.
> If you received this email in error and are not the intended recipient, or
> the person responsible to deliver it to the intended recipient, please
> contact the sender at the email above and delete this email and any
> attachments and destroy any copies thereof. Any review, retransmission,
> dissemination, copying or other use of, or taking any action in reliance
> upon, this information by persons or entities other than the intended
> recipient is strictly prohibited.
>
> On 1/21/13 2:23 PM, "Jacques Nadeau" <[email protected]> wrote:
>
> >Hey Brian,
> >
> >Welcome to the list!
> >
> >Here are some thoughts
> >
> >On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <[email protected]> wrote:
> >
> >> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> >> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
> >>
> >> He gave us an overview of Drill, and I'm curious...
> >>
> >> Presently, we heavily use Storm + Cassandra.
> >>
> >> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
> >>
> >> We treat CRUD operations as events. Then within Storm we calculate
> >> aggregate counts of entities flowing through the system by various
> >> dimensions. That works well, but we still need an ad hoc reporting
> >> capability, and a way to report on data in the system that is not
> >> active (historical).
> >>
> >> Seems like a great use case for Drill.
> >>
> >> Would it be possible to use the Drill engine against a Cassandra backend?
> >> If so, what does that mean? (implementing some API?)
> >>
> >
> >Yes. One of our goals is to have a defined storage engine API with
> >required and optional features to add new data sources. In fact, we have
> >DRILL-16 which is dependent on DRILL-13 which specifically outlines this
> >goal. DRILL-13 is the base API and DRILL-16 is the Cassandra
> >implementation.
> >Depending on your level of interest and time, we would
> >love to have some help on DRILL-13.
> >
> >> I assume that performance would be terrible unless somehow the data is
> >> stored using the columnar data format from the Dremel paper. Is that
> >> accurate? Does anyone know if anyone has attempted a translation of
> >> that format to Cassandra?
> >>
> >One of the visions behind Dremel and Drill is that full table scans are
> >okay. Part of the reason is the compact format of the data and the fact
> >that you only read important columns. I'd expect that for many schema
> >designs, in-situ querying of Cassandra could be pretty effective.
> >
> >One of the things we've talked about is supporting caching
> >transformations. E.g. the first time you query a source, it may be
> >automatically reorganized in a more efficient format. This works really
> >well with HDFS's write-once scheme. Harder with something like Cassandra
> >depending on how you're using it.
> >
> >> Regardless, I'm very interested in getting involved and no stranger to
> >> getting my hands dirty.
> >> Let me know if you can provide any direction. (our entities are
> >> currently stored in JSON in Cassandra)
> >>
> >As mentioned above, if you wanted to start a discussion and work on
> >DRILL-13, that would be very helpful. Since we're still very much in
> >alpha development right now, another helpful item would be to document
> >your rough schema, available secondary indexes and example queries/needs
> >on the wiki. You could then translate those into Drill Logical plan
> >syntax. We could use these as early test cases to ensure the system will
> >support these effectively.
> >
> >Welcome,
> >
> >Jacques
> >
> >> -brian
> >>
> >> --
> >> Brian ONeill
> >> Lead Architect, Health Market Science (http://healthmarketscience.com)
> >> mobile:215.588.6024
> >> blog: http://brianoneill.blogspot.com/
> >> twitter: @boneill42
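
As a starting point for the DRILL-13 discussion, the five APIs listed in the reply at the top of this thread could be sketched in Java roughly as follows. This is only a straw man under stated assumptions: none of these type or method names (`Capability`, `StorageEngine`, `Statistics`, `Metadata`, `InMemoryEngine`, `StorageEngineSketch`) exist in the Drill code base, records are simplified to row-based maps, and the in-memory engine is a toy that exists only to make the sketch runnable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout: the thread defines no concrete API.

/** Features an engine can advertise through the Capabilities API. */
enum Capability {
    DESERIALIZATION, SERIALIZATION, OPERATOR_PUSHDOWN,
    AVERAGE_ROW_SIZE, TOTAL_DATA_SIZE, DATA_DISTRIBUTION, METADATA
}

/** Statistics API: figures a query planner could consume. */
interface Statistics {
    double averageRowSizeBytes();
    long totalDataSizeBytes();
}

/** Metadata API, reduced here to listing sub data sources (tables, keyspaces). */
interface Metadata {
    Set<String> subSources();
}

/** One engine, exposing its capabilities plus the other four APIs. */
interface StorageEngine {
    Set<Capability> capabilities();
    Metadata metadata();
    /** Empty when the engine does not advertise any statistics capability. */
    Optional<Statistics> statistics(String source);
    /** Deserialization API: read a source into a row-based in-memory form. */
    List<Map<String, Object>> read(String source);
    /** Serialization API: write in-memory rows back to the store. */
    void write(String source, List<Map<String, Object>> rows);
}

/** Toy engine backed by a map; it supports no pushdown and no statistics. */
final class InMemoryEngine implements StorageEngine {
    private final Map<String, List<Map<String, Object>>> tables = new HashMap<>();

    @Override
    public Set<Capability> capabilities() {
        return EnumSet.of(Capability.DESERIALIZATION,
                          Capability.SERIALIZATION,
                          Capability.METADATA);
    }

    @Override
    public Metadata metadata() {
        // Snapshot of the table names, sorted for stable output.
        return () -> new TreeSet<>(tables.keySet());
    }

    @Override
    public Optional<Statistics> statistics(String source) {
        return Optional.empty();
    }

    @Override
    public List<Map<String, Object>> read(String source) {
        return new ArrayList<>(tables.getOrDefault(source, Collections.emptyList()));
    }

    @Override
    public void write(String source, List<Map<String, Object>> rows) {
        tables.computeIfAbsent(source, k -> new ArrayList<>()).addAll(rows);
    }
}

final class StorageEngineSketch {
    public static void main(String[] args) {
        StorageEngine engine = new InMemoryEngine();
        Map<String, Object> row = new HashMap<>();
        row.put("name", "brian");
        engine.write("users", Collections.singletonList(row));

        // A planner would consult capabilities before relying on optional features.
        boolean pushdown = engine.capabilities().contains(Capability.OPERATOR_PUSHDOWN);
        System.out.println("pushdown supported: " + pushdown);
        System.out.println("sources: " + engine.metadata().subSources());
        System.out.println("rows: " + engine.read("users"));
    }
}
```

Modelling capabilities as a set that the planner checks first is one way to express the "required and optional features" split mentioned earlier in the thread: a Cassandra engine could advertise locality or pushdown support while a simpler source advertises only read access.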
