The Thrift IF predates vnodes. I agree that's a reasonable alternative.

On Apr 2, 2014 12:47 PM, "Clint Kelly" <clint.ke...@gmail.com> wrote:
> Hi all,
>
> FWIW the HBase Hadoop InputFormat does not even do this kind of estimation of data density over various ranges; it just creates one split for every region between the start and stop keys of the scan. I'll probably just do something similar by combining token ranges for virtual nodes that share hosts and creating input splits that way. I think the previous approach I had taken was overengineering this somewhat.
>
> Best regards,
> Clint
>
> On Tue, Apr 1, 2014 at 2:08 PM, Aleksey Yeschenko <alek...@yeschenko.com> wrote:
>
> > This doesn't belong to CQL-the-language.
> >
> > However, this could be implemented as a virtual system column family - sooner or later we'd need something like this anyway. Then you'd just run SELECTs against it as if it were a regular column family.
> >
> > --
> > AY
> >
> > On Wednesday, April 2, 2014 at 00:03 AM, Tyler Hobbs wrote:
> >
> > > Split calculation can't be done client-side because it requires key sampling (which requires reading the index summary). This would have to be added to CQL.
> > >
> > > Since I can't see any alternatives and this is required for good Hadoop support, would you mind opening a ticket to add support for this?
> > >
> > > On Sun, Mar 30, 2014 at 8:31 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
> > >
> > > > Hi Shao-Chuan,
> > > >
> > > > I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable index are stored in memory, correct? I was under the impression that this is a property set in configuration.yaml and would not change as we add rows to or delete rows from a table.
> > > >
> > > > BTW please let me know if this conversation belongs on the users list.
> > > > I don't want to spam the dev list, but this seems like something that is kind of on the border between use and development. :)
> > > >
> > > > Best regards,
> > > > Clint
> > > >
> > > > On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> > > >
> > > > > Tyler mentioned that client.describe_ring(myKeyspace) can be replaced by a query of the system.peers table, which has the ring information. The challenge here is describe_splits_ex, which needs to estimate the number of rows in each sub token range (as you mentioned).
> > > > >
> > > > > From what I understand from trial and error so far, I don't think the DataStax Java driver can do describe_splits_ex via a simple API call. If you look at the implementation of CassandraServer.describe_splits_ex() and StorageService.instance.getSplits(), what it does is split a token range into several sub token ranges, with an estimated row count in each sub token range. Inside the StorageService.instance.getSplits() call, it also adjusts the split count based on an estimated row count. StorageService.instance.getSplits() is only publicly exposed via Thrift. It would be non-trivial to rebuild the same logic as StorageService.instance.getSplits() on the client.
> > > > >
> > > > > That said, it looks like we could implement the splits logic in AbstractColumnFamilyInputFormat.getSubSplits by querying system.schema_columnfamilies and using CFMetaData.fromSchema to construct a CFMetaData.
> > > > > Inside CFMetaData there is the indexInterval, which can be used to estimate the row count; the next step is to mimic the logic in StorageService.instance.getSplits() to divide a token range into several sub token ranges, using the TokenFactory (obtained from the partitioner) to construct the sub token ranges in AbstractColumnFamilyInputFormat.getSubSplits. Basically, it is moving the splitting code from the server side to the client side.
> > > > >
> > > > > Any thoughts?
> > > > >
> > > > > Shao-Chuan
> > > > >
> > > > > On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
> > > > >
> > > > > > I just saw this question about Thrift in the Hadoop/Cassandra integration in the discussion on the user list about freezing Thrift. I have been working on a project to integrate Hadoop 2 and Cassandra 2 and have been trying to move all of the way over to the Java driver and away from Thrift.
> > > > > >
> > > > > > I have finished most of the driver. It is still pretty rough, but I have been using it for testing a prototype of the Kiji platform (www.kiji.org) that uses Cassandra instead of HBase.
> > > > > >
> > > > > > One thing I have not been able to figure out is how to calculate input splits without Thrift. I am currently doing the following:
> > > > > >
> > > > > > map = client.describe_ring(myKeyspace);
> > > > > >
> > > > > > (where client is of type Cassandra.Client).
> > > > > >
> > > > > > This call returns a list of token ranges (max and min token values) for different nodes in the cluster.
> > > > > > We then use this information, along with another Thrift call,
> > > > > >
> > > > > > client.describe_splits_ex(cfName, range.start_token, range.end_token, splitSize);
> > > > > >
> > > > > > to estimate the number of rows in each token range, etc.
> > > > > >
> > > > > > I have looked all over the Java driver documentation and pinged the user list, and have not gotten any proposals that work for the Java driver. Does anyone here have any suggestions?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Best regards,
> > > > > > Clint
> > > > > >
> > > > > > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I just received the email from Jonathan on the dev mailing list regarding the deprecation of Thrift in 2.1.
> > > > > > >
> > > > > > > In fact, we migrated from the Thrift client to the native one several months ago; however, in Cassandra.hadoop there are still a lot of dependencies on the Thrift interface, for example describe_splits_ex in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > > > > > >
> > > > > > > Therefore, we had to keep both Thrift and native in our server, but the CRUD queries mainly go through the native protocol. However, Jonathan says "I don't know of any use cases for Thrift that can't be done in CQL". This statement makes me wonder whether there is something I don't know about the native protocol yet.
> > > > > > > So, does anyone know how to do "describing the splits" and "describing the local rings" using the native protocol?
> > > > > > >
> > > > > > > Also, cqlsh uses the Python client, which talks via the Thrift protocol too. Does that mean it will be migrated to the native protocol soon as well?
> > > > > > >
> > > > > > > Comments, pointers, suggestions are much appreciated.
> > > > > > >
> > > > > > > Many thanks,
> > > > > > >
> > > > > > > Shao-Chuan
> > >
> > > --
> > > Tyler Hobbs
> > > DataStax <http://datastax.com/>
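[Editor's note] The approach Clint settles on in the top of the thread - read the ring over the native protocol (e.g. SELECT peer, tokens FROM system.peers, plus system.local for the coordinator's own tokens), then combine the vnode token ranges that share a replica host into one input split per host, with no row-density estimation - can be sketched as below. This is an illustrative sketch only: TokenRange and combineByHost are hypothetical names, not part of Cassandra or the Java driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the HBase-style split scheme described in the thread:
// one input split per replica host, covering all of that host's vnode ranges.
public class SplitCombiner {

    // A token range and the host that primarily owns it (as read from system.peers).
    static final class TokenRange {
        final long start, end;
        final String primaryReplica;
        TokenRange(long start, long end, String primaryReplica) {
            this.start = start;
            this.end = end;
            this.primaryReplica = primaryReplica;
        }
    }

    /** Group the (many) vnode ranges by owning host: one split per host. */
    static Map<String, List<TokenRange>> combineByHost(List<TokenRange> ranges) {
        Map<String, List<TokenRange>> splits = new TreeMap<>();
        for (TokenRange r : ranges) {
            splits.computeIfAbsent(r.primaryReplica, h -> new ArrayList<>()).add(r);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<TokenRange> ring = Arrays.asList(
            new TokenRange(0, 100, "10.0.0.1"),
            new TokenRange(100, 200, "10.0.0.2"),
            new TokenRange(200, 300, "10.0.0.1"));
        // Two hosts own the three vnode ranges, so we get two splits.
        System.out.println(combineByHost(ring).size() + " splits"); // prints "2 splits"
    }
}
```

The design trade-off, as Clint notes, is the same one HBase makes: splits are no longer balanced by estimated row count, only by ownership, which avoids needing describe_splits_ex at all.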
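[Editor's note] The index-interval estimate Shao-Chuan and Clint discuss reduces to simple arithmetic: each in-memory index summary sample stands in for roughly indexInterval partitions, so sample count times interval approximates the row count, and the number of sub token ranges follows from the requested split size. A hedged sketch of that arithmetic, under the thread's description of getSplits(); the method names are illustrative, not Cassandra API.

```java
// Sketch of the row-count arithmetic behind describe_splits_ex as described in
// the thread; indexSamples is the number of index summary entries covering the
// token range, indexInterval the sampling interval from the configuration.
public class SplitEstimate {

    /** Each index sample represents ~indexInterval partitions on disk. */
    static long estimateRows(long indexSamples, int indexInterval) {
        return indexSamples * indexInterval;
    }

    /** How many sub token ranges to cut a range into for a target split size. */
    static long subSplitCount(long estimatedRows, long splitSize) {
        return Math.max(1, (estimatedRows + splitSize - 1) / splitSize); // ceiling division
    }

    public static void main(String[] args) {
        long rows = estimateRows(1000, 128); // 1000 samples at interval 128
        System.out.println(rows + " rows, " + subSplitCount(rows, 65536) + " sub-splits");
    }
}
```

Note Clint's caveat still applies: the interval is a static configuration value, so the estimate is only as good as the current index summary, and the summary itself is server-side state - which is exactly why Tyler concludes the calculation cannot move fully client-side.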