The Thrift IF predates vnodes. I agree that's a reasonable alternative.

On Apr 2, 2014 12:47 PM, "Clint Kelly" <clint.ke...@gmail.com> wrote:
> Hi all,
>
> FWIW the HBase Hadoop InputFormat does not even do this kind of estimation of data density over various ranges; it just creates one split for every region between the start and stop keys of the scan. I'll probably just do something similar by combining token ranges for virtual nodes that share hosts and creating input splits that way. I think the previous approach I had taken was overengineering this somewhat.
>
> Best regards,
> Clint
>
> On Tue, Apr 1, 2014 at 2:08 PM, Aleksey Yeschenko <alek...@yeschenko.com> wrote:
>
> > This doesn't belong to CQL-the-language.
> >
> > However, this could be implemented as a virtual system column family - sooner or later we'd need something like this anyway. Then you'd just run SELECTs against it as if it were a regular column family.
> >
> > --
> > AY
> >
> > On Wednesday, April 2, 2014 at 00:03 AM, Tyler Hobbs wrote:
> >
> > > Split calculation can't be done client-side because it requires key sampling (which requires reading the index summary). This would have to be added to CQL.
> > >
> > > Since I can't see any alternatives and this is required for good Hadoop support, would you mind opening a ticket to add support for this?
> > >
> > > On Sun, Mar 30, 2014 at 8:31 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
> > >
> > > > Hi Shao-Chuan,
> > > >
> > > > I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable index are stored in memory, correct? I was under the impression that this is a property set in configuration.yaml and would not change as we add rows to or delete rows from a table.
> > > >
> > > > BTW please let me know if this conversation belongs on the users list.
> > > > I don't want to spam the dev list, but this seems like something that is kind of on the border between use and development. :)
> > > >
> > > > Best regards,
> > > > Clint
> > > >
> > > > On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> > > >
> > > > > Tyler mentioned that client.describe_ring(myKeyspace) can be replaced by a query of the system.peers table, which has the ring information. The challenge here is describe_splits_ex, which needs to estimate the number of rows in each sub token range (as you mentioned).
> > > > >
> > > > > From what I understand from trial and error so far, I don't think the DataStax Java driver can do describe_splits_ex via a simple API call. If you look at the implementation of CassandraServer.describe_splits_ex() and StorageService.instance.getSplits(), what it does is split a token range into several sub token ranges, with an estimated row count in each sub token range. Inside the StorageService.instance.getSplits() call, it also adjusts the split count based on an estimated row count. StorageService.instance.getSplits() is only publicly exposed via Thrift. It would be non-trivial to rebuild the same logic as StorageService.instance.getSplits() on the client.
> > > > >
> > > > > That said, it looks like we could implement the splits logic in AbstractColumnFamilyInputFormat.getSubSplits by querying system.schema_columnfamilies and using CFMetaData.fromSchema to construct a CFMetaData.
> > > > > Inside CFMetaData there is the indexInterval, which can be used to estimate the row count; the next step is to mimic the logic in StorageService.instance.getSplits() to divide a token range into several sub token ranges, using the TokenFactory (obtained from the partitioner) to construct the sub token ranges in AbstractColumnFamilyInputFormat.getSubSplits. Basically, it is moving the splitting code from the server side to the client side.
> > > > >
> > > > > Any thoughts?
> > > > >
> > > > > Shao-Chuan
> > > > >
> > > > > On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
> > > > >
> > > > > > I just saw this question about Thrift in the Hadoop/Cassandra integration in the discussion on the user list about freezing Thrift. I have been working on a project to integrate Hadoop 2 and Cassandra 2 and have been trying to move all of the way over to the Java driver and away from Thrift.
> > > > > >
> > > > > > I have finished most of the driver. It is still pretty rough, but I have been using it for testing a prototype of the Kiji platform (www.kiji.org) that uses Cassandra instead of HBase.
> > > > > >
> > > > > > One thing I have not been able to figure out is how to calculate input splits without Thrift. I am currently doing the following:
> > > > > >
> > > > > > map = client.describe_ring(myKeyspace);
> > > > > >
> > > > > > (where client is of type Cassandra.Client).
> > > > > >
> > > > > > This call returns a list of token ranges (max and min token values) for different nodes in the cluster.
> > > > > > We then use this information, along with another Thrift call,
> > > > > >
> > > > > > client.describe_splits_ex(cfName, range.start_token, range.end_token, splitSize);
> > > > > >
> > > > > > to estimate the number of rows in each token range, etc.
> > > > > >
> > > > > > I have looked all over the Java driver documentation and pinged the user list, and have not gotten any proposals that work for the Java driver. Does anyone here have any suggestions?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Best regards,
> > > > > > Clint
> > > > > >
> > > > > > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I just received the email from Jonathan on the dev mailing list regarding the deprecation of Thrift in 2.1.
> > > > > > >
> > > > > > > In fact, we migrated from the Thrift client to the native one several months ago; however, in Cassandra.hadoop there are still a lot of dependencies on the Thrift interface, for example describe_splits_ex in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > > > > > >
> > > > > > > Therefore, we had to keep both Thrift and native in our server, but the CRUD queries mainly go through the native protocol. However, Jonathan says "I don't know of any use cases for Thrift that can't be done in CQL". This statement makes me wonder whether there is something I don't know about the native protocol yet.
> > > > > > > So, does anyone know how to do "describing the splits" and "describing the local rings" using the native protocol?
> > > > > > >
> > > > > > > Also, cqlsh uses the Python client, which talks via the Thrift protocol too. Does that mean it will be migrated to the native protocol soon as well?
> > > > > > >
> > > > > > > Comments, pointers, suggestions are much appreciated.
> > > > > > >
> > > > > > > Many thanks,
> > > > > > >
> > > > > > > Shao-Chuan
> > >
> > > --
> > > Tyler Hobbs
> > > DataStax <http://datastax.com/>
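[Editor's note] The approach Clint settles on in the top of the thread - read the ring over the native protocol (e.g. SELECT peer, tokens FROM system.peers, plus system.local for the coordinator's own tokens), then combine the vnode token ranges that share a replica host into one input split per host, with no row-density estimation - can be sketched as below. This is an illustrative sketch only: TokenRange and combineByHost are hypothetical names, not part of Cassandra or the Java driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the HBase-style split scheme described in the thread:
// one input split per replica host, covering all of that host's vnode ranges.
public class SplitCombiner {

    // A token range and the host that primarily owns it (as read from system.peers).
    static final class TokenRange {
        final long start, end;
        final String primaryReplica;
        TokenRange(long start, long end, String primaryReplica) {
            this.start = start;
            this.end = end;
            this.primaryReplica = primaryReplica;
        }
    }

    /** Group the (many) vnode ranges by owning host: one split per host. */
    static Map<String, List<TokenRange>> combineByHost(List<TokenRange> ranges) {
        Map<String, List<TokenRange>> splits = new TreeMap<>();
        for (TokenRange r : ranges) {
            splits.computeIfAbsent(r.primaryReplica, h -> new ArrayList<>()).add(r);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<TokenRange> ring = Arrays.asList(
            new TokenRange(0, 100, "10.0.0.1"),
            new TokenRange(100, 200, "10.0.0.2"),
            new TokenRange(200, 300, "10.0.0.1"));
        // Two hosts own the three vnode ranges, so we get two splits.
        System.out.println(combineByHost(ring).size() + " splits"); // prints "2 splits"
    }
}
```

The design trade-off, as Clint notes, is the same one HBase makes: splits are no longer balanced by estimated row count, only by ownership, which avoids needing describe_splits_ex at all.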
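[Editor's note] The index-interval estimate Shao-Chuan and Clint discuss reduces to simple arithmetic: each in-memory index summary sample stands in for roughly indexInterval partitions, so sample count times interval approximates the row count, and the number of sub token ranges follows from the requested split size. A hedged sketch of that arithmetic, under the thread's description of getSplits(); the method names are illustrative, not Cassandra API.

```java
// Sketch of the row-count arithmetic behind describe_splits_ex as described in
// the thread; indexSamples is the number of index summary entries covering the
// token range, indexInterval the sampling interval from the configuration.
public class SplitEstimate {

    /** Each index sample represents ~indexInterval partitions on disk. */
    static long estimateRows(long indexSamples, int indexInterval) {
        return indexSamples * indexInterval;
    }

    /** How many sub token ranges to cut a range into for a target split size. */
    static long subSplitCount(long estimatedRows, long splitSize) {
        return Math.max(1, (estimatedRows + splitSize - 1) / splitSize); // ceiling division
    }

    public static void main(String[] args) {
        long rows = estimateRows(1000, 128); // 1000 samples at interval 128
        System.out.println(rows + " rows, " + subSplitCount(rows, 65536) + " sub-splits");
    }
}
```

Note Clint's caveat still applies: the interval is a static configuration value, so the estimate is only as good as the current index summary, and the summary itself is server-side state - which is exactly why Tyler concludes the calculation cannot move fully client-side.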