Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-20 Thread Aaron Morton
 “between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
 working”
 Can you confirm or disprove?


My reading of the code is that it will consider the part of each token range (from 
vnodes or initial tokens) that overlaps with the provided token range. 

 I’ve already got one confirmation that in C* version I use (1.2.15) setting 
 limits with setInputRange(startToken, endToken) doesn’t work.
Can you be more specific?

 work only for ordered partitioners (in 1.2.15).

It will work with ordered and unordered partitioners equally. The difference is 
probably in what you take “working” to mean. The token ranges are handled 
the same; it’s the rows in them that change. 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com




RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-20 Thread Anton Brazhnyk
I went with the recommendation to create my own input format (or backport the 2.0.7 
code), and it works now.
To be more specific:
AbstractColumnFamilyInputFormat.getSplits(JobContext) handled just the case 
with an ordered partitioner and ranges based on keys.
It did convert keys to tokens and used all the low-level support that is there 
(which is probably what you are talking about).
BUT there was no way to engage that support via ColumnFamilyInputFormat and 
ConfigHelper.setInputRange(startToken, endToken)
prior to 2.0.7 without tapping into the C* code.
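For reference, this is roughly how the job configuration is meant to be wired once 
the 2.0.7 behaviour is available (a sketch only, not verified against a live 
cluster; the contact point, keyspace/column family names, and token values below 
are placeholders):

```java
Configuration conf = job.getConfiguration();
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");       // placeholder contact point
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
ConfigHelper.setInputColumnFamily(conf, "my_ks", "my_cf");    // placeholder names
// Tokens are passed as strings in the partitioner's token format.
ConfigHelper.setInputRange(conf, "-9223372036854775808", "-8301034833169298229");
```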






Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-19 Thread Aaron Morton
 The limit is just ignored and the entire column family is scanned.
Which limit?

 1. Am I right that there is no way to get some data limited by token range 
 with ColumnFamilyInputFormat?
From what I understand, setting the input range is used when calculating the 
splits. The token ranges in the cluster are iterated and, if they intersect with 
the supplied range, the overlapping range (rather than the full token range) is 
used to calculate the split. 
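The clipping described above can be sketched in isolation (illustrative only: the 
class and method names are made up, and this non-wrapping simplification ignores 
the wrapped ranges getSplits also has to handle):

```java
// Sketch of the intersection getSplits performs: each node's token range
// is clipped to the requested input range, and only the overlap becomes
// a split. Non-wrapping ranges only, for brevity.
public class RangeIntersect {
    // Returns {start, end} of the overlap, or null when the ranges are disjoint.
    static long[] intersect(long aStart, long aEnd, long bStart, long bEnd) {
        long start = Math.max(aStart, bStart);
        long end = Math.min(aEnd, bEnd);
        return start < end ? new long[] { start, end } : null;
    }

    public static void main(String[] args) {
        // node range (0, 100] vs requested input range (50, 200] -> (50, 100]
        long[] overlap = intersect(0, 100, 50, 200);
        System.out.println(overlap[0] + " .. " + overlap[1]);
    }
}
```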

 2. Is there other way to limit the amount of data read from Cassandra with 
 Spark and ColumnFamilyInputFormat,
 so that this amount is predictable (like 5% of entire dataset)?
If you supply a token range that is 5% of the possible range of token values, 
that should be close to a random 5% sample. 
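As a rough illustration of that 5% idea, assuming the default Murmur3Partitioner 
(token space -2^63 .. 2^63-1; the class and method names here are invented for the 
sketch):

```java
import java.math.BigInteger;

// Sketch: compute a start/end token pair spanning `percent` percent of the
// Murmur3Partitioner ring, starting from the minimum token. The resulting
// strings are what you would hand to setInputRange(startToken, endToken).
public class TokenSample {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // Shift the start for a different (or random) slice of the ring.
    static BigInteger[] sampleRange(int percent) {
        BigInteger span = MAX.subtract(MIN);   // full token span, 2^64 - 1
        BigInteger width = span.multiply(BigInteger.valueOf(percent))
                               .divide(BigInteger.valueOf(100));
        return new BigInteger[] { MIN, MIN.add(width) };
    }

    public static void main(String[] args) {
        BigInteger[] r = sampleRange(5);
        System.out.println(r[0] + " .. " + r[1]);
    }
}
```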


Hope that helps. 
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 10:46 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:

 Greetings,
 
 I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd 
 like to read just part of it - something like Spark's sample() function.
 Cassandra's API seems to allow this with its 
 ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, 
 but it doesn't work.
 The limit is just ignored and the entire column family is scanned. It seems 
 this kind of feature is just not supported, 
 and the source of AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
 Questions:
 1. Am I right that there is no way to get data limited by token range 
 with ColumnFamilyInputFormat?
 2. Is there another way to limit the amount of data read from Cassandra with 
 Spark and ColumnFamilyInputFormat, 
 so that the amount is predictable (like 5% of the entire dataset)?
 
 
 WBR,
 Anton
 
 



RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-19 Thread Anton Brazhnyk
Hi Aaron,

I've seen the code which you describe (working with splits and intersections), 
but that range is derived from keys and works only for ordered partitioners (in 
1.2.15).
I've already got one confirmation that in the C* version I use (1.2.15), setting 
limits with setInputRange(startToken, endToken) doesn't work.
between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working
Can you confirm or disprove?

WBR,
Anton




Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-16 Thread Paulo Ricardo Motta Gomes
Hello Anton,

What version of Cassandra are you using? If it's between 1.2.6 and 2.0.6,
setInputRange(startToken, endToken) is not working.

This was fixed in 2.0.7:
https://issues.apache.org/jira/browse/CASSANDRA-6436

If you can't upgrade you can copy AbstractCFIF and CFIF to your project and
apply the patch there.

Cheers,

Paulo







-- 
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200


Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-16 Thread Clint Kelly
Hi Anton,

One approach you could look at is to write a custom InputFormat that
allows you to limit the token range of rows that you fetch (if the
AbstractColumnFamilyInputFormat does not do what you want).  Doing so
is not too much work.

If you look at the class RowIterator within CqlRecordReader, you can
see code in the constructor that creates a query with a certain token
range:

ResultSet rs = session.execute(
    cqlQuery,
    type.compose(type.fromString(split.getStartToken())),
    type.compose(type.fromString(split.getEndToken())));

I think you can make a new version of the InputFormat and just tweak
this method to achieve what you want.  Alternatively, if you just want
a sample of the data, you might change the InputFormat itself so that
it queries only a subset of the total input splits (or CfSplits).
That might be easier.
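That second suggestion can be sketched like this (illustrative only: a custom 
getSplits() could compute the full split list as usual and then keep a random 
fraction; plain generic elements stand in for InputSplit objects so the sketch 
stays self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Keep roughly `fraction` of the splits, chosen at random.
public class SplitSampler {
    static <T> List<T> sample(List<T> splits, double fraction, long seed) {
        Random rnd = new Random(seed);   // fixed seed -> reproducible sample
        List<T> kept = new ArrayList<>();
        for (T split : splits) {
            if (rnd.nextDouble() < fraction) {
                kept.add(split);
            }
        }
        return kept;
    }
}
```

Note that sampling whole splits is coarser than a token range: you get about 5% 
of the splits, not an exact 5% of the rows.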

Best regards,
Clint





RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-16 Thread Anton Brazhnyk
Hi Paulo,

I’m using C* 1.2.15 and have no easy option to upgrade (at least not to the 2.0.* 
branch).
I’ve started looking at whether I can implement my own variant of InputFormat.
Thanks a lot for the hint; I will certainly check how it’s done in 2.0.7 and 
whether it’s possible to backport it to the 1.2.* branch.


WBR,
Anton
