Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
> "between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working" Can you confirm or disprove?

My reading of the code is that it will consider the part of each token range (from vnodes or initial tokens) that overlaps with the provided token range.

> I've already got one confirmation that in the C* version I use (1.2.15) setting limits with setInputRange(startToken, endToken) doesn't work.

Can you be more specific?

> works only for ordered partitioners (in 1.2.15)

It will work with ordered and unordered partitioners equally. The difference is probably in what you take "working" to mean. The token ranges are handled the same; it's the rows in them that change.

Cheers
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder
Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
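A rough Java sketch of that intersection step (not Cassandra's actual code; it assumes simple non-wrapping numeric tokens and ignores wrap-around ranges):

    import java.math.BigInteger;

    // Illustrative only: clip a node's token range against the range supplied
    // via ConfigHelper.setInputRange(); only the overlap becomes a split.
    public final class RangeIntersection {

        // Returns the overlap of the node range with the job range,
        // or null when the two ranges do not intersect.
        static BigInteger[] intersect(BigInteger nodeStart, BigInteger nodeEnd,
                                      BigInteger jobStart, BigInteger jobEnd) {
            BigInteger start = nodeStart.max(jobStart);
            BigInteger end = nodeEnd.min(jobEnd);
            return start.compareTo(end) < 0 ? new BigInteger[] { start, end } : null;
        }

        public static void main(String[] args) {
            BigInteger[] overlap = intersect(
                    new BigInteger("0"), new BigInteger("1000"),     // a node's range
                    new BigInteger("800"), new BigInteger("5000"));  // the job's input range
            System.out.println(overlap[0] + " .. " + overlap[1]);    // prints 800 .. 1000
        }
    }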
RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
I went with the recommendations to create my own input format / backport the 2.0.7 code, and it works now.

To be more specific: AbstractColumnFamilyInputFormat.getSplits(JobContext) handled only the case of an ordered partitioner with ranges based on keys. It did convert keys to tokens and used all the low-level support that is there (which is probably what you are talking about). BUT there was no way to engage that support via ColumnFamilyInputFormat and ConfigHelper.setInputRange(startToken, endToken) prior to 2.0.7 without tapping into the C* code.
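A minimal sketch of the split-filtering idea behind such a custom input format, assuming Murmur3/Random-style numeric tokens; the SplitFilter class is made up for illustration, and only ColumnFamilySplit's string token accessors are taken from the real API:

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.cassandra.hadoop.ColumnFamilySplit;
    import org.apache.hadoop.mapreduce.InputSplit;

    // Illustrative sketch: take the splits produced by the stock input format
    // and keep only those that overlap the token bounds we care about.
    public final class SplitFilter {

        // Keep the splits whose token range overlaps [start, end].
        static List<InputSplit> keepOverlapping(List<InputSplit> splits,
                                                BigInteger start, BigInteger end) {
            List<InputSplit> kept = new ArrayList<InputSplit>();
            for (InputSplit split : splits) {
                ColumnFamilySplit cfs = (ColumnFamilySplit) split;
                BigInteger s = new BigInteger(cfs.getStartToken());
                BigInteger e = new BigInteger(cfs.getEndToken());
                if (e.compareTo(start) > 0 && s.compareTo(end) < 0)
                    kept.add(split);
            }
            return kept;
        }
    }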
Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
> The limit is just ignored and the entire column family is scanned.

Which limit?

> 1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?

From what I understand, setting the input range is used when calculating the splits. The token ranges in the cluster are iterated and, if they intersect with the supplied range, the overlapping range is used to calculate the split, rather than the full token range.

> 2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that this amount is predictable (like 5% of the entire dataset)?

If you supplied a token range that is 5% of the possible range of values for the token, that should be close to a random 5% sample.

Hope that helps.
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder
Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
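A worked example of that last point for Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1; the start point chosen here is arbitrary:

    import java.math.BigInteger;

    // Compute a contiguous 5% slice of the Murmur3 token space. Because the
    // partitioner spreads rows uniformly over tokens, any such slice should be
    // close to a random 5% sample of the data.
    public final class TokenSlice {
        public static void main(String[] args) {
            BigInteger min   = BigInteger.valueOf(2).pow(63).negate();                // -2^63
            BigInteger max   = BigInteger.valueOf(2).pow(63).subtract(BigInteger.ONE); // 2^63 - 1
            BigInteger total = max.subtract(min);

            BigInteger start = min;                                                    // arbitrary start
            BigInteger end   = start.add(total.multiply(BigInteger.valueOf(5))
                                              .divide(BigInteger.valueOf(100)));

            // These are the strings you would pass as startToken / endToken.
            System.out.println("startToken = " + start);
            System.out.println("endToken   = " + end);
        }
    }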
RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Hi Aaron,

I've seen the code you describe (working with splits and intersections), but that range is derived from keys and works only for ordered partitioners (in 1.2.15). I've already got one confirmation that in the C* version I use (1.2.15) setting limits with setInputRange(startToken, endToken) doesn't work.

"between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working" - can you confirm or disprove?

WBR,
Anton
Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Hello Anton,

What version of Cassandra are you using? If it's between 1.2.6 and 2.0.6, setInputRange(startToken, endToken) is not working. This was fixed in 2.0.7: https://issues.apache.org/jira/browse/CASSANDRA-6436

If you can't upgrade, you can copy AbstractCFIF and CFIF to your project and apply the patch there.

Cheers,

Paulo

--
Paulo Motta
Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
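For reference, a minimal sketch of the job setup on a version where the bounds are honoured (2.0.7+, or a patched copy of the input format); host, keyspace, table and token values are placeholders, and the Hadoop 2 Job API is assumed:

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public final class TokenRangeJobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "token-range-scan");
            Configuration conf = job.getConfiguration();

            // Placeholder connection and column family settings.
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");

            // Restrict the scan to one contiguous slice of the token space
            // (placeholder token values; only honoured by a fixed/patched input format).
            ConfigHelper.setInputRange(conf, "-9223372036854775808", "-8000000000000000000");

            job.setInputFormatClass(ColumnFamilyInputFormat.class);
        }
    }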
Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Hi Anton,

One approach you could look at is to write a custom InputFormat that allows you to limit the token range of rows that you fetch (if the AbstractColumnFamilyInputFormat does not do what you want). Doing so is not too much work.

If you look at the class RowIterator within CqlRecordReader, you can see code in the constructor that creates a query with a certain token range:

    ResultSet rs = session.execute(cqlQuery,
            type.compose(type.fromString(split.getStartToken())),
            type.compose(type.fromString(split.getEndToken())));

I think you can make a new version of the InputFormat and just tweak this method to achieve what you want.

Alternatively, if you just want to get a sample of the data, you might want to change the InputFormat itself so that it queries only a subset of the total input splits (or CfSplits). That might be easier.

Best regards,

Clint

On Wed, May 14, 2014 at 6:29 PM, Anton Brazhnyk anton.brazh...@genesys.com wrote:

> Greetings,
>
> I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read just part of it - something like Spark's sample() function. Cassandra's API seems to allow doing it with its ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but it doesn't work. The limit is just ignored and the entire column family is scanned. It seems this kind of feature is just not supported, and the sources of AbstractColumnFamilyInputFormat.getSplits confirm that (IMO).
>
> Questions:
> 1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat?
> 2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that this amount is predictable (like 5% of the entire dataset)?
>
> WBR,
> Anton
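A small sketch of that second, simpler option Clint mentions (sampling the splits rather than bounding tokens); the SplitSampler helper is made up for illustration, and with splits of similar size keeping ~5% of them should give roughly a 5% sample:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import org.apache.hadoop.mapreduce.InputSplit;

    public final class SplitSampler {

        // Keep each split with probability `fraction`; a fixed seed makes the
        // sample repeatable across runs.
        static List<InputSplit> sample(List<InputSplit> splits, double fraction, long seed) {
            Random rnd = new Random(seed);
            List<InputSplit> kept = new ArrayList<InputSplit>();
            for (InputSplit split : splits)
                if (rnd.nextDouble() < fraction)
                    kept.add(split);
            return kept;
        }
    }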
RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Hi Paulo,

I'm using C* 1.2.15 and have no easy option to upgrade (at least not to the 2.0.* branch). I've started to look into whether I can implement my own variant of the InputFormat. Thanks a lot for the hint; I will for sure check how it's done in 2.0.7 and whether it's possible to backport it to the 1.2.* branch.

WBR,
Anton