Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-20 Thread Aaron Morton
 “between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
 working”
 Can you confirm or disprove?


My reading of the code is that it will consider the part of each token range (from 
vnodes or initial tokens) that overlaps with the provided token range. 

 I’ve already got one confirmation that in C* version I use (1.2.15) setting 
 limits with setInputRange(startToken, endToken) doesn’t work.
Can you be more specific?

 work only for ordered partitioners (in 1.2.15).

It will work with ordered and unordered partitioners equally. The difference is 
probably what you consider “working” to mean. The token ranges are handled 
the same; it’s the rows in them that change. 
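
For reference, a minimal sketch of how these calls are wired into a Hadoop job 
(Hadoop 2 mapreduce API; the class and method names are from the 
org.apache.cassandra.hadoop package in the 1.2/2.0 lines, the address and token 
values are placeholders, and as discussed in this thread the range may be 
ignored before 2.0.7):

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TokenRangeJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");   // any live node
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_cf");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        // fetch every column of each row
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);
        // restrict the job to one token range; each node range that intersects
        // it should contribute only the overlapping portion to the splits
        ConfigHelper.setInputRange(conf,
                "-9223372036854775808",   // Murmur3 minimum token
                "-8301034833169298228");  // minimum + 5% of the token space
        Job job = Job.getInstance(conf);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        // ... set mapper, reducer and output as usual, then submit
    }
}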

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 11:37 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:

 Hi Aaron,
  
 I’ve seen the code which you describe (working with splits and intersections) 
 but that range is derived from keys and work only for ordered partitioners 
 (in 1.2.15).
 I’ve already got one confirmation that in C* version I use (1.2.15) setting 
 limits with setInputRange(startToken, endToken) doesn’t work.
 “between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
 working”
 Can you confirm or disprove?
  
 WBR,
 Anton
  
 From: Aaron Morton [mailto:aa...@thelastpickle.com] 
 Sent: Monday, May 19, 2014 1:58 AM
 To: Cassandra User
 Subject: Re: Cassandra token range support for Hadoop 
 (ColumnFamilyInputFormat)
  
 The limit is just ignored and the entire column family is scanned.
 Which limit? 
 
 
 1. Am I right that there is no way to get some data limited by token range 
 with ColumnFamilyInputFormat?
 From what I understand, setting the input range is used when calculating the 
 splits. The token ranges in the cluster are iterated, and if they intersect 
 with the supplied range, the overlapping range is used to calculate the split, 
 rather than the full token range. 
  
 2. Is there other way to limit the amount of data read from Cassandra with 
 Spark and ColumnFamilyInputFormat,
 so that this amount is predictable (like 5% of entire dataset)?
 If you supplied a token range that is 5% of the possible range of values 
 for the token, that should be close to a random 5% sample. 
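 
 To make that arithmetic concrete, a sketch of computing a 5% span in Java 
 (here for RandomPartitioner, whose tokens run from 0 to 2^127 - 1; 
 Murmur3Partitioner would be the same calculation over -2^63 .. 2^63 - 1):
 
 import java.math.BigInteger;
 
 public class FivePercentSpan {
     public static void main(String[] args) {
         BigInteger max = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);
         BigInteger start = BigInteger.ZERO;                    // any start point works
         BigInteger span = max.divide(BigInteger.valueOf(20));  // 5% of the token space
         BigInteger end = start.add(span);
         // hand these to ConfigHelper.setInputRange(conf, start.toString(), end.toString())
         System.out.println("startToken=" + start + " endToken=" + end);
     }
 }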
  
  
 Hope that helps. 
 Aaron
  
 -
 Aaron Morton
 New Zealand
 @aaronmorton
  
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
  
 On 14/05/2014, at 10:46 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:
 
 
 Greetings,
 
 I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd 
 like to read just part of it - something like Spark's sample() function.
 Cassandra's API seems to allow it with its 
 ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, 
 but it doesn't work.
 The limit is just ignored and the entire column family is scanned. It seems 
 this kind of feature is just not supported 
 and sources of AbstractColumnFamilyInputFormat.getSplits confirm that (IMO).
 Questions:
 1. Am I right that there is no way to get some data limited by token range 
 with ColumnFamilyInputFormat?
 2. Is there other way to limit the amount of data read from Cassandra with 
 Spark and ColumnFamilyInputFormat,
 so that this amount is predictable (like 5% of entire dataset)?
 
 
 WBR,
 Anton
 



Re: CQL 3 and wide rows

2014-05-20 Thread Aaron Morton
In a CQL 3 table the only **column** names are the ones defined in the table; 
in the example below there are three column names. 


 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
 
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.

Internally there may be more **cells** (as we now call the internal columns). 
In the example above each value for row_key will create a single partition (as 
we now call internal storage engine rows). In each of those partitions there 
will be cells for each CQL 3 row that has the same row_key; those cells will 
use a Composite for the name. The first part of the composite will be the value 
of the wide_row_column and the second will be the literal name of the 
non-primary-key columns. 
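
Concretely, for the widerow table above, inserting ('key1', 'a', 'x') and 
('key1', 'b', 'y') yields one partition whose internal cells look roughly like 
this (a sketch of the pre-3.0 storage layout, not literal tool output):

partition 'key1':
  cell name (composite)    cell value
  a:                       (empty CQL row marker)
  a:data_column            'x'
  b:                       (empty CQL row marker)
  b:data_column            'y'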

IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than in 
Thrift models. 

 But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
 when compared to CLI/Hector.
Nowadays you can do pretty much everything you can in the CLI. Provide an example 
and we may be able to help. 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 8:18 am, Maciej Miklas mac.mik...@gmail.com wrote:

 Hi James,
 
 Clustering is based on rows. I think that you meant not clustering columns, 
 but compound columns. Still, all columns belong to a single table and are stored 
 within a single folder on one computer. And it looks to me (but I’m not sure) 
 that the CQL 3 driver loads all column names into memory, which is confusing to 
 me. On one side we have a wide row, but we load the whole thing into RAM…
 
 My understanding of a wide row is a row that supports millions of columns, or 
 similar things like a map or set. In CLI you would generate column names (or 
 use compound columns) to simulate a set or map; in CQL 3 you would use some 
 static names plus Map or Set structures, or you could still alter the table and 
 have a large number of columns. But still - I do not see iteration, so it looks 
 to me that CQL 3 is limited when compared to CLI/Hector.
 
 
 Regards,
 Maciej
 
 On 19 May 2014, at 17:30, James Campbell ja...@breachintelligence.com wrote:
 
 Maciej,
 
 In CQL3 wide rows are expected to be created using clustering columns.  So 
 while the schema will have a relatively smaller number of named columns, the 
 effect is a wide row.  For example:
 
 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
 
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.
 
 James
 From: Maciej Miklas mac.mik...@gmail.com
 Sent: Monday, May 19, 2014 11:20 AM
 To: user@cassandra.apache.org
 Subject: CQL 3 and wide rows
  
 Hi *,
 
 I’ve checked the DataStax driver code for CQL 3, and it looks like the column 
 names for a particular table are fully loaded into memory; is this true?
 
 Cassandra should support wide rows, meaning tables with millions of columns. 
 Knowing that, I would expect some kind of iterator for column names. Am I 
 missing something here? 
 
 
 Regards,
 Maciej Miklas
 



Re: CQL 3 and wide rows

2014-05-20 Thread Jack Krupansky
To keep the terminology clear, your “row_key” is actually the “partition key”, 
and “wide_row_column” is actually a “clustering column”, and the combination of 
your row_key and wide_row_column is a “compound primary key”.

-- Jack Krupansky

From: Aaron Morton 
Sent: Tuesday, May 20, 2014 3:06 AM
To: Cassandra User 
Subject: Re: CQL 3 and wide rows

In a CQL 3 table the only **column** names are the ones defined in the table; 
in the example below there are three column names.


CREATE TABLE keyspace.widerow (

row_key text,

wide_row_column text,

data_column text,

PRIMARY KEY (row_key, wide_row_column));


Check out, for example, 
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.

Internally there may be more **cells** (as we now call the internal columns). 
In the example above each value for row_key will create a single partition (as 
we now call internal storage engine rows). In each of those partitions there 
will be cells for each CQL 3 row that has the same row_key; those cells will 
use a Composite for the name. The first part of the composite will be the value 
of the wide_row_column and the second will be the literal name of the 
non-primary-key columns. 

IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than in 
Thrift models. 

  But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
when compared to CLI/Hector.
Nowadays you can do pretty much everything you can in the CLI. Provide an example 
and we may be able to help. 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 8:18 am, Maciej Miklas mac.mik...@gmail.com wrote:


  Hi James, 

  Clustering is based on rows. I think that you meant not clustering columns, 
but compound columns. Still, all columns belong to a single table and are stored 
within a single folder on one computer. And it looks to me (but I’m not sure) 
that the CQL 3 driver loads all column names into memory, which is confusing to 
me. On one side we have a wide row, but we load the whole thing into RAM…

  My understanding of a wide row is a row that supports millions of columns, or 
similar things like a map or set. In CLI you would generate column names (or use 
compound columns) to simulate a set or map; in CQL 3 you would use some static 
names plus Map or Set structures, or you could still alter the table and have a 
large number of columns. But still - I do not see iteration, so it looks to me 
that CQL 3 is limited when compared to CLI/Hector.


  Regards,
  Maciej

  On 19 May 2014, at 17:30, James Campbell ja...@breachintelligence.com wrote:


Maciej,


In CQL3 wide rows are expected to be created using clustering columns.  
So while the schema will have a relatively smaller number of named columns, the 
effect is a wide row.  For example:


CREATE TABLE keyspace.widerow (

row_key text,

wide_row_column text,

data_column text,

PRIMARY KEY (row_key, wide_row_column));


Check out, for example, 
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.


James




From: Maciej Miklas mac.mik...@gmail.com
Sent: Monday, May 19, 2014 11:20 AM
To: user@cassandra.apache.org
Subject: CQL 3 and wide rows 

Hi *, 

I’ve checked the DataStax driver code for CQL 3, and it looks like the column 
names for a particular table are fully loaded into memory; is this true?

Cassandra should support wide rows, meaning tables with millions of 
columns. Knowing that, I would expect some kind of iterator for column names. 
Am I missing something here? 


Regards,
Maciej Miklas



Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
yes :)

On 20 May 2014, at 14:24, Jack Krupansky j...@basetechnology.com wrote:

 To keep the terminology clear, your “row_key” is actually the “partition 
 key”, and “wide_row_column” is actually a “clustering column”, and the 
 combination of your row_key and wide_row_column is a “compound primary key”.
  
 -- Jack Krupansky
  
 From: Aaron Morton
 Sent: Tuesday, May 20, 2014 3:06 AM
 To: Cassandra User
 Subject: Re: CQL 3 and wide rows
  
 In a CQL 3 table the only **column** names are the ones defined in the table; 
 in the example below there are three column names. 
  
  
 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
  
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.
  
 Internally there may be more **cells** (as we now call the internal 
 columns). In the example above each value for row_key will create a single 
 partition (as we now call internal storage engine rows). In each of those 
 partitions there will be cells for each CQL 3 row that has the same row_key; 
 those cells will use a Composite for the name. The first part of the 
 composite will be the value of the wide_row_column and the second will be the 
 literal name of the non-primary-key columns.
  
 IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than 
 in Thrift models.
  
 But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
 when compared to CLI/Hector.
 Nowadays you can do pretty much everything you can in the CLI. Provide an example 
 and we may be able to help.
  
 Cheers
 Aaron
  
 -
 Aaron Morton
 New Zealand
 @aaronmorton
  
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
  
 On 20/05/2014, at 8:18 am, Maciej Miklas mac.mik...@gmail.com wrote:
 
 Hi James,
  
 Clustering is based on rows. I think that you meant not clustering columns, 
 but compound columns. Still, all columns belong to a single table and are 
 stored within a single folder on one computer. And it looks to me (but I’m 
 not sure) that the CQL 3 driver loads all column names into memory, which is 
 confusing to me. On one side we have a wide row, but we load the whole thing 
 into RAM…
  
 My understanding of a wide row is a row that supports millions of columns, or 
 similar things like a map or set. In CLI you would generate column names (or 
 use compound columns) to simulate a set or map; in CQL 3 you would use some 
 static names plus Map or Set structures, or you could still alter the table 
 and have a large number of columns. But still - I do not see iteration, so it 
 looks to me that CQL 3 is limited when compared to CLI/Hector.
  
  
 Regards,
 Maciej
  
 On 19 May 2014, at 17:30, James Campbell ja...@breachintelligence.com 
 wrote:
 
 Maciej,
  
 In CQL3 wide rows are expected to be created using clustering columns.  
 So while the schema will have a relatively smaller number of named columns, 
 the effect is a wide row.  For example:
  
 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
  
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.
  
 James
 From: Maciej Miklas mac.mik...@gmail.com
 Sent: Monday, May 19, 2014 11:20 AM
 To: user@cassandra.apache.org
 Subject: CQL 3 and wide rows
  
 Hi *,
  
 I’ve checked the DataStax driver code for CQL 3, and it looks like the column 
 names for a particular table are fully loaded into memory; is this true?
  
 Cassandra should support wide rows, meaning tables with millions of 
 columns. Knowing that, I would expect some kind of iterator for column names. 
 Am I missing something here?
  
  
 Regards,
 Maciej Miklas
 
  
 
  



Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
Hi Aaron,

Thanks for the answer!


Let's consider this CLI code:

for(int i = 0 ; i < 10_000_000 ; i++) {
  set['rowKey1']['myCol::' + i] = UUID.randomUUID();
}


The code above will create a single row that contains 10^7 columns sorted by 
i. This will work fine, and this is the wide row to my understanding - a row 
that holds many columns AND lets me read just part of it with the right slice 
query. On the other hand, I can iterate over all columns without latencies 
because the data is stored on a single node. I’ve been using similar structures 
as a replacement for secondary indexes - it’s a well-known pattern.

How would I model it in CQL 3?

1) I could create a Map, but Maps are fully loaded into memory, and a Map 
containing 10^7 elements is definitely a problem. Plus it’s a big waste of RAM 
considering that I only need to read a small subset.

2) I could alter the table for each new column, which would create a structure 
similar to the one from my CLI example. But it looks to me that all column 
names are loaded into RAM, which is still a large limitation. I hope that I am 
wrong here - I am not sure.

3) I could redesign my model and divide the data into many rows, but why would 
I do that if I can use wide rows?

My idea of a wide row is a row that can hold a large number of key-value pairs 
(in any form), where I can filter on those keys to efficiently load only the 
part I currently need.
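
A sketch of the kind of iteration I am after, as it would look with the 
DataStax Java driver 2.0 (automatic paging needs the Cassandra 2.0 native 
protocol; the table is the widerow example from earlier in the thread, and the 
contact point and fetch size are illustrative):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class WideRowIteration {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("keyspace");
        Statement stmt = new SimpleStatement(
            "SELECT wide_row_column, data_column FROM widerow WHERE row_key = 'rowKey1'");
        stmt.setFetchSize(1000);   // cells arrive in pages of ~1000 CQL rows
        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {       // the driver fetches further pages transparently
            System.out.println(row.getString("wide_row_column")
                    + " = " + row.getString("data_column"));
        }
        cluster.close();
    }
}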


Regards,
Maciej 


On 20 May 2014, at 09:06, Aaron Morton aa...@thelastpickle.com wrote:

 In a CQL 3 table the only **column** names are the ones defined in the table; 
 in the example below there are three column names. 
 
 
 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
 
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.
 
 Internally there may be more **cells** (as we now call the internal 
 columns). In the example above each value for row_key will create a single 
 partition (as we now call internal storage engine rows). In each of those 
 partitions there will be cells for each CQL 3 row that has the same row_key; 
 those cells will use a Composite for the name. The first part of the 
 composite will be the value of the wide_row_column and the second will be the 
 literal name of the non-primary-key columns. 
 
 IMHO wide partitions (storage engine rows) are more prevalent in CQL 3 than 
 in Thrift models. 
 
 But still - I do not see Iteration, so it looks to me that CQL 3 is limited 
 when compared to CLI/Hector.
 Nowadays you can do pretty much everything you can in the CLI. Provide an example 
 and we may be able to help. 
 
 Cheers
 Aaron
 
 -
 Aaron Morton
 New Zealand
 @aaronmorton
 
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
 
 On 20/05/2014, at 8:18 am, Maciej Miklas mac.mik...@gmail.com wrote:
 
 Hi James,
 
 Clustering is based on rows. I think that you meant not clustering columns, 
 but compound columns. Still, all columns belong to a single table and are 
 stored within a single folder on one computer. And it looks to me (but I’m 
 not sure) that the CQL 3 driver loads all column names into memory, which is 
 confusing to me. On one side we have a wide row, but we load the whole thing 
 into RAM…
 
 My understanding of a wide row is a row that supports millions of columns, or 
 similar things like a map or set. In CLI you would generate column names (or 
 use compound columns) to simulate a set or map; in CQL 3 you would use some 
 static names plus Map or Set structures, or you could still alter the table 
 and have a large number of columns. But still - I do not see iteration, so it 
 looks to me that CQL 3 is limited when compared to CLI/Hector.
 
 
 Regards,
 Maciej
 
 On 19 May 2014, at 17:30, James Campbell ja...@breachintelligence.com 
 wrote:
 
 Maciej,
 
 In CQL3 wide rows are expected to be created using clustering columns.  
 So while the schema will have a relatively smaller number of named columns, 
 the effect is a wide row.  For example:
 
 CREATE TABLE keyspace.widerow (
 row_key text,
 wide_row_column text,
 data_column text,
 PRIMARY KEY (row_key, wide_row_column));
 
 Check out, for example, 
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1.
 
 James
 From: Maciej Miklas mac.mik...@gmail.com
 Sent: Monday, May 19, 2014 11:20 AM
 To: user@cassandra.apache.org
 Subject: CQL 3 and wide rows
  
 Hi *,
 
 I’ve checked the DataStax driver code for CQL 3, and it looks like the column 
 names for a particular table are fully loaded into memory; is this true?
 
 Cassandra should support wide rows, meaning tables with millions of 
 columns. Knowing that, I would expect some kind of iterator for column names. 
 Am I missing something here? 
 
 
 Regards,
 Maciej Miklas
 
 



Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
Hello,

Has anyone disabled file system journaling on Cassandra nodes? Does it make
any difference on write performance?

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable FS journaling

2014-05-20 Thread Samir Faci
I'm not sure you'd be gaining much by doing this.  This is probably
dependent on the file system you're referring to when you say
journaling.  There are a few of them around.

You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
google search linked me to this:

http://blog.serverbuddies.com/disable-journaling-in-ext3-file-system/

looks like:

tune2fs -O ^has_journal /dev/xdy   # disable journaling
tune2fs -j /dev/xdy                # enable journaling





On Tue, May 20, 2014 at 7:43 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Hello,

 Has anyone disabled file system journaling on Cassandra nodes? Does it
 make any difference on write performance?

 Cheers,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
Samir Faci
*insert title*
fortune | cowsay -f /usr/share/cows/tux.cow

Sent from my non-iphone laptop.


Re: CQL 3 and wide rows

2014-05-20 Thread Nate McCall
Something like this might work:


cqlsh:my_keyspace> CREATE TABLE my_widerow (
               ...   id text,
               ...   my_col timeuuid,
               ...   PRIMARY KEY (id, my_col)
               ... ) WITH caching='KEYS_ONLY' AND
               ...   compaction={'class': 'LeveledCompactionStrategy'};
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
cqlsh:my_keyspace> select * from my_widerow;

 id | my_col
------------+--------------------------------------
 some_key_1 | 7266d240-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 73ba0630-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10

cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and
my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10;

 id | my_col
------------+--------------------------------------
 some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10

cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and
my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10 and my_col <
76227ab0-e030-11e3-a50d-8b2f9bfbfa10;

 id | my_col
------------+--------------------------------------
 some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
 some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10



These queries would all work fine from the DS Java Driver. Note that only
the cells that are needed are pulled into memory:


./bin/nodetool cfstats my_keyspace my_widerow
   ...
   Column Family: my_widerow
   ...
   Average live cells per slice (last five minutes): 6.0
   ...


This shows that we are slicing across 6 rows on average for the last couple
of select statements.
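
The same bounded slice from the Java driver side, as a sketch (driver 2.0 API;
the UUID literals are the ones from the session above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class WideRowSlice {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        ResultSet rs = session.execute(
            "SELECT id, my_col FROM my_widerow WHERE id = 'some_key_1'"
            + " AND my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10"
            + " AND my_col < 76227ab0-e030-11e3-a50d-8b2f9bfbfa10");
        for (Row row : rs) {   // only the four matching CQL rows come back
            System.out.println(row.getString("id") + " | " + row.getUUID("my_col"));
        }
        cluster.close();
    }
}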

Hope that helps.



-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Disable FS journaling

2014-05-20 Thread Michael Shuler

On 05/20/2014 09:54 AM, Samir Faci wrote:

I'm not sure you'd be gaining much by doing this.  This is probably
dependent on the file system you're referring to when you say
journaling.  There are a few of them around.

You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
google search linked me to this:


ext2/3 is not a good choice, for file-size-limit and performance 
reasons.


I started to search for a couple of links, and a quick check shows the links I 
posted a couple of years ago still seem to be interesting  ;)


http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

(repost from above)

Hopefully this is some good reading on the topic:

https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

one of the more interesting considerations:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

http://wiki.apache.org/cassandra/CassandraHardware

http://wiki.apache.org/cassandra/LargeDataSetConsiderations

http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference

--
Kind regards,
Michael


Re: Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
Thanks for the links!

Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But
just double checked and it's apparently not possible to disable journaling
on XFS.

One of our sysadmins suggested disabling journaling, since it's mostly
for recovery purposes, and Cassandra already does that pretty well with the
commitlog, replication and anti-entropy. It would anyway be nice to know if
there could be any performance benefit from it. But I personally don't
think it would help much, due to the append-only nature of Cassandra writes.


On Tue, May 20, 2014 at 12:43 PM, Michael Shuler mich...@pbandjelly.orgwrote:

 On 05/20/2014 09:54 AM, Samir Faci wrote:

 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There are a few of them around.

 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:


 ext2/3 is not a good choice, for file-size-limit and performance
 reasons.

 I started to search for a couple links, and a quick check of the links I
 posted a couple years ago seem to still be interesting  ;)

 http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

 (repost from above)

 Hopefully this is some good reading on the topic:

 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

 http://wiki.apache.org/cassandra/CassandraHardware

 http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference

 --
 Kind regards,
 Michael




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: CQL 3 and wide rows

2014-05-20 Thread Maciej Miklas
Thank you Nate - now I understand it! This is a real improvement compared 
to the CLI :)

Regards,
Maciej


On 20 May 2014, at 17:16, Nate McCall n...@thelastpickle.com wrote:

 Something like this might work:
 
 
 cqlsh:my_keyspace> CREATE TABLE my_widerow (
                ...   id text,
                ...   my_col timeuuid,
                ...   PRIMARY KEY (id, my_col)
                ... ) WITH caching='KEYS_ONLY' AND
                ...   compaction={'class': 'LeveledCompactionStrategy'};
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> insert into my_widerow (id, my_col) values ('some_key_1',now());
 cqlsh:my_keyspace> select * from my_widerow;
 
  id | my_col
 ------------+--------------------------------------
  some_key_1 | 7266d240-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 73ba0630-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10
 
 cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and 
 my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10;
 
  id | my_col
 ------------+--------------------------------------
  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 76227ab0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 76cfd1b0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 777364b0-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 7aa061b0-e030-11e3-a50d-8b2f9bfbfa10
 
 cqlsh:my_keyspace> select * from my_widerow where id = 'some_key_1' and 
 my_col > 73ba0630-e030-11e3-a50d-8b2f9bfbfa10 and my_col < 
 76227ab0-e030-11e3-a50d-8b2f9bfbfa10;
 
  id | my_col
 ------------+--------------------------------------
  some_key_1 | 74404d30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 74defe30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75569f30-e030-11e3-a50d-8b2f9bfbfa10
  some_key_1 | 75bf9a30-e030-11e3-a50d-8b2f9bfbfa10
 
 
 
 These queries would all work fine from the DS Java Driver. Note that only the 
 cells that are needed are pulled into memory:
 
 
 ./bin/nodetool cfstats my_keyspace my_widerow
...
Column Family: my_widerow
...
Average live cells per slice (last five minutes): 6.0
...
 
 
 This shows that we are slicing across 6 rows on average for the last couple 
 of select statements. 
 
 Hope that helps.
 
 
 
 -- 
 -
 Nate McCall
 Austin, TX
 @zznate
 
 Co-Founder & Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com



Re: Disable FS journaling

2014-05-20 Thread Terje Marthinussen
A journal-enabled filesystem is faster on almost all operations. 

Recovery here is more about saving you from waiting half an hour for a 
traditional full file system check.

Feel free to wait if you want though! :)

Regards,
Terje

 On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:
 
 Thanks for the links!
 
 Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But 
 just double checked and it's apparently not possible to disable journaling on 
 XFS.
 
 One of our sysadmins suggested disabling journaling, since it's mostly 
 for recovery purposes, and Cassandra already does that pretty well with the 
 commitlog, replication and anti-entropy. It would anyway be nice to know if 
 there could be any performance benefit from it. But I personally don't think 
 it would help much, due to the append-only nature of Cassandra writes.
 
 
 On Tue, May 20, 2014 at 12:43 PM, Michael Shuler mich...@pbandjelly.org 
 wrote:
 On 05/20/2014 09:54 AM, Samir Faci wrote:
 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There are a few of them around.
 
 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:
 
 ext2/3 is not a good choice, for file-size-limit and performance reasons.
 
 I started to search for a couple links, and a quick check of the links I 
 posted a couple years ago seem to still be interesting  ;)
 
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E
 
 (repost from above)
 
 Hopefully this is some good reading on the topic:
 
 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user
 
 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E
 
 http://wiki.apache.org/cassandra/CassandraHardware
 
 http://wiki.apache.org/cassandra/LargeDataSetConsiderations
 
 http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference
 
 -- 
 Kind regards,
 Michael
 
 
 
 -- 
 Paulo Motta
 
 Chaordic | Platform
 www.chaordic.com.br
 +55 48 3232.3200


Re: Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
On Tue, May 20, 2014 at 1:24 PM, Terje Marthinussen tmarthinus...@gmail.com
 wrote:

 Journal enabled is faster on almost all operations.


Good to know, thanks!



 Recovery here is more about saving you from waiting half an hour for a
 traditional full file system check.


On an EC2 environment you normally lose the machine anyway on failures, so
that's not of much use in that case.


 Feel free to wait if you want though! :)

 Regards,
 Terje

 On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Thanks for the links!

 Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But
 just double checked and it's apparently not possible to disable journaling
 on XFS.

 One of our sysadmins suggested disabling journaling, since it's
 mostly for recovery purposes, and Cassandra already does that pretty well
 with the commitlog, replication and anti-entropy. It would anyway be nice to
 know if there could be any performance benefit from it. But I personally
 don't think it would help much, due to the append-only nature of Cassandra
 writes.


 On Tue, May 20, 2014 at 12:43 PM, Michael Shuler 
 mich...@pbandjelly.orgwrote:

 On 05/20/2014 09:54 AM, Samir Faci wrote:

 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There are a few of them around.

 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:


 ext2/3 is not a good choice, for file-size-limit and performance
 reasons.

 I started to search for a couple links, and a quick check of the links I
 posted a couple years ago seem to still be interesting  ;)

 http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

 (repost from above)

 Hopefully this is some good reading on the topic:

 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

 http://wiki.apache.org/cassandra/CassandraHardware

 http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference

 --
 Kind regards,
 Michael




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable FS journaling

2014-05-20 Thread Kevin Burton
My gut says you won't see much of a performance boost, especially if
you're on SSD, as the journal isn't going to be hindered by random write
speed.

Also, I *believe* you will lose filesystem metadata too… which Cassandra
doesn't protect you from.


On Tue, May 20, 2014 at 9:30 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:


 On Tue, May 20, 2014 at 1:24 PM, Terje Marthinussen 
 tmarthinus...@gmail.com wrote:

 Journal enabled is faster on almost all operations.


 Good to know, thanks!



 Recovery here is more about saving you from waiting half an hour for a
 traditional full file system check.


 On an EC2 environment you normally lose the machine anyway on failures, so
 that's not of much use in that case.


 Feel free to wait if you want though! :)

 Regards,
 Terje

 On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Thanks for the links!

 Forgot to mention, using XFS here, as suggested by the Cassandra wiki.
 But just double checked and it's apparently not possible to disable
 journaling on XFS.

 One of our sysadmins suggested disabling journaling, since it's
 mostly for recovery purposes, and Cassandra already does that pretty well
 with the commitlog, replication and anti-entropy. It would anyway be nice to
 know if there could be any performance benefit from it. But I personally
 don't think it would help much, due to the append-only nature of Cassandra
 writes.


 On Tue, May 20, 2014 at 12:43 PM, Michael Shuler 
 mich...@pbandjelly.orgwrote:

 On 05/20/2014 09:54 AM, Samir Faci wrote:

 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There are a few of them around.

 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:


 ext2/3 is not a good choice, for file-size-limit and performance
 reasons.

 I started to search for a couple links, and a quick check of the links I
 posted a couple years ago seem to still be interesting  ;)

 http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

 (repost from above)

 Hopefully this is some good reading on the topic:

 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

 http://wiki.apache.org/cassandra/CassandraHardware

 http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference

 --
 Kind regards,
 Michael




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile: https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Best partition type for Cassandra with JBOD

2014-05-20 Thread Kevin Burton
This has not been my experience… In my benchmarks over the years noatime
has mattered.

However, I might not have been scientifically motivated enough to falsify the
noatime hypothesis… specifically, I might just have fallen for confirmation
bias, assumed that noatime mattered, and moved on.

A real benchmark would work :)
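
For concreteness, the fstab entry being debated would look something like this
(device name, mount point, and the XFS choice are placeholders from the thread;
a sketch, not a recommendation):

# /etc/fstab - one JBOD data disk with atime updates disabled
/dev/xvdb   /var/lib/cassandra/data1   xfs   defaults,noatime,nodiratime   0 0

Note that on recent Linux kernels noatime already implies nodiratime.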


On Mon, May 19, 2014 at 11:53 AM, Bryan Talbot bryan.tal...@playnext.comwrote:

 For XFS, using noatime and nodiratime isn't really useful either.


 http://xfs.org/index.php/XFS_FAQ#Q:_Is_using_noatime_or.2Fand_nodiratime_at_mount_time_giving_any_performance_benefits_in_xfs_.28or_not_using_them_performance_decrease.29.3F




 On Sat, May 17, 2014 at 7:52 AM, James Campbell 
 ja...@breachintelligence.com wrote:

  Thanks for the thoughts!
 On May 16, 2014 4:23 PM, Ariel Weisberg ar...@weisberg.ws wrote:
  Hi,

 Recommending nobarrier (mount option barrier=0) when you don't know if a
 non-volatile cache is in play is probably not the way to go. A non-volatile
 cache will typically ignore write barriers if a given block device is
 configured to cache writes anyway.

 I am also skeptical you will see a boost in performance. Applications
 that want to defer and batch writes won't emit write barriers frequently,
 and when they do it's because the data has to be there. Filesystems depend
 on write barriers, although it is surprisingly hard to get a reordering that
 is really bad, because of the way journals are managed.

 Cassandra uses log structured storage and supports asynchronous periodic
 group commit so it doesn't need to emit write barriers frequently.

 Setting read ahead to zero on an SSD is necessary to get the maximum
 number of random reads, but will also disable prefetching for sequential
 reads. You need a lot less prefetching with an SSD due to the much faster
 response time, but it's still many microseconds.

 Someone with more Cassandra specific knowledge can probably give better
 advice as to when a non-zero read ahead make sense with Cassandra. This is
 something may be workload specific as well.

 Regards,
  Ariel

 On Fri, May 16, 2014, at 01:55 PM, Kevin Burton wrote:

 That and nobarrier… and probably noop for the scheduler if using SSD and
 setting readahead to zero...


  On Fri, May 16, 2014 at 10:29 AM, James Campbell 
 ja...@breachintelligence.com wrote:

  Hi all—



 What partition type is best/most commonly used for a multi-disk JBOD
 setup running Cassandra on CentOS 64bit?



 The datastax production server guidelines recommend XFS for data
 partitions, saying, “Because Cassandra can use almost half your disk space
 for a single file, use XFS when using large disks, particularly if using a
 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and
 essentially unlimited on 64-bit.”



 However, the same document also notes that “Maximum recommended capacity
 for Cassandra 1.2 and later is 3 to 5TB per node,” which makes me think
 16TB file sizes would be irrelevant (especially when not using RAID to
 create a single large volume).  What has been the experience of this group?



 I also noted that the guidelines don’t mention setting noatime and
 nodiratime flags in the fstab for data volumes, but I wonder if that’s a
 common practice.

 James




 --


  Founder/CEO Spinn3r.com
  Location: *San Francisco, CA*
  Skype: *burtonator*
  blog: http://burtonator.wordpress.com
  … or check out my Google+ profile: https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com
  War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.





 --
 Bryan Talbot
 Architect / Platform team lead, Aeria Games and Entertainment
 Silicon Valley | Berlin | Tokyo | Sao Paulo




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile: https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Ec2 Network I/O

2014-05-20 Thread Ben Bromhead
Also once you've got your phi_convict_threshold sorted, if you see these again 
check:

http://status.aws.amazon.com/ 

AWS does occasionally have the odd increased latency issue / outage. 
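
For reference, the failure detector setting Nate mentions below is a single 
line in cassandra.yaml (the default is 8; 12 is the EC2-friendly value he 
suggests, and it takes a restart on each node to apply):

phi_convict_threshold: 12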

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359


On 19/05/2014, at 1:15 PM, Nate McCall n...@thelastpickle.com wrote:

 It's a good idea to increase phi_convict_threshold to at least 12 on EC2. 
 Using placement groups and single-tenant systems will certainly help.
 
 Another optimization would be dedicating an Enhanced Network Interface 
 (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) 
 specifically for gossip traffic. 
 
 
 On Mon, May 19, 2014 at 1:36 PM, Phil Burress philburress...@gmail.com 
 wrote:
 Has anyone experienced network i/o issues with ec2? We are seeing a lot of 
 these in our logs:
 
 HintedHandOffManager.java (line 477) Timed out replaying hints to 
 /10.0.x.xxx; aborting (15 delivered)
 
 and these...
 
 Cannot handshake version with /10.0.x.xxx
 
 and these...
 
 java.io.IOException: Cannot proceed on repair because a neighbor 
 (/10.0.x.xxx) is dead: session failed
 
 Occurs on all of our nodes. Even though in all cases, the host that is being 
 reported as down or unavailable is up and readily 'pingable'.
 
 We are using shared tenancy on all our nodes (instance type m1.xlarge) with 
 cassandra 2.0.7. Any suggestions on how to debug these errors?
 
 Is there a recommendation to move to Placement Groups for Cassandra?
 
 Thanks!
 
 Phil 
 
 
 
 -- 
 -
 Nate McCall
 Austin, TX
 @zznate
 
 Co-Founder & Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com



CassandraStorage loader generating 2x many record?

2014-05-20 Thread Kevin Burton
This has to be a bug; either that, or I'm insane.

Here's my table in Cassandra:

CREATE TABLE test_source (
  id int ,
  primary key(id)
);

INSERT INTO test_source (ID) VALUES(1);
INSERT INTO test_source (ID) VALUES(2);
INSERT INTO test_source (ID) VALUES(3);
INSERT INTO test_source (ID) VALUES(4);

cqlsh:blogindex select * from test_source;

 id
----
  1
  2
  4
  3

(4 rows)

… now I load that into pig and run:

test_source = LOAD 'cassandra://blogindex/test_source' USING
CassandraStorage() AS (source, target: bag {T: tuple(name, value)});

dump test_source;

(4,{((),)})
(1,{((),)})
(2,{((),)})
(4,{((),)})
(1,{((),)})
(3,{((),)})
(3,{((),)})
(2,{((),)})

… now it COULD be a bug with 'dump' … but even then that's a bug.

I suspect that Cassandra might be getting confused and giving too many rows
to pig due to maybe duplicating input splits?
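
One quick cross-check would be to count the rows outside of Hadoop - a sketch
with the DataStax Java driver (2.0 API assumed); if it prints 4, matching
cqlsh, the duplication is on the input-split side:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CountCheck {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("blogindex");
        Row row = session.execute("SELECT count(*) FROM test_source").one();
        System.out.println(row.getLong(0));   // expect 4
        cluster.close();
    }
}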

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile: https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Is the tarball for a given release in a Maven repository somewhere?

2014-05-20 Thread Clint Kelly
Hi all,

I am using the maven assembly plugin to build a project that contains
a development environment for a project that we've built at work on
top of Cassandra.  I'd like this development environment to include
the latest release of Cassandra.

Is there a maven repo anywhere that contains an artifact with the
Cassandra release in it?  I'd like to have the same Cassandra tarball
that you can download from the website be a dependency for my project.
 I can then have the assembly plugin untar it and customize some of
the conf files before taring up our entire development environment.
That way, anyone using our development environment would have access
to the various shell scripts and tools.

I poked around online and could not find what I was looking for.  Any
help would be appreciated!
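
For concreteness, the dependency I am imagining would look like this, assuming
the binary tarball is published to Maven Central under the bin classifier for
the version in question (worth verifying on search.maven.org for
org.apache.cassandra:apache-cassandra before relying on it):

<dependency>
  <groupId>org.apache.cassandra</groupId>
  <artifactId>apache-cassandra</artifactId>
  <version>2.0.7</version>
  <classifier>bin</classifier>
  <type>tar.gz</type>
</dependency>

The assembly plugin (or dependency:unpack) could then untar it into the build.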

Best regards,
Clint


RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-20 Thread Anton Brazhnyk
I went with recommendations to create my own input format or backport the 2.0.7 
code and it works now.
To be more specific...
AbstractColumnFamilyInputFormat.getSplits(JobContext) handled just the case 
with an ordered partitioner and ranges based on keys.
It did convert keys to tokens and used all the support which is there at the 
low level (which you are probably talking about).
BUT there was no way to engage that support via ColumnFamilyInputFormat and 
ConfigHelper.setInputRange(startToken, endToken)
prior to 2.0.7 without tapping into the C* code.


-Original Message-
From: Aaron Morton [mailto:aa...@thelastpickle.com] 
Sent: Monday, May 19, 2014 11:58 PM
To: Cassandra User
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

 between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
 working
 Can you confirm or disprove?


My reading of the code is that it will consider the part of each token range (from 
vnodes or initial tokens) that overlaps with the provided token range. 

 I've already got one confirmation that in C* version I use (1.2.15) setting 
 limits with setInputRange(startToken, endToken) doesn't work.
Can you be more specific?

 work only for ordered partitioners (in 1.2.15).

It will work with ordered and unordered partitioners equally. The difference is 
probably what you consider "working" to mean. The token ranges are handled 
the same; it's the rows in them that change. 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 11:37 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:

 Hi Aaron,
  
 I've seen the code which you describe (working with splits and intersections) 
 but that range is derived from keys and work only for ordered partitioners 
 (in 1.2.15).
 I've already got one confirmation that in C* version I use (1.2.15) setting 
 limits with setInputRange(startToken, endToken) doesn't work.
 between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
 working
 Can you confirm or disprove?
  
 WBR,
 Anton
  
 From: Aaron Morton [mailto:aa...@thelastpickle.com]
 Sent: Monday, May 19, 2014 1:58 AM
 To: Cassandra User
 Subject: Re: Cassandra token range support for Hadoop 
 (ColumnFamilyInputFormat)
  
 The limit is just ignored and the entire column family is scanned.
 Which limit? 
 
 
 1. Am I right that there is no way to get some data limited by token range 
 with ColumnFamilyInputFormat?
 From what I understand, setting the input range is used when calculating the 
 splits. The token ranges in the cluster are iterated, and if they intersect 
 with the supplied range, the overlapping range is used to calculate the split, 
 rather than the full token range. 
  
 2. Is there other way to limit the amount of data read from Cassandra 
 with Spark and ColumnFamilyInputFormat, so that this amount is predictable 
 (like 5% of entire dataset)?
 If you supplied a token range that is 5% of the possible range of values 
 for the token, that should be close to a random 5% sample. 
  
  
 Hope that helps. 
 Aaron
  
 -
 Aaron Morton
 New Zealand
 @aaronmorton
  
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
  
 On 14/05/2014, at 10:46 am, Anton Brazhnyk anton.brazh...@genesys.com wrote:
 
 
 Greetings,
 
 I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd 
 like to read just part of it - something like Spark's sample() function.
 Cassandra's API seems to allow it with its 
 ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, 
 but it doesn't work.
 The limit is just ignored and the entire column family is scanned. It 
 seems this kind of feature is just not supported and sources of 
 AbstractColumnFamilyInputFormat.getSplits confirm that (IMO).
 Questions:
 1. Am I right that there is no way to get some data limited by token range 
 with ColumnFamilyInputFormat?
 2. Is there other way to limit the amount of data read from Cassandra 
 with Spark and ColumnFamilyInputFormat, so that this amount is predictable 
 (like 5% of entire dataset)?
 
 
 WBR,
 Anton
 




Memory issue

2014-05-20 Thread opensaf dev
Hi guys,

I am trying to run Cassandra on CentOS as a user X other than root or
cassandra. When I run as user cassandra, it starts and runs fine. But when
I run under user X, I get the below error once Cassandra starts, and the
system freezes totally.

*Insufficient memlock settings:*

WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable to
lock JVM memory (ENOMEM).
This can result in part of the JVM being swapped out, especially with
mmapped I/O enabled.
Increase RLIMIT_MEMLOCK or run Cassandra as root.


I have tried the tips available online to change the memlock and other
limits both for users cassandra and X, but it did not solve the problem.


What else should I consider when running Cassandra as a user other than cassandra/root?


Any help is much appreciated.


Thanks

Dev


RE: Memory issue

2014-05-20 Thread Romain HARDOUIN
Hi,

You have to define limits for the user. 
Here is an example for the user cassandra:

# cat /etc/security/limits.d/cassandra.conf 
cassandra   -   memlock unlimited
cassandra   -   nofile  10
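
The same entries are needed for whichever user actually runs the node - a 
sketch for the user X from the original message (the nofile value follows the 
DataStax production recommendations of the time; adjust it to your own 
standard):

# cat /etc/security/limits.d/X.conf 
X   -   memlock unlimited
X   -   nofile  100000

A check from a fresh login shell should then report the new limit:

# su - X -c 'ulimit -l'
unlimited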

best,

Romain

opensaf dev opensaf...@gmail.com wrote on 21/05/2014 06:59:05:

 From: opensaf dev opensaf...@gmail.com
 To: user@cassandra.apache.org 
 Date: 21/05/2014 07:00
 Subject: Memory issue
 
 Hi guys,
 
 I am trying to run Cassandra on CentOS as a user X other than root 
 or cassandra. When I run as user cassandra, it starts and runs fine.
 But when I run under user X, I get the below error once 
 Cassandra starts, and the system freezes totally.
 
 Insufficient memlock settings:
 WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable 
 to lock JVM memory (ENOMEM).
 This can result in part of the JVM being swapped out, especially 
 with mmapped I/O enabled.
 Increase RLIMIT_MEMLOCK or run Cassandra as root.
 
 
 I have tried the tips available online to change the memlock and 
 other limits both for users cassandra and X, but it did not solve the 
problem.
 

 What else should I consider when running Cassandra as a user other than 
 cassandra/root?
 
 
 Any help is much appreciated.
 
 
 Thanks
 Dev
 


RE: Memory issue

2014-05-20 Thread Romain HARDOUIN
Well... you have already changed the limits ;-)
Keep in mind that changes in the limits.conf file will not affect 
processes that are already running; restart Cassandra from a fresh login 
session for the new limits to take effect.

opensaf dev opensaf...@gmail.com wrote on 21/05/2014 06:59:05:

 From: opensaf dev opensaf...@gmail.com
 To: user@cassandra.apache.org 
 Date: 21/05/2014 07:00
 Subject: Memory issue
 
 Hi guys,
 
 I am trying to run Cassandra on CentOS as a user X other than root 
 or cassandra. When I run as user cassandra, it starts and runs fine.
 But when I run under user X, I get the below error once 
 Cassandra starts, and the system freezes totally.
 
 Insufficient memlock settings:
 WARN [main] 2011-06-15 09:58:56,861 CLibrary.java (line 118) Unable 
 to lock JVM memory (ENOMEM).
 This can result in part of the JVM being swapped out, especially 
 with mmapped I/O enabled.
 Increase RLIMIT_MEMLOCK or run Cassandra as root.
 
 
 I have tried the tips available online to change the memlock and 
 other limits both for users cassandra and X, but it did not solve the 
problem.
 

 What else should I consider when running Cassandra as a user other than 
 cassandra/root?
 
 
 Any help is much appreciated.
 
 
 Thanks
 Dev