Why can't I do a "count(*) ... ALLOW FILTERING" without facing an operation timeout?
Hi, This is probably a silly problem, but it is really serious for me. I have a cluster of 3 nodes, with replication factor 2. But I still cannot run a simple "select count(*) from ...", neither using DevCenter nor cqlsh. Any idea how this can be done? best, /Shahab
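A full-table count(*) has to touch every partition before the coordinator's timeout fires, which is why it fails once the table grows. A common workaround (not from this thread) is to split the ring into token sub-ranges and count each range with a separate, cheap query. The sketch below only does the range arithmetic for the Murmur3 partitioner; the actual driver call is left as a comment, and the table/column names in it are illustrative:

```python
# Sketch: split the full Murmur3 token range into N contiguous sub-ranges,
# so count(*) can be issued once per range. Each per-range query touches
# only a slice of the data and stays under the coordinator timeout.

MIN_TOKEN = -2**63          # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1

def token_ranges(n):
    """Yield n contiguous (start, end] token ranges covering the whole ring."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    start = MIN_TOKEN
    for i in range(n):
        end = MAX_TOKEN if i == n - 1 else start + span
        yield (start, end)
        start = end

total = 0
for start, end in token_ranges(100):
    # Hypothetical driver call -- adapt to your client library:
    # total += session.execute(
    #     "SELECT count(*) FROM mykeyspace.mytable "
    #     "WHERE token(id) > %s AND token(id) <= %s", (start, end)).one()[0]
    pass
```

Newer cqlsh versions also accept a larger client-side timeout (a `--request-timeout` option), but that only helps if the server-side timeouts are raised too; splitting the work is the more robust approach.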
How to export query results (millions of rows) in CSV format?
Hi, Is there any way to export the results of a query (e.g. select * from tbl1 where id = aa and loc = bb) into a file in CSV format? I tried to use the COPY command with cqlsh, but the command does not work when you have a WHERE condition. Does anyone have any idea how to do this? best, /Shahab
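cqlsh's COPY TO only exports whole tables, so for a filtered query the usual route is to page through the result set with a driver and write rows out with a CSV library. A minimal sketch: the helper below is plain Python, while the commented driver call assumes the DataStax Python driver (where iterating a result set fetches pages automatically), with illustrative column names:

```python
import csv

def rows_to_csv(rows, header, path):
    """Stream an iterable of row tuples into a CSV file, one row at a time."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        for row in rows:
            w.writerow(row)

# With a driver session (hypothetical), millions of rows stream through
# in pages rather than being held in memory all at once:
# rows = session.execute(
#     "SELECT * FROM tbl1 WHERE id = %s AND loc = %s", ("aa", "bb"))
# rows_to_csv(rows, ["id", "loc", "value"], "out.csv")

# Demo with in-memory rows standing in for a result set:
rows_to_csv([("aa", "bb", 1), ("aa", "bb", 2)], ["id", "loc", "value"], "out.csv")
```

Because the writer consumes the rows lazily, memory use stays flat no matter how many rows the query returns.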
How to measure disk space used by a keyspace?
Hi, Probably this question has already been asked on the mailing list, but I couldn't find it. The question is how to measure the disk space used by a keyspace, per column family, excluding snapshots? best, /Shahab
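One answer is `nodetool cfstats`, whose "Space used (live)" figure excludes snapshots. If you'd rather measure on disk directly, the sketch below walks the keyspace's data directory and skips `snapshots` subdirectories; the `/var/lib/cassandra/data` path is an assumption (check `data_file_directories` in cassandra.yaml):

```python
import os

def keyspace_size_bytes(data_dir, keyspace):
    """Sum on-disk bytes of a keyspace's data files, skipping snapshots.

    data_dir is Cassandra's data directory, e.g. /var/lib/cassandra/data
    (an assumption -- check data_file_directories in cassandra.yaml).
    """
    total = 0
    for root, dirs, files in os.walk(os.path.join(data_dir, keyspace)):
        dirs[:] = [d for d in dirs if d != "snapshots"]  # prune snapshot dirs
        total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total
```

Running it per column-family subdirectory instead of per keyspace gives the per-CF breakdown the question asks for.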
Re: How to store denormalized data
Suggestion, or rather food for thought: Do you expect to read/analyze the written data right away? Or will it be a batch process, kicked off later in time? What I am trying to say is that if the 'read/analysis' part is a) a batch process and b) kicked off later in time, then #3 is a fine solution. What's the harm in it? Also, you could slightly change it (if applicable) and not populate it as a separate batch process but in fact make it part of your analysis job, as a kind of pre-process/prep step. Regards, Shahab

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com wrote: Hi all, I am trying to store some data (user actions in our application) for future analysis (probably using Spark). I understand best practice is to store it in denormalized form, and this will definitely make some of our future queries much easier. But I have a problem with denormalizing the data. For example, let's say one of my queries is "the number of reports generated by user type". In the part of the application that the user connects to to generate reports, we only have access to the user id. In a traditional RDBMS this is fine, because at query time you join the user id onto the users table and get all the user data associated with that user. But how do I populate extra fields like user type on the fly? My ideas so far:

1. I try to maintain an in-memory cache of data such as "user", and do a lookup in this cache for every user action and store the user data with it. PROS: fast. CONS: not scalable, will run out of memory if data sets grow.
2. For each user action, I do a call to the RDBMS and look up the data for the user in question, then store the user action plus the user data as a single row. PROS: easy to scale. CONS: slow.
3. I write only the user id and the action straight away, and have a separate batch process that periodically goes through my table looking for rows without user data, looks up the user data from the RDBMS, and populates it.

None of these solutions seem ideal to me.
Does Cassandra have something like ‘triggers’, where I can set up a table to automatically populate some rows based on a lookup from another table? Or perhaps Spark or some other library has built-in functionality that solves exactly this problem? Any suggestions much appreciated. Thanks, Matthew
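A middle ground between options 1 and 2 above (not suggested in the thread itself) is a bounded LRU cache in front of the RDBMS lookup: memory stays capped, and most lookups never hit the database. A sketch, where `fetch_user_from_rdbms` is a hypothetical stand-in for the real query:

```python
from functools import lru_cache

def fetch_user_from_rdbms(user_id):
    """Stand-in for the real RDBMS query (hypothetical)."""
    return {"id": user_id, "type": "analyst"}

@lru_cache(maxsize=100_000)  # caps memory: least-recently-used entries evicted
def user_type(user_id):
    """Look up the user's type, hitting the RDBMS only on cache misses."""
    return fetch_user_from_rdbms(user_id)["type"]

# Each user action can then be denormalized at write time:
# row = (user_id, action, user_type(user_id))
```

The `maxsize` bound addresses the "will run out of memory" objection to option 1, while hot users avoid the per-action RDBMS round-trip that makes option 2 slow.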
Re: Data model suggestions
Interesting approach Oded. Is this something similar to what has been described here: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html Regards, Shahab

On Sun, Apr 26, 2015 at 4:29 AM, Peer, Oded oded.p...@rsa.com wrote: I would maintain two tables. An “archive” table that holds all the active and inactive records, and is updated hourly (re-inserting the same record has some compaction overhead, but on the other hand deleting records has tombstone overhead). An “active” table which holds all the records in the last external API invocation. To avoid tombstones and read-before-delete issues, “active” should actually be a synonym, an alias, for the most recent active table. I suggest you create two identical tables, “active1” and “active2”, and an “active_alias” table that tells you which of the two is the most recent. Thus when you query the external API you insert the data into “archive” and into the unaliased “activeN” table, switch the alias value in “active_alias”, and truncate the newly unaliased “activeM” table. No need to query the data before inserting it. Make sure truncating doesn't create automatic snapshots.

From: Narendra Sharma [mailto:narendra.sha...@gmail.com] Sent: Friday, April 24, 2015 6:53 AM To: user@cassandra.apache.org Subject: Re: Data model suggestions I think one table, say record, should be good. The primary key is record id. This will ensure good distribution. Just update the active attribute to true or false. For range queries on active vs archive records maintain 2 indexes or try a secondary index.

On Apr 23, 2015 1:32 PM, Ali Akhtar ali.rac...@gmail.com wrote: Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will never usually exceed 500k, the ids may just be cached in memory. Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue, one row at a time.
The question is just whether to keep all records in one table (including archived records which won't be queried 99% of the time), or to keep active records in their own table and delete them when they're no longer active. Will that produce tombstone issues?

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar khangaon...@gmail.com wrote: Hi, If your external API returns active records, that means I am guessing you need to do a select * on the active table to figure out which records in the table are no longer active. You might be aware that range selects based on the partition key will time out in cassandra. They can however be made to work using the column cluster key. To comment more, we would need to see your proposed cassandra tables and the queries that you might need to run. regards

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar ali.rac...@gmail.com wrote: That's returned by the external API we're querying. We query them for active records; if a previously active record isn't included in the results, that means it's time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar khangaon...@gmail.com wrote: Hi, How do you determine if the record is no longer active? Is it a periodic process that goes through every record and checks when the last update happened? regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar ali.rac...@gmail.com wrote: Hey all, We are working on moving a mysql based application to Cassandra. The workflow in mysql is this: We have two tables: active and archive. Every hour, we pull in data from an external API. The records which are active are kept in the 'active' table. Once a record is no longer active, it's deleted from 'active' and re-inserted into 'archive'. The purpose of that is that most of the time queries are only done against the active records rather than the archived ones. Therefore keeping the active table small may help with faster queries, if it only has to search 200k records vs 3 million or more.
Is it advisable to keep the same data model in Cassandra? I'm concerned about tombstone issues when records are deleted from active. Thanks. -- http://khangaonkar.blogspot.com/
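Oded's alias-switching cycle above is mostly bookkeeping, and it may help to see it spelled out. A minimal sketch: the table names come from his description, but the CQL steps are placeholder strings since the actual columns aren't given in the thread:

```python
def refresh_cycle(current_alias):
    """One hourly refresh under the active1/active2 aliasing scheme.

    Returns the CQL steps (as placeholder strings) and the new alias.
    Columns are elided because the thread doesn't specify them.
    """
    stale = "active2" if current_alias == "active1" else "active1"
    steps = [
        "INSERT INTO archive (...) VALUES (...)",    # every fetched record
        f"INSERT INTO {stale} (...) VALUES (...)",   # into the unaliased table
        f"UPDATE active_alias SET name = '{stale}' WHERE key = 0",  # flip alias
        f"TRUNCATE {current_alias}",  # old active table: no tombstones created
    ]
    return steps, stale

steps, new_alias = refresh_cycle("active1")
```

The point of the final TRUNCATE (rather than per-row DELETEs) is exactly the tombstone concern raised above: truncation drops SSTables wholesale instead of writing a tombstone per record.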
Getting ParNew GC in ... CMS Old Gen ... in logs
Hi, I keep getting the following line in the Cassandra logs, apparently something related to garbage collection. And I guess this is one of the reasons why I do not get any response (I get a time-out) when I query a large volume of data: ParNew GC in 248ms. CMS Old Gen: 453244264 -> 570471312; Par Eden Space: 167712624 -> 0; Par Survivor Space: 0 -> 20970080 Is the above line an indication of something that needs to be fixed in the system? How can I resolve this? best, /Shahab
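Reading the fields: a young-generation (ParNew) pause of 248 ms, followed by the before -> after occupancy of each heap space in bytes. A quick parse of the quoted line shows what it implies:

```python
import re

LINE = ("ParNew GC in 248ms. CMS Old Gen: 453244264 -> 570471312; "
        "Par Eden Space: 167712624 -> 0; Par Survivor Space: 0 -> 20970080")

pause_ms = int(re.match(r"ParNew GC in (\d+)ms", LINE).group(1))

# Before/after occupancy (bytes) per heap space.
spaces = {name.strip(): (int(a), int(b))
          for name, a, b in re.findall(r"([\w ]+): (\d+) -> (\d+)", LINE)}

# Old gen grew by ~112 MB in a single young collection: objects are being
# promoted quickly, which on a loaded node often precedes long CMS pauses
# (and hence request timeouts) once the old generation fills up.
old_growth = spaces["CMS Old Gen"][1] - spaces["CMS Old Gen"][0]
```

A single 248 ms pause is not alarming by itself; what matters is the trend (frequent pauses, or a steadily growing old gen) which points at heap pressure from the workload, e.g. the large reads described above.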
Re: best supported spark connector for Cassandra
I am using the Calliope cassandra-spark connector ( http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0, hopefully a new version comes soon. best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig down their code to understand the logic.

On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better hadoop and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat.
I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I have also found python Cassandra support in spark's repo, but it seems still experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found stratio deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra as well; I am still a little confused about it. Question: which driver should I use if I want to use Java? And which if I want to use python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder whether it wouldn't pay off to use the java driver instead. Thanks in advance
Why is the RDD not cached?
Hi, I have a standalone Spark setup, where the executor is set to have 6.3 GB of memory. As I am using two workers, in total there is 12.6 GB of memory and 4 cores. I am trying to cache an RDD with an approximate size of 3.2 GB, but apparently it is not cached, as I can neither see "BlockManagerMasterActor: Added rdd_XX in memory" nor is the performance of running the tasks improved. But why is it not cached when there is enough memory storage? I tried with smaller RDDs, 1 or 2 GB, and it works; at least I could see "BlockManagerMasterActor: Added rdd_0_1 in memory" and an improvement in results. Any idea what I am missing in my settings, or...? thanks, /Shahab
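One likely culprit (an assumption, since the thread gives no answer): in Spark 1.x only a fraction of executor memory is available to the block store, governed by spark.storage.memoryFraction (default 0.6) times a safety fraction (default 0.9), and partitions that don't fit are silently dropped rather than cached. The arithmetic for the setup above:

```python
def storage_capacity_gb(executor_mem_gb,
                        memory_fraction=0.6,   # spark.storage.memoryFraction
                        safety_fraction=0.9):  # spark.storage.safetyFraction
    """Approximate MemoryStore capacity per executor (Spark 1.x defaults)."""
    return executor_mem_gb * memory_fraction * safety_fraction

per_executor = storage_capacity_gb(6.3)  # ~3.4 GB of cache space per executor
# A 3.2 GB RDD split across two executors needs ~1.6 GB each on average, so
# the raw size should fit -- but the deserialized in-memory footprint of Java
# objects can be several times the on-disk size, and a single skewed
# partition larger than the remaining store is dropped without caching.
```

So it's worth checking the executor logs for "not enough space to cache" messages, trying MEMORY_ONLY_SER to shrink the footprint, or raising spark.storage.memoryFraction.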
Re: Increasing size of Batch of prepared statements
Thanks Jens for the comments. As I am trying the cassandra stress tool, does it mean that the tool is executing batches of INSERT statements (probably hundreds, or thousands) against Cassandra (for the sake of stressing Cassandra)? best, /Shahab

On Wed, Oct 22, 2014 at 8:14 PM, Jens Rantil jens.ran...@tink.se wrote: Shabab, Apologies for the late answer. On Mon, Oct 6, 2014 at 2:38 PM, shahab shahab.mok...@gmail.com wrote: But do you mean that inserting columns with large size (let's say a text with 20-30 K) is potentially problematic in Cassandra? AFAIK, the size _warning_ you are getting relates to the size of the batch of prepared statements (INSERT INTO mykeyspace.mytable VALUES (?,?,?,?)). That is, it has nothing to do with the actual content of your row. 20-30 K shouldn't be a problem. But it's considered good practice to split larger values into chunks (of maybe 5 MB) since it makes operations easier on your cluster and more likely to spread evenly across the cluster. What shall I do if I want columns with large size? Just don't insert too many rows in a single batch and you should be fine. Like Shane's JIRA ticket said, the warning is to let you know you are not following best practice when adding too many rows in a single batch. It can create bottlenecks on a single Cassandra node. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary Twitter https://twitter.com/tink
Re: Increasing size of Batch of prepared statements
OK, Thanks again Jens. best, /Shahab

On Thu, Oct 23, 2014 at 1:22 PM, Jens Rantil jens.ran...@tink.se wrote: Hi again Shabab, Yes, it seems that way. I have no experience with the “cassandra stress tool”, but wouldn’t be surprised if the batch size could be tweaked. Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se
Re: Increasing size of Batch of prepared statements
Thanks Tyler for sharing this. It is exactly what I was looking for. best, /Shahab

On Thu, Oct 23, 2014 at 5:37 PM, Tyler Hobbs ty...@datastax.com wrote: CASSANDRA-8091 (Stress tool creates too large batches) is relevant: https://issues.apache.org/jira/browse/CASSANDRA-8091 -- Tyler Hobbs DataStax http://datastax.com/
Re: Increasing size of Batch of prepared statements
Thanks Jens for the comment. Actually I am using the Cassandra Stress Tool, and it is the tool that inserts such large statements. But do you mean that inserting columns with large size (let's say a text with 20-30 K) is potentially problematic in Cassandra? What shall I do if I want columns with large size? best, /Shahab

On Sun, Oct 5, 2014 at 6:03 PM, Jens Rantil jens.ran...@tink.se wrote: Shabab, If you are hitting this limit because you are inserting a lot of (CQL) rows in a single batch, I suggest you split the statement up into multiple smaller batches. Generally, large inserts like this will not perform very well. Cheers, Jens — Sent from Mailbox https://www.dropbox.com/mailbox

On Fri, Oct 3, 2014 at 6:47 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting the following warning in the cassandra log: BatchStatement.java:258 - Batch of prepared statements for [mydb.mycf] is of size 3272725, exceeding specified threshold of 5120 by 3267605. Apparently it relates to the (default) maximum size of a batch of prepared insert statements. Is there any way to change the default value? thanks /Shahab
Re: Increasing size of Batch of prepared statements
Thanks Shane. best, /Shahab

On Fri, Oct 3, 2014 at 6:51 PM, Shane Hansen shanemhan...@gmail.com wrote: It appears to be configurable in cassandra.yaml using batch_size_warn_threshold https://issues.apache.org/jira/browse/CASSANDRA-6487

On Fri, Oct 3, 2014 at 10:47 AM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting the following warning in the cassandra log: BatchStatement.java:258 - Batch of prepared statements for [mydb.mycf] is of size 3272725, exceeding specified threshold of 5120 by 3267605. Apparently it relates to the (default) maximum size of a batch of prepared insert statements. Is there any way to change the default value? thanks /Shahab
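For the record, the setting Shane points at landed in cassandra.yaml as `batch_size_warn_threshold_in_kb`, with a 5 KB default -- which is exactly the 5120 bytes in the warning above. Rather than raising the threshold, it is usually better to keep batches under it. A sketch of client-side chunking, assuming each statement's serialized size is known (the size values here are illustrative):

```python
BATCH_WARN_BYTES = 5 * 1024  # batch_size_warn_threshold_in_kb default (5 KB)

def chunk_by_size(statements, limit=BATCH_WARN_BYTES):
    """Group (statement, size_in_bytes) pairs into batches under `limit`.

    A single oversized statement still gets its own batch; the server will
    then warn on that one, which is the signal to shrink the row itself.
    """
    batch, batch_size = [], 0
    for stmt, size in statements:
        if batch and batch_size + size > limit:
            yield batch
            batch, batch_size = [], 0
        batch.append(stmt)
        batch_size += size
    if batch:
        yield batch
```

This keeps each executed batch below the warning threshold, which also avoids the coordinator bottleneck Jens describes elsewhere in the thread.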
Why are the results of the Cassandra Stress Tool much worse than normal reading/writing from Cassandra?
Hi, I know that this question might look silly, but I really need to know how the cassandra stress tool works. I developed my data model and used the Cassandra-Stress tool with the 'user' option, where you pass your own data model for the column family (table in CQL) and the distribution of each column in the column family. Then I ran it, but when I compare its latency with a simple measurement of the time it took to execute a read/write operation for 100 cases, I can see that the results of the Stress Tool are much higher (100 times) than what I obtained in my own code. I know that a stress test should stress the system and thus higher figures (e.g. read/write latency) are reasonable, but 100 times more? I am sure I am missing something in interpreting the results of the Stress Tool (mainly the concept of latency), but the documentation there is very sparse and vague. I would appreciate it if anyone could help me understand the output of the Stress Tool. BTW, I have already seen this one (but the documentation is still quite poor): http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCStressOutput_c.html best, /Shahab
Increasing size of Batch of prepared statements
Hi, I am getting the following warning in the cassandra log: BatchStatement.java:258 - Batch of prepared statements for [mydb.mycf] is of size 3272725, exceeding specified threshold of 5120 by 3267605. Apparently it relates to the (default) maximum size of a batch of prepared insert statements. Is there any way to change the default value? thanks /Shahab
Regarding Cassandra-Stress tool
Hi, I am trying to benchmark our custom schema in Cassandra and I managed to run it. However, there are a couple of settings and issues for which I couldn't find any solution/explanation. I appreciate any comments. 1- The default number of warm-up iterations in the stress tool is about 5. I would like to reduce this number (due to my storage space limitations), but I couldn't find an input parameter to do this. I just wonder if this setting is possible? 2- I did not understand well what the output of the cassandra stress tool means. I read this http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCStressOutput_c.html, but, for example, what does latency mean here? Does it mean how long a read/write operation is delayed until it is executed? In that case, what is the measure for the actual read/write operation? It seems that the documentation is outdated; there is an output parameter partition_rate which is not explained in the documentation. best, /Shahab
Re: using dynamic cell names in CQL 3
Thanks. It seems that I was not clear in my question: I would like to store values in the column name. For example, the column name would be the event name (e.g. temperature) and the column value would be the respective reading (e.g. 40.5). And I need to know how the schema should look in CQL 3. best, /Shahab

On Wed, Sep 24, 2014 at 1:49 PM, DuyHai Doan doanduy...@gmail.com wrote: Dynamic thing in Thrift ≈ clustering columns in CQL. Can you give more details about your data model?

On Wed, Sep 24, 2014 at 1:11 PM, shahab shahab.mok...@gmail.com wrote: Hi, I would like to define a schema for a table where the column (cell) names are defined dynamically. Apparently there is a way to do this in Thrift ( http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows ) but I couldn't find how I can do the same using CQL. Any resource/example that I can look at? best, /Shahab
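The clustering-column equivalent DuyHai points at would look something like this in CQL 3 (table and column names are illustrative, not from the thread): the "dynamic column name" becomes a clustering column, and each Thrift cell becomes a CQL row within the partition.

```sql
CREATE TABLE sensor_readings (
    sensor_id  text,
    event_name text,      -- the "dynamic column name", e.g. 'temperature'
    value      double,    -- the cell value, e.g. 40.5
    PRIMARY KEY (sensor_id, event_name)
);

INSERT INTO sensor_readings (sensor_id, event_name, value)
VALUES ('sensor-1', 'temperature', 40.5);

-- One partition holds arbitrarily many "dynamic columns":
SELECT event_name, value FROM sensor_readings WHERE sensor_id = 'sensor-1';
```

Under the hood this gives the same wide-row layout as the Thrift dynamic-column pattern, with event_name values stored sorted within the partition.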
using dynamic cell names in CQL 3
Hi, I would like to define a schema for a table where the column (cell) names are defined dynamically. Apparently there is a way to do this in Thrift ( http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows ) but I couldn't find how I can do the same using CQL. Any resource/example that I can look at? best, /Shahab
Re: Machine Learning With Cassandra
Spark is not storage; rather it is a streaming framework supposed to be run on big data, on a distributed architecture (a very high-level intro/definition). It provides a batched version of in-memory map/reduce-like jobs. It is not completely streaming like Storm, but rather batches collections of tuples, and thus you can run complex ML algorithms relatively fast. I think we just discussed this a short while ago when a similar question (storm vs. spark, I think) was raised by you earlier. Here is the link for that discussion: http://markmail.org/message/lc4icuw4hobul6oh Regards, Shahab

On Sat, Aug 30, 2014 at 12:16 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Isn’t it a bit overkill to use Storm and Spark in the architecture? You say load it “into” Spark. Is Spark separate storage? B.

From: Alex Kamil alex.ka...@gmail.com Sent: Friday, August 29, 2014 10:46 PM To: user@cassandra.apache.org Subject: Re: Machine Learning With Cassandra Adaryl, most ML algorithms are based on some form of numerical optimization, using something like online gradient descent http://en.wikipedia.org/wiki/Stochastic_gradient_descent or conjugate gradient http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html (e.g. in SVM classifiers). In its simplest form it is a nested FOR loop where on each iteration you update the weights or parameters of the model until reaching some convergence threshold that minimizes the prediction error (usually the goal is to minimize a Loss function http://en.wikipedia.org/wiki/Loss_function, as in the popular least squares http://en.wikipedia.org/wiki/Least_squares technique). You could parallelize this loop using a brute force divide-and-conquer approach, mapping a chunk of data to each node and computing a partial sum there, then aggregating the results from each node into a global sum in a 'reduce' stage, and repeating this map-reduce cycle until convergence.
You can look up distributed gradient descent http://scholar.google.com/scholar?hl=enq=gradient+descent+with+map-reduc or check out Mahout https://mahout.apache.org/users/recommender/matrix-factorization.html or Spark MLlib https://spark.apache.org/docs/latest/mllib-guide.html for examples. Alternatively you can use something like GraphLab http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html . Cassandra can serve as a data store from which you load the training data, e.g. into Spark using this connector https://github.com/datastax/spark-cassandra-connector and then train the model using MLlib or Mahout (it has Spark bindings, I believe). Once you have trained the model, you could save the parameters back into Cassandra. Then the next stage is using the model to classify new data, e.g. recommend similar items based on a log of new purchases; there you could once again use Spark or Storm with something like this https://github.com/pmerienne/trident-ml. Alex

On Fri, Aug 29, 2014 at 10:24 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: I’m planning to speak at a local meet-up and I need to know if what I have in my head is even possible. I want to give an example of working with data in Cassandra. I have data coming in through Kafka and Storm and I’m saving it off to Cassandra (this is only on paper at this point). I then want to run an ML algorithm over the data. My problem here is, while my data is distributed, I don’t know how to do the analysis in a distributed manner. I could certainly use R, but processing the data on a single machine would seem to defeat the purpose of all this scalability. What is my solution? B.
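Alex's nested-loop description can be boiled down to a few lines: compute a partial gradient per data chunk (the "map" step), sum the partials (the "reduce" step), update the weights, and repeat until convergence. A toy least-squares sketch for y ≈ w·x, with two in-memory "chunks" standing in for data partitions on different nodes:

```python
def partial_gradient(chunk, w):
    """Map step: gradient of the squared error over one data chunk."""
    return sum(2 * (w * x - y) * x for x, y in chunk)

def distributed_gd(chunks, w=0.0, lr=0.01, iters=200):
    """Toy divide-and-conquer gradient descent fitting y ~ w * x."""
    n = sum(len(c) for c in chunks)
    for _ in range(iters):
        grad = sum(partial_gradient(c, w) for c in chunks)  # reduce step
        w -= lr * grad / n                                  # weight update
    return w

# Two "nodes", each holding a chunk of (x, y) pairs with true slope w = 2:
chunks = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
w = distributed_gd(chunks)  # converges toward 2.0
```

In Spark the inner `sum` over chunks would be an RDD `map`/`reduce` pair over partitions loaded from Cassandra; this sketch keeps only the arithmetic skeleton of that cycle.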
Re: Why select count(*) from .. hangs ?
Thanks for the hints. I got a better picture of how to deal with count queries.

On Tue, Mar 25, 2014 at 7:01 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Mar 25, 2014 at 8:36 AM, shahab shahab.mok...@gmail.com wrote: But after iteration 8, (i.e. inserting 150 sensor data), the select count(*) ... throws a time-out exception and doesn't work anymore. I even tried to execute select count(*) ... using the Datastax DevCenter GUI, but I got the same result. All operations in Cassandra are subject to (various) timeouts, which by default are on the scale of single-digit seconds. If you attempt to do an operation (such as aggregates across large numbers of large objects) which cannot complete in this time, this is a strong indication that either your overall approach is inappropriate or, at the very least, that your buckets are too large. =Rob
Why select count(*) from .. hangs ?
Hi, I am quite new to Cassandra and trying to evaluate its feasibility for our application. In our application, we need to insert roughly 30 sensor readings every 30 seconds (basically we need to store time-series data). I wrote a simple java code to insert 30 random readings every 30 seconds for 10 iterations, and measured the number of entries in the table after each insertion. But after iteration 8, (i.e. inserting 150 sensor data), the select count(*) ... throws a time-out exception and doesn't work anymore. I even tried to execute select count(*) ... using the Datastax DevCenter GUI, but I got the same result. I am sure that I have missed something or misunderstood how Cassandra works, but I don't really know what. I do appreciate any hints. best, /Shahab
Re: Why select count(*) from .. hangs ?
Thanks. I run it on a Linux server, dual processor Intel(R) Xeon(R) CPU E5440 @ 2.83GHz, 4 cores each, and 8 GB RAM. Just to give an example of the data inserted: INSERT INTO traffic_by_day (segment_id, day, event_time, traffic_value) VALUES (100, 84, '2013-04-03 07:02:00', 79); Here is the schema:

CREATE TABLE traffic_by_day (
    segment_id int,
    day int,
    event_time timestamp,
    traffic_value int,
    PRIMARY KEY ((segment_id, day), event_time)
) WITH bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND index_interval=128 AND read_repair_chance=0.10 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='99.0PERCENTILE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'};

On Tue, Mar 25, 2014 at 4:58 PM, Michael Shuler mich...@pbandjelly.org wrote: On 03/25/2014 10:36 AM, shahab wrote: In our application, we need to insert roughly 30 sensor readings every 30 seconds (basically we need to store time-series data). I wrote a simple java code to insert 30 random readings every 30 seconds for 10 iterations, and measured the number of entries in the table after each insertion. But after iteration 8, (i.e. inserting 150 sensor data), the select count(*) ... throws a time-out exception and doesn't work anymore. I even tried to execute select count(*) ... using the Datastax DevCenter GUI, but I got the same result. If you could post your schema, folks may be able to help a bit better. Your C* version couldn't hurt. cqlsh> DESC KEYSPACE $your_ks; -- Kind regards, Michael
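This schema already has the bucketing Rob recommends: with the (segment_id, day) composite partition key, queries and counts that name a single bucket touch one partition and finish well inside the timeout, e.g.:

```sql
-- Count within one day-bucket of one segment (single partition):
SELECT count(*) FROM traffic_by_day
WHERE segment_id = 100 AND day = 84;

-- Range-scan the clustering column within the same bucket:
SELECT event_time, traffic_value FROM traffic_by_day
WHERE segment_id = 100 AND day = 84
  AND event_time >= '2013-04-03 07:00:00'
  AND event_time <  '2013-04-03 08:00:00';
```

It is the unrestricted `SELECT count(*)` over the whole table that scans every partition on every node and times out; a whole-table count has to be assembled from per-bucket counts like the first query above.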
Re: Modeling multi-tenanted Cassandra schema
Nate, (slightly OT) what client API/library is recommended now that Hector is sunsetting? Thanks. Regards, Shahab

On Wed, Nov 13, 2013 at 9:28 AM, Nate McCall n...@thelastpickle.com wrote: You basically want option (c). Option (d) might work, but you would be bending the paradigm a bit, IMO. Certainly do not use dedicated column families or keyspaces per tenant. That never works. The list history will show that with a few google searches, and we've seen it fail badly with several clients. Overall, option (c) would be difficult to do in CQL without some very well-thought-out abstractions and/or a deep hack on the Java driver (not inelegant or impossible, just lots of moving parts to get your head around if you are new to such). That said, depending on the size of your project and the skill of your team, this direction might be worth considering. Usergrid (just accepted for incubation at Apache) functions this way via the Thrift API: https://github.com/apigee/usergrid-stack The commercial version of Usergrid has tens of thousands of active tenants on a single cluster (same code base at the service layer as the open source version). It uses Hector's built-in virtual keyspaces: https://github.com/hector-client/hector/wiki/Virtual-Keyspaces (NOTE: though Hector is sunsetting/in patch maintenance, the approach is certainly legitimate - but I'd recommend you *not* start a new project on Hector). In short, Usergrid is the only project I know of that has a well-proven tenant model that functions at scale, though I'm sure there are others around, just not open sourced or actually running large deployments. Astyanax can do this as well, albeit with a little more work required: https://github.com/Netflix/astyanax/wiki/Composite-columns#how-to-use-the-prefixedserializer-but-you-really-should-use-composite-columns Happy to clarify any of the above.
On Tue, Nov 12, 2013 at 3:19 AM, Ben Hood 0x6e6...@gmail.com wrote: Hi, I've just received a requirement to make a Cassandra app multi-tenanted, where we'll have up to 100 tenants. Most of the tables are timestamped wide row tables with a natural application key for the partitioning key and a timestamp key as a cluster key. So I was considering the options: (a) Add a tenant column to each table and stick a secondary index on that column; (b) Add a tenant column to each table and maintain index tables that use the tenant id as a partitioning key; (c) Decompose the partitioning key of each table and add the tenant as the leading component of the key; (d) Add the tenant as a separate clustering key; (e) Replicate the schema in separate tenant-specific keyspaces; (f) Something I may have missed; Option (a) seems the easiest, but I'm wary of just adding secondary indexes without thinking about it. Option (b) seems to have the least impact on the layout of the storage, but at a cost of maintaining each index table, both code-wise and in terms of performance. Option (c) seems quite straightforward, but I feel it might have a significant effect on the distribution of the rows if the cardinality of the tenants is low. Option (d) seems simple enough, but it would mean that you couldn't query for a range of tenants without supplying a range of natural application keys, through which you would need to iterate (under the assumption that you don't use an ordered partitioner). Option (e) appears relatively straightforward, but it does mean that the application CQL client needs to maintain separate cluster connections for each tenant. Also I'm not sure to what extent keyspaces were designed to partition identically structured data. Does anybody have any experience with running a multi-tenanted Cassandra app, or does this just depend too much on the specifics of the application? Cheers, Ben -- - Nate McCall Austin, TX @zznate Co-Founder Sr. 
Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com
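To make option (c) concrete, here is a minimal sketch of what folding the tenant into the leading component of the partition key could look like (table and column names are illustrative only, not from Ben's actual schema):

```sql
-- Option (c) sketched: the tenant id becomes the leading component
-- of the composite partition key, so each tenant's data hashes to
-- its own set of partitions.
CREATE TABLE events_by_tenant (
    tenant_id text,
    natural_key text,
    event_time timestamp,
    payload blob,
    PRIMARY KEY ((tenant_id, natural_key), event_time)
);

-- Every query is then scoped to a single tenant:
SELECT * FROM events_by_tenant
 WHERE tenant_id = 'acme' AND natural_key = 'sensor-7'
   AND event_time > '2013-11-01';
```

As Ben notes, distribution only suffers if tenant cardinality dominates the key; here the natural key remains part of the partition key, so rows still spread across the ring.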
Re: Deleting data using timestamp
I might be missing something obvious here, but can't you afford (time-wise) to run cleanup or repair after the deletion so that the deleted data is gone? Assuming that your columns are time-based data? Regards, Shahab On Wed, Oct 9, 2013 at 10:35 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: We have wide rows accumulated in a Cassandra CF and have now changed our app-side logic. The application now only wants the first 7 days of data from this CF. What is the quick way to delete old data and at the same time make sure reads don't churn through all the deleted columns? Let's say I do the following for (each key in CF) drop key, with timestamp=(System.currentTimeMillis-7days) What should I do in my read to make sure that deleted columns don't get examined? I saw some advice on using max-timestamp per SSTable during the read. Can someone explain if that will solve my read problem here? -- Ravi
Re: Deleting data using timestamp
Ahh, yes, 'compaction'. I blanked out while mentioning repair and cleanup. That is in fact what needs to be done first and what I meant. Thanks Robert. Regards, Shahab On Wed, Oct 9, 2013 at 1:50 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Oct 9, 2013 at 7:35 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: What is the quick way to delete old-data and at the same time make sure read [doesn't] churn through all deleted columns? Use a database that isn't log structured? But seriously, in 2.0 there's this : https://issues.apache.org/jira/browse/CASSANDRA-5514 Which allows for timestamp hints at query time. And... https://issues.apache.org/jira/browse/CASSANDRA-5228 Which does compaction expiration of entire SSTables based on TTL. =Rob
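For reference, the TTL-based route that CASSANDRA-5228 leverages only works if the data was written with a TTL in the first place, so expired columns can be dropped wholesale at compaction rather than needing explicit deletes. A minimal CQL sketch (the table and column names are made up for illustration):

```sql
-- Illustrative only: write each column with a 7-day TTL up front.
-- Once expired, whole SSTables of such data can be dropped at
-- compaction time instead of being tombstoned one column at a time.
INSERT INTO events (key, col, value)
VALUES ('some-key', 'some-col', 'some-value')
USING TTL 604800;  -- 7 days, in seconds
```

This doesn't help data already written without a TTL, which is why the compaction/timestamp-hint discussion above still applies to existing rows.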
Re: get float column in cassandra mapreduce
A couple of things I can think of; others might have better ideas. 1- The exception is about an encoding mismatch. Do you know what your source file's encoding is and what your system's default is? E.g. it can be ISO8859-1 on Windows, UTF-8 on Linux, etc., and your file may have something else. You can explicitly use UTF-8 everywhere if you want. There is a wealth of information available on the net if you google it. 2- This is more of an aside: you are parsing your data to float and String without checking which column it is. Basically you do the following two conversions in all cases, no matter what the column, so what will happen if the column is 'date' and the toDouble statement is called?: String value1 = ByteBufferUtil.string(column.getValue()); double value2 = ByteBufferUtil.toDouble(column.getValue()); Did it ever work? 3- Is the column name in your source data files 'temperature' or 'temprature'? You are using the latter in your code, and if it is not what is in the data then you might be trying to parse an empty or malformed string. Regards, Shahab On Sat, Oct 5, 2013 at 5:16 AM, Anseh Danesh anseh.dan...@gmail.com wrote: Hi all... I have a question. In the Cassandra wordcount mapreduce with CQL3, I want to get a string column and a float (or double) column as map input key and value. I mean I want to get the date column of type string as key and the temprature column of type float as value. 
but when I println the value of temprature it shows some of them and then errors. Here is the code: package org.apache.cassandra.com; import java.io.IOException; import java.nio.ByteBuffer; import java.util.*; import java.util.Map.Entry; import org.apache.cassandra.hadoop.cql3.CqlConfigHelper; import org.apache.cassandra.hadoop.cql3.CqlOutputFormat; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat; import org.apache.cassandra.hadoop.ConfigHelper; import org.apache.cassandra.utils.ByteBufferUtil; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import java.nio.charset.CharacterCodingException; public class dewpoint extends Configured implements Tool { private static final Logger logger = LoggerFactory.getLogger(dewpoint.class); static final String KEYSPACE = "weather"; static final String COLUMN_FAMILY = "momentinfo"; static final String OUTPUT_REDUCER_VAR = "output_reducer"; static final String OUTPUT_COLUMN_FAMILY = "output_words"; private static final String OUTPUT_PATH_PREFIX = "/tmp/dewpointt"; private static final String PRIMARY_KEY = "row_key"; public static void main(String[] args) throws Exception { // Let ToolRunner handle generic command-line options ToolRunner.run(new Configuration(), new dewpoint(), args); System.exit(0); } public static class TokenizerMapper extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text date = new Text(); public void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer>
columns, Context context) throws IOException, InterruptedException { for (Entry<String, ByteBuffer> column : columns.entrySet()) { if (!"date".equalsIgnoreCase(column.getKey()) && !"temprature".equalsIgnoreCase(column.getKey())) continue; String value1 = ByteBufferUtil.string(column.getValue()); double value2 = ByteBufferUtil.toDouble(column.getValue()); System.out.println(value2); . and here is the error: 13/10/05 12:36:22 INFO com.dewpoint: output reducer type: filesystem 13/10/05 12:36:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library 13/10/05 12:36:24 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 13/10/05 12:36:26 INFO mapred.JobClient: Running job: job_local1875596001_0001 13/10/05 12:36:27 INFO mapred.LocalJobRunner: Waiting for map tasks 13/10/05 12:36:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1875596001_0001_m_00_0 13/10/05 12:36:27 INFO util.ProcessTree: setsid exited with exit code 0 13/10/05 12:36:27 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e2670b 13/10/05 12:36:27 INFO mapred.MapTask: Processing split: ColumnFamilySplit((5366152502320075885, '9070993788622720120] @[localhost
Re: Deleting Row Key
Yes you can: http://hbase.apache.org/book/regions.arch.html#compaction http://hbase.apache.org/book/important_configurations.html (Managed Compaction section) Regards, Shahab On Sat, Oct 5, 2013 at 6:02 PM, Sebastian Schmidt isib...@gmail.com wrote: On 06.10.2013 00:00, Cem Cayiroglu wrote: It will be deleted after a compaction. Sent from my iPhone On 05 Oct 2013, at 07:29, Sebastian Schmidt isib...@gmail.com wrote: Hi, per default, the key of a row is not deleted if all columns were deleted. I tried to figure out why, but I didn't find an answer, except that it is 'intended'. Why is that intended? And is there a possibility to manually delete the row key? Regards Sebastian Okay, thanks. Can I manually start the compaction process?
Re: Deleting Row Key
Sorry, I replied to the wrong list with HBase info. Here is Cassandra's link about invoking compaction manually through nodetool: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/tools/toolsNodetool_r.html?pagename=docsversion=1.2file=references/nodetool Regards, Shahab On Sat, Oct 5, 2013 at 7:06 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Yes you can: http://hbase.apache.org/book/regions.arch.html#compaction http://hbase.apache.org/book/important_configurations.html (Managed Compaction section) Regards, Shahab On Sat, Oct 5, 2013 at 6:02 PM, Sebastian Schmidt isib...@gmail.com wrote: On 06.10.2013 00:00, Cem Cayiroglu wrote: It will be deleted after a compaction. Sent from my iPhone On 05 Oct 2013, at 07:29, Sebastian Schmidt isib...@gmail.com wrote: Hi, per default, the key of a row is not deleted if all columns were deleted. I tried to figure out why, but I didn't find an answer, except that it is 'intended'. Why is that intended? And is there a possibility to manually delete the row key? Regards Sebastian Okay, thanks. Can I manually start the compaction process?
Re: Cassandra nodetool could not resolve '127.0.0.1': unknown host
Have you tried specifying your hostname (not localhost) in cassandra.yaml and starting it? Regards, Shahab On Tue, Sep 17, 2013 at 8:39 AM, pradeep kumar pradeepkuma...@gmail.com wrote: I am very new to Cassandra. Just started exploring. I am running a single-node Cassandra server and facing a problem in seeing the status of Cassandra using the nodetool command. I have a hostname configured on my VM as myMachineIP cass1 in /etc/hosts, and I configured my cassandra_instal_path/conf/cassandra.yaml file with listen_address, rpc_address as localhost and clustername as casscluster (also tried with my hostname, which is cass1, as listen_address/rpc_address). But not sure why I am not able to get status using the nodetool command. $ nodetool Cannot resolve '127.0.0.1': unknown host $ nodetool -host 127.0.0.1 Cannot resolve '127.0.0.1': unknown host $ nodetool -host cass1 Cannot resolve 'cass1': unknown host But I am able to connect to cassandra-cli; console output: Connected to: casscluster on 127.0.0.1/9160 Welcome to Cassandra CLI version 1.2.8 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. My /etc/hosts looks like: 127.0.0.1 localhost.localdomain localhost.localdomain localhost4 localhost4.localdomain4 localhost cass1 ::1 localhost.localdomain localhost.localdomain localhost6 localhost6.localdomain6 localhost cass1 [myMachineIP] cass1 What could be the reason why I am not able to run nodetool? Please help.
Re: questions related to the SSTable file
java8964, basically are you asking what will happen if we put a large amount of data in one column of one row at once? Will this blob of data representing one column of one row (i.e. one cell) be split into multiple SSTables, or in such cases will it always be one extra-large SSTable? I am also interested in knowing the answer. Regards, Shahab On Tue, Sep 17, 2013 at 9:50 AM, java8964 java8964 java8...@hotmail.com wrote: Thanks Dean for the clarification. But if I put hundreds of megabytes of data of one row through one put, what you mean is that Cassandra will put all of it into one SSTable, even if the data is very big, right? Let's assume in this case the Memtables in memory reach their limit because of this change. What I want to know is whether there is a possibility that 2 SSTables would be generated in the above case, and what the boundary is. I understand that if the following changes apply to the same row key as in the above example, an additional SSTable file could be generated. That is clear to me. Yong From: dean.hil...@nrel.gov To: user@cassandra.apache.org Date: Tue, 17 Sep 2013 07:39:48 -0600 Subject: Re: questions related to the SSTable file You have to first understand the rules of 1. SSTables are immutable, so Color-1-Data.db will not be modified and is only deleted once compacted 2. Memtables are flushed when reaching a limit, so if Blue:{hex} is modified, it is done in the in-memory memtable that is eventually flushed 3. Once flushed, it is an SSTable on disk and you have two values for hex, both with timestamps, so we know which one is the current value. When it finally compacts, the old value can go away. 
Dean From: java8964 java8964 java8...@hotmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, September 17, 2013 7:32 AM To: user@cassandra.apache.org Subject: RE: questions related to the SSTable file Hi, Takenori: Thanks for your quick reply. Your explanation makes it clear to me what compaction means, and I also now understand that the same row key can exist in multiple SSTable files. But beyond that, I want to know what happens if one row's data is too large to put in one SSTable file. In your example, the same row exists in multiple SSTable files as it keeps changing and being flushed to disk at runtime. That's fine; in this case, among the 4 SSTable files, no single file contains the whole data of that row, but each one does contain the full picture of an individual unit (I don't know what I should call this unit, but it will be larger than one column, right?). Just in your example, there is no way, at any time, we could have SSTable files like the following, right: - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #}}] - Color-1-Data_1.db: [{Blue: {hex:FF}}] - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}] - Color-3-Data.db: [{Aqua: {hex: #00}}, {Green: {hex2: #32CD32}}, {Blue: {}}] - Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}] I don't see any reason Cassandra will ever do that, but just want to confirm, as your 'no' answer to my 2nd question is confusing. Another question from my original email; even though I may have gotten the answer already from your example, I just want to confirm it. 
Just use your example, let's say after the first 2 steps: - Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #FF}}] - Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}] There is an incremental backup. After that, the following changes come in: - Add a column of (key, column, column_value = Green, hex2, #32CD32) - Add a row of (key, column, column_value = Aqua, hex, #00) - Delete a row of (key = Blue) memtable is flushed => Color-3-Data.db Another incremental backup happens right now. Now in this case, my assumption is that only Color-3-Data.db will be in this backup, right? Even though Color-1-Data.db and Color-2-Data.db contain data of the same row key as Color-3-Data.db, from an incremental backup point of view only Color-3-Data.db will be stored. The reason I asked those questions is that I am thinking of using MapReduce jobs to parse the incremental backup files and rebuild the snapshot on the Hadoop side. Of course, the column families I am doing this for are pure fact data, so there is no delete/update in Cassandra for this kind of data, just appending. But it is still important for me to understand the SSTable files' content. Thanks Yong Date: Tue, 17 Sep 2013 11:12:01 +0900 From: ts...@cloudian.com To: user@cassandra.apache.org Subject: Re: questions related to the SSTable file Hi
Re: questions related to the SSTable file
Thanks Robert for the answer. It makes sense. If that happens, then it means that your design or use case needs some rework ;) Regards, Shahab On Tue, Sep 17, 2013 at 2:37 PM, java8964 java8964 java8...@hotmail.com wrote: Another question: the SSTable files generated in the incremental backup are not really ONLY the incremental delta, right? The SSTable files will include more than the delta. I will use an example to show my question: first, we have this data in SSTable file 1: rowkey(1), columns (maker=honda). Later, we add one column to the same key: rowkey(1), columns (maker=honda, color=blue) The data above is flushed to another SSTable file 2. In this case, it will be part of the incremental backup at this time. But in fact, it will contain both the old data (maker=honda) plus the new change (color=blue). So in fact, incremental backup in Cassandra is just hard-linking all the new SSTable files being generated during the incremental backup period. It could contain any data, not just the data updated/inserted/deleted in this period, correct? Thanks Yong From: dean.hil...@nrel.gov To: user@cassandra.apache.org Date: Tue, 17 Sep 2013 08:11:36 -0600 Subject: Re: questions related to the SSTable file Netflix created file streaming in Astyanax for Cassandra specifically because writing too big a column cell is a bad thing. The limit is really dependent on use case….do you have servers writing 1000's of 200Meg files at the same time….if so, Astyanax streaming may be a better way to go there, where it divides up the file amongst cells and rows. I know the limit of a row size is really your hard disk space, and the column count, if I remember, goes into billions, though realistically I think beyond 10 million it might slow down a bit….all I know is we tested up to 10 million columns with no issues in our use-case. So you mean at this time, I could get 2 SSTable files, both containing column Blue for the same row key, right? 
Yes. In this case, I should be fine, as the value of the Blue column contains the timestamp to help me find out which is the last change, right? Yes. In the MR world, each file COULD be processed by a different Mapper, but both will be sent to the same reducer as both share the same key. If that is the way you are writing it, then yes. Dean From: Shahab Yunus shahab.yu...@gmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, September 17, 2013 7:54 AM To: user@cassandra.apache.org Subject: Re: questions related to the SSTable file
Re: VMs versus Physical machines
I admit I left out details; sorry for that. The thing is that I was looking for guidance at a high level so we can then sort out ourselves what fits our requirements and use-cases (mainly because we are at the stage where they could be molded according to hardware and software limitations/features). So, for example, if it is recommended that 'for heavy reads physical is better', etc. Anyway, just to give you a quick recap: 1- Cassandra 1.2.8 2- A row is a unique userid and can have one or more columns. Every cell is basically a blob of data (using Avro). All information is in this one table. No joins or other access patterns. 3- Writes can be both in bulk (which will of course have less strict performance requirements) or real-time. All writes would be at the per-userid, hence row, level and consist of adding new rows (of course with some column values) or updating specific cells (columns) of an existing row. 4- Reads are per userid, i.e. per row, and 90% of the time random reads for a user, rather than in bulk. 5- Both read and write interfaces are exposed through a REST service as well as a direct Java client API. 6- Reads and writes, as mentioned in 3 and 4, can be for 1 or more columns at a time. Regards, Shahab On Thu, Sep 12, 2013 at 1:51 AM, Aaron Turner synfina...@gmail.com wrote: On Wed, Sep 11, 2013 at 4:40 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Thanks Aaron for the reply. Yes, VMs or the nodes will be in the cloud if we don't go the physical route. Look how Cassandra scales and provides redundancy. But how does it differ for physical machines or VMs (in the cloud)? Or after your first comment, are you saying that there is no difference whether we use physical or VMs (in the cloud)? They're different, but both can and do work... VMs just require more virtual servers than going the physical route. Sorry, but without you providing any actual information about your needs, all you're going to get is generalizations and hand-waving.
VMs versus Physical machines
Hello, We are deciding whether to get VMs or physical machines for a Cassandra cluster. I know this is a very high-level question depending on lots of factors, and in fact I want to know how to tackle it and what factors we should take into consideration while trying to find the answer. Data size? Writing speed (whether write-heavy use cases or not)? Random-read use cases? Column family design/how we store data? Any pointers, documents, guidance, or advice would be appreciated. Thanks a lot. Regards, Shahab
Re: VMs versus Physical machines
Thanks Aaron for the reply. Yes, VMs or the nodes will be in the cloud if we don't go the physical route. Look how Cassandra scales and provides redundancy. But how does it differ for physical machines or VMs (in the cloud)? Or after your first comment, are you saying that there is no difference whether we use physical or VMs (in the cloud)? Regards, Shahab On Wed, Sep 11, 2013 at 7:34 PM, Aaron Turner synfina...@gmail.com wrote: Physical machines unless you're running your cluster in the cloud (AWS/etc). Reason is simple: Look how Cassandra scales and provides redundancy. Aaron Turner http://synfin.net/ Twitter: @synfinatic https://github.com/synfinatic/tcpreplay - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin On Wed, Sep 11, 2013 at 4:21 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Hello, We are deciding whether to get VMs or physical machines for a Cassandra cluster. I know this is a very high-level question depending on lots of factors, and in fact I want to know how to tackle it and what factors we should take into consideration while trying to find the answer. Data size? Writing speed (whether write-heavy use cases or not)? Random-read use cases? Column family design/how we store data? Any pointers, documents, guidance, or advice would be appreciated. Thanks a lot. Regards, Shahab
Re: Cassandra Reads
It only reads up to that column (a sequential scan, I believe) and does not read the whole row. It uses a row-level column index to reduce the amount of data read. Much more detail at the following (the first two or three are must-reads, in fact): http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docsversion=1.2file=#cassandra/dml/dml_about_reads_c.html http://www.roman10.net/how-apache-cassandra-read-works/ http://wiki.apache.org/cassandra/ArchitectureInternals Regards, Shahab On Fri, Sep 6, 2013 at 6:28 AM, Sridhar Chellappa schellap2...@gmail.com wrote: Folks, When I read column(s) from a table, does Cassandra read only those columns? Or does it read the entire row into memory and then filter out the contents to send only the requested column(s)?
Re: Help on Cassandra Limitaions
Also, Sylvain, you have a couple of great posts about the relationships between CQL3/Thrift entities and naming issues: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts http://www.datastax.com/dev/blog/thrift-to-cql3 I always refer to them when I get confused :) Regards, Shahab On Fri, Sep 6, 2013 at 3:04 AM, Hannu Kröger hkro...@gmail.com wrote: Hi, Well, that was a word-for-word quotation. :) Anyways, I think what you just said is a better explanation than those two previous ones. I hope it ends up on the wiki page, because what it says there now is causing confusion, no matter how technically correct it is :) Cheers, Hannu 2013/9/6 Sylvain Lebresne sylv...@datastax.com Well, I don't know if that's what Patrick replied, but that's not correct. The wording *is* correct, though it does use CQL3 terms. In CQL3, the term partition is used to describe all the (CQL) rows that share the same partition key (if you don't know what the latter is: http://cassandra.apache.org/doc/cql3/CQL.html). So it says that the number of rows sharing a particular partition key, multiplied by their number of effective columns, is capped at 2 billion. In the thrift terminology, this means a 'thrift row' (not to be confused with a CQL3 row) cannot have more than 2 billion 'thrift columns'. -- Sylvain On Fri, Sep 6, 2013 at 7:55 AM, Hannu Kröger hkro...@gmail.com wrote: I asked the same thing earlier and this is what Patrick McFadin replied: It's not worded well. Essentially it's saying there is a 2B limit on a row. It should be worded as a 'CQL row'. I hope this helps. Cheers, Hannu On 6.9.2013, at 8.20, J Ramesh Kumar rameshj1...@gmail.com wrote: Hi, http://wiki.apache.org/cassandra/CassandraLimitations In the above link, I found the below limitation: The maximum number of cells (rows x columns) in a single partition is 2 billion. Here, what does partition mean? Is it a node (or) column family (or) anything else? Thanks, Ramesh
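To tie the terminology together, a small illustrative schema (not from the thread) shows what "partition" means in Sylvain's explanation:

```sql
-- Illustrative only: all CQL rows sharing one sensor_id form a single
-- partition (one 'thrift row' in old terms). The 2-billion-cell cap
-- applies to that partition, i.e. its rows x non-key columns.
CREATE TABLE readings (
    sensor_id text,        -- partition key: determines the partition
    event_time timestamp,  -- clustering column: orders rows within it
    value double,
    PRIMARY KEY (sensor_id, event_time)
);
```

So the limitation is neither per node nor per column family: it is per partition key value.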
Re: Secondary Indexes On Partitioned Time Series Data Question
Hi Robert, Can you shed some more light (or point towards some other resource) on why you think built-in Secondary Indexes should not be used without careful consideration? Thanks. Regards, Shahab On Thu, Aug 1, 2013 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 1, 2013 at 12:49 PM, Gareth Collins gareth.o.coll...@gmail.com wrote: Would this be correct? Just making sure I understand how to best use secondary indexes in Cassandra with time series data. In general, unless you ABSOLUTELY NEED the one unique feature of built-in Secondary Indexes (atomic update of base row and index), you should just use a normal column family for secondary index cases. =Rob
Re: Secondary Indexes On Partitioned Time Series Data Question
Thanks a lot. Regards, Shahab On Thu, Aug 1, 2013 at 8:32 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 1, 2013 at 2:34 PM, Shahab Yunus shahab.yu...@gmail.comwrote: Can you shed some more light (or point towards some other resource) that why you think built-in Secondary Indexes should not be used easily or without much consideration? Thanks. 1) Secondary indexes are more or less modeled like a manual pseudo Secondary Index CF would be. 2) Except they are more opaque than doing it yourself. For example you cannot see information on them in nodetool cfstats. 3) And there have been a steady trickle of bugs which relate to their implementation, in many cases resulting in them not returning the data they should. [1] 4) These bugs would not apply to a manual pseudo Secondary Index CF. 5) And the only benefits you get are the marginal convenience of querying the secondary index instead of a second CF, and atomic synchronized update. 6) Which most people do not actually need. tl;dr : unless you need the atomic update property, just use a manual pseudo secondary index CF =Rob [1] https://issues.apache.org/jira/browse/CASSANDRA-4785 , https://issues.apache.org/jira/browse/CASSANDRA-5540 , https://issues.apache.org/jira/browse/CASSANDRA-2897 , etc.
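For anyone wanting to see what Rob's "manual pseudo secondary index CF" looks like in practice, here is a minimal sketch (the table and column names are mine, purely illustrative):

```sql
-- Base table, looked up by its natural primary key.
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    email text,
    name text
);

-- Manual "index" table, maintained by the application instead of
-- CREATE INDEX: the indexed value becomes the partition key.
CREATE TABLE users_by_email (
    email text,
    user_id uuid,
    PRIMARY KEY (email, user_id)
);

-- The application writes to both tables on every insert/update;
-- lookups by email then hit the index table directly:
SELECT user_id FROM users_by_email WHERE email = 'a@example.com';
```

The trade-off is exactly the one in points 5-6 above: the two writes are not atomic, so a failure between them can leave the index briefly out of sync, which is acceptable for most workloads.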
Re: VM dimensions for running Cassandra and Hadoop
Hi Jan, One question... you say - I must make sure the disks are directly attached, to prevent problems when multiple nodes flush the commit log at the same time. What do you mean by that? Thanks, Shahab On Wed, Jul 31, 2013 at 3:10 AM, Jan Algermissen jan.algermis...@nordsc.com wrote: Jon, On 31.07.2013, at 08:15, Jonathan Haddad j...@jonhaddad.com wrote: Having just enough RAM to hold the JVM's heap generally isn't a good idea unless you're not planning on doing much with the machine. Yes, I agree. Two questions though: - Do you think that using a JVM heap of, for example, 12 GB and having 16 available is a bad ratio for a 'simple' Cassandra node? - As all of my queries will likely ask for most of a row's data, it seems enabling the row cache will be a good thing. AFAIU the row cache is kept outside the JVM, so I should then probably get loads more RAM to account for the row cache? Hmm, having said that, I wonder what goes into the JVM heap anyhow. As far as caches are concerned, it seems only the key cache is inside the JVM heap. Does it make sense to have a heap size that is much larger than the amount of storage necessary for all my keys (plus some overhead, of course)? Jan Any memory not allocated to a process will generally be put to good use serving as page cache. See here: http://en.wikipedia.org/wiki/Page_cache Jon On Tue, Jul 30, 2013 at 10:51 PM, Jan Algermissen jan.algermis...@nordsc.com wrote: Hi, thanks for the helpful replies last week. It looks as if I will deploy Cassandra on a bunch of VMs and I am now in the process of understanding what the dimensions of the VMs should be. 
So far, I understand the following: - I need at least 3 VMs for a minimal Cassandra setup - I should get another VM to run the Hadoop job controller, or can that run on one of the Cassandra VMs? - there is no point in giving the Cassandra JVMs more than 8-12 GB heap space because of GC, so it seems going beyond 16 GB RAM per VM makes no sense - each VM needs two disks, to separate the commit log from data storage - I must make sure the disks are directly attached, to prevent problems when multiple nodes flush the commit log at the same time - I'll be having rather few writes and intend to hold most of the data in memory, so spinning disks are fine for the moment Does that seem reasonable? How should I plan the disk sizes and number of CPU cores? Are there any other configuration mistakes to avoid? Is there online documentation that discusses such VM sizing questions in more detail? Jan -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: MapReduce response time and speed
You have a lot of questions there, so I can't answer them all, but for the following: *Can a user of the system define new jobs in an ad-hoc fashion (like a query) or do map reduce jobs need to be prepared by a developer (e.g. in RIAK you need a developer to compile in the job when you need the performance of Erlang-based jobs). Suppose a user indeed can specify a job and send it off to Cassandra for processing, what is the expected response time?* You can use high-level tools like Pig, Hive and Oozie. But mind you, it will depend on your data size, the complexity of the job, the cluster, and tuning parameters. Regards, Shahab On Wed, Jul 24, 2013 at 10:33 AM, Jan Algermissen jan.algermis...@nordsc.com wrote: Hi, I am Jan Algermissen (REST-head, freelance programmer/consultant) and Cassandra newbie. I am looking at Cassandra for an application I am working on. There will be a max. of 10 million items (texts and attributes of a retailer's products) in the database. There will be occasional writes (e.g. price updates). The use case for the application is to work on the whole data set, item by item, to produce 'exports'. It will be necessary to access the full set every time. There is no relationship between the items. Processing is done iteratively. My question: I am thinking that this is an ideal scenario for map-reduce, but I am unsure about two things: Can a user of the system define new jobs in an ad-hoc fashion (like a query), or do map-reduce jobs need to be prepared by a developer (e.g. in RIAK you need a developer to compile in the job when you need the performance of Erlang-based jobs)? Suppose a user indeed can specify a job and send it off to Cassandra for processing; what is the expected response time? Is it possible to reduce the response time (by tuning, adding more nodes) to make a result available within a couple of minutes? Or will there most certainly be a gap of 10 minutes or so and more? 
I understand that map-reduce is not for ad-hoc 'querying', but my users expect the system to feel quasi-interactive, because they intend to refine the processing job based on the results they get. A short gap would be ok, but a definite gap on the order of 10+ minutes would not. (For example, as far as I learned, with RIAK you would most certainly have such a gap. How about Cassandra? Throwing more nodes at the problem would be ok, I just need to understand whether there is a definite 'response time penalty' I have to expect no matter what.) Jan
Re: Unable to describe table in CQL 3
Rahul, See this as it was discussed earlier: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Representation-of-dynamically-added-columns-in-table-column-family-schema-using-cqlsh-td7588997.html Regards, Shahab On Tue, Jul 23, 2013 at 2:51 PM, Rahul Gupta rgu...@dekaresearch.com wrote: I am using Cassandra ver 1.1.9.7. Created a Column Family using cassandra-cli: create column family events with comparator = 'CompositeType(DateType,UTF8Type)' and key_validation_class = 'UUIDType' and default_validation_class = 'UTF8Type'; I can describe this CF using CQL2 but get an error when trying the same describe with CQL3: cqlsh:CQI desc table events; /usr/lib/python2.6/site-packages/cqlshlib/cql3handling.py:852: UnexpectedTableStructure: Unexpected table structure; may not translate correctly to CQL. expected composite key CF to have column aliases, but found none /usr/lib/python2.6/site-packages/cqlshlib/cql3handling.py:875: UnexpectedTableStructure: Unexpected table structure; may not translate correctly to CQL. expected [u'KEY'] length to be 2, but it's 1. comparator='org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.UTF8Type)' CREATE TABLE events ( KEY uuid PRIMARY KEY ) WITH comment='' AND caching='KEYS_ONLY' AND read_repair_chance=0.10 AND gc_grace_seconds=864000 AND replicate_on_write='true' AND compaction_strategy_class='SizeTieredCompactionStrategy' AND compression_parameters:sstable_compression='SnappyCompressor'; Any ideas why CQL3 won't display composite columns? What should be done to make them compatible? Thanks, Rahul Gupta DEKA Research & Development http://www.dekaresearch.com/ 340 Commercial St Manchester, NH 03101 P: 603.666.3908 extn.
6504 | C: 603.718.9676 This e-mail and the information, including any attachments, it contains are intended to be a confidential communication only to the person or entity to whom it is addressed and may contain information that is privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender and destroy the original message.
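For reference, on a Cassandra version whose cqlsh can translate such a CF (1.2 and later), a Thrift column family with that composite comparator would surface in CQL3 roughly as the following sketch; the generic `column1`/`column2`/`value` names are what cqlsh generates when no column aliases exist:

```sql
-- Hypothetical CQL3 view of the Thrift CF above: the CompositeType
-- comparator (DateType, UTF8Type) becomes two clustering columns.
CREATE TABLE events (
  key uuid,
  column1 timestamp,   -- first composite component (DateType)
  column2 text,        -- second composite component (UTF8Type)
  value text,          -- default_validation_class (UTF8Type)
  PRIMARY KEY (key, column1, column2)
) WITH COMPACT STORAGE;
```

The error in 1.1.9 comes from cqlsh expecting column aliases that the cli-created CF does not carry; the sketch above only shows the intended shape, not a command to run against the existing CF.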
Re: Representation of dynamically added columns in table (column family) schema using cqlsh
See this as it was discussed earlier: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Representation-of-dynamically-added-columns-in-table-column-family-schema-using-cqlsh-td7588997.html Regards, Shahab On Fri, Jul 12, 2013 at 11:13 AM, Shahab Yunus shahab.yu...@gmail.com wrote: A basic question, and it seems that I have a gap in my understanding. I have a simple table in Cassandra with multiple column families. I add new columns to each of these column families on the fly. When I view (using the 'DESCRIBE table' command) the schema of a particular column family, I see only one entry for column (bolded below). What is the reason for that? The columns that I am adding have string names and byte values, written using Hector 1.1-3 (HFactory.createColumn(...) method). CREATE TABLE mytable ( key text, *column1* ascii, value blob, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE AND bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND read_repair_chance=1.00 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'}; cqlsh 3.0.2 Cassandra 1.2.5 CQL spec 3.0.0 Thrift protocol 19.36.0 Given this, I can also only query on this one column1 or value using the 'SELECT' statement. The OpsCenter, on the other hand, displays multiple columns as expected. Basically the demarcation of multiple columns is clearer. Thanks a lot. Regards, Shahab
Re: Auto Discovery of Hosts by Clients
Thanks for your replies. Regards, Shahab On Sun, Jul 21, 2013 at 4:49 PM, aaron morton aa...@thelastpickle.com wrote: Give the app the same nodes you have in the seed lists. Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 20/07/2013, at 9:32 AM, sankalp kohli kohlisank...@gmail.com wrote: With auto discovery, you can provide the DC you are local to and it will only use hosts from that. On Fri, Jul 19, 2013 at 2:08 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Hello, I want my Thrift client(s) (using Hector 1.1-3) to randomly connect to any node in the Cassandra (1.2.4) cluster. 1- One way is to pass a comma-separated list of hosts and ports to the CassandraHostConfigurator object. 2- The other option is to configure auto discovery of hosts (through setAutoDiscoverHosts and related methods) on the CassandraHostConfigurator object while passing only one host/port pair. Is one way better than the other, or do both have their pros and cons depending on the use case? In case of 1, it can become unwieldy if the cluster grows. In number 2, would I have to be extra careful while adding/removing nodes (will it conflict with bootstrapping) or is it business as usual? I don't expect to have a multi-DC setup in the near future, but I believe that would be one consideration. Is there any other method that I am missing? Does it depend on or vary with the client API that I am using? Thanks a lot. Regards, Shahab
Re: Socket buffer size
I think the former is for client communication to the nodes, and the latter for communication between the nodes themselves, as evidenced by the name of the property. Please feel free to correct me if I am wrong. Regards, Shahab On Saturday, July 20, 2013, Mohammad Hajjat wrote: Hi, What's the difference between rpc_send_buff_size_in_bytes and internode_send_buff_size_in_bytes? I need to set my TCP socket buffer size (for both transmit/receive) to a given value and I wasn't sure of the relation between these two configurations. Is there any recommendation? Do they have to be equal, one less than the other, etc.? The documentation here is not really helping much! http://www.datastax.com/docs/1.2/configuration/node_configuration#rpc-send-buff-size-in-bytes Thanks! -- Mohammad Hajjat Ph.D. Student Electrical and Computer Engineering Purdue University
Auto Discovery of Hosts by Clients
Hello, I want my Thrift client(s) (using Hector 1.1-3) to randomly connect to any node in the Cassandra (1.2.4) cluster. 1- One way is to pass a comma-separated list of hosts and ports to the CassandraHostConfigurator object. 2- The other option is to configure auto discovery of hosts (through setAutoDiscoverHosts and related methods) on the CassandraHostConfigurator object while passing only one host/port pair. Is one way better than the other, or do both have their pros and cons depending on the use case? In case of 1, it can become unwieldy if the cluster grows. In number 2, would I have to be extra careful while adding/removing nodes (will it conflict with bootstrapping) or is it business as usual? I don't expect to have a multi-DC setup in the near future, but I believe that would be one consideration. Is there any other method that I am missing? Does it depend on or vary with the client API that I am using? Thanks a lot. Regards, Shahab
Re: IllegalArgumentException on query with AbstractCompositeType
Aaron Morton can confirm, but I think one problem could be that creating an index on a field with a small number of possible values (low cardinality) is not good for performance. Regards, Shahab On Sat, Jul 13, 2013 at 9:14 AM, Tristan Seligmann mithra...@mithrandi.net wrote: On Fri, Jul 12, 2013 at 10:38 AM, aaron morton aa...@thelastpickle.com wrote: CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag); In general this is a bad idea in Cassandra (also in a relational DB IMHO). You will get poor performance from it. Could you elaborate on why this is a bad idea? -- mithrandi, i Ainil en-Balandor, a faer Ambar
Representation of dynamically added columns in table (column family) schema using cqlsh
A basic question, and it seems that I have a gap in my understanding. I have a simple table in Cassandra with multiple column families. I add new columns to each of these column families on the fly. When I view (using the 'DESCRIBE table' command) the schema of a particular column family, I see only one entry for column (bolded below). What is the reason for that? The columns that I am adding have string names and byte values, written using Hector 1.1-3 (HFactory.createColumn(...) method). CREATE TABLE mytable ( key text, *column1* ascii, value blob, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE AND bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND read_repair_chance=1.00 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'}; cqlsh 3.0.2 Cassandra 1.2.5 CQL spec 3.0.0 Thrift protocol 19.36.0 Given this, I can also only query on this one column1 or value using the 'SELECT' statement. The OpsCenter, on the other hand, displays multiple columns as expected. Basically the demarcation of multiple columns is clearer. Thanks a lot. Regards, Shahab
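As a sketch of how CQL3 addresses those dynamic columns through the single column1/value pair in the schema above (row key and column name below are illustrative): in a COMPACT STORAGE table, each Hector-written column becomes one CQL3 row, keyed by the clustering column.

```sql
-- A specific dynamic column is selected by its name via the
-- clustering column (column1 holds the dynamic column's name):
SELECT value FROM mytable WHERE key = 'row1' AND column1 = 'someColumn';

-- All dynamic columns of one storage row come back as multiple CQL3 rows:
SELECT column1, value FROM mytable WHERE key = 'row1';
```

This is why DESCRIBE shows only one column entry: column1/value is the generic slot through which every dynamic name/value pair is exposed.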
Re: Representation of dynamically added columns in table (column family) schema using cqlsh
Thanks Eric for the explanation. Regards, Shahab On Fri, Jul 12, 2013 at 11:13 AM, Shahab Yunus shahab.yu...@gmail.com wrote: A basic question, and it seems that I have a gap in my understanding. I have a simple table in Cassandra with multiple column families. I add new columns to each of these column families on the fly. When I view (using the 'DESCRIBE table' command) the schema of a particular column family, I see only one entry for column (bolded below). What is the reason for that? The columns that I am adding have string names and byte values, written using Hector 1.1-3 (HFactory.createColumn(...) method). CREATE TABLE mytable ( key text, *column1* ascii, value blob, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE AND bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND read_repair_chance=1.00 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'}; cqlsh 3.0.2 Cassandra 1.2.5 CQL spec 3.0.0 Thrift protocol 19.36.0 Given this, I can also only query on this one column1 or value using the 'SELECT' statement. The OpsCenter, on the other hand, displays multiple columns as expected. Basically the demarcation of multiple columns is clearer. Thanks a lot. Regards, Shahab
Re: what happen if coordinator node fails during write
Aaron, Can you explain a bit when you say that the client needs to support Atomic Batches in 1.2 and Hector doesn't support it? Does it mean that there is no way of using an atomic batch of inserts through Hector? Or did I misunderstand you? Feel free to point me to any link or resource, thanks. Regards, Shahab On Friday, June 28, 2013, aaron morton wrote: As far as I know in 1.2 coordinator logs request before it updates replicas. You may be thinking about atomic batches, which are enabled by default for 1.2 via CQL but must be supported by Thrift clients. I would guess Hector is not using them. These logs are stored on other machines, which then replay the mutation if they have not been removed by a certain time. I am writing data to Cassandra by thrift client (not hector) and wonder what happen if the coordinator node fails. How and when it fails is important. But let's say there was an OS-level OOM situation and the process was killed just after it sent messages to the remote replicas. In that case all you know is that the request was applied on 0 to RF number of replicas. So it's the same as a TimedOutException. The request did not complete at the requested CL, so reads to that data will be working with eventual consistency until the next successful write. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 26/06/2013, at 12:45 PM, Andrey Ilinykh ailin...@gmail.com wrote: It depends on the cassandra version. As far as I know, in 1.2 the coordinator logs the request before it updates replicas. If it fails it will replay the log on startup. In 1.1 you may have an inconsistent state, because only part of your request is propagated to replicas.
Thank you, Andrey On Tue, Jun 25, 2013 at 5:11 PM, Jiaan Zeng ji...@bloomreach.com wrote: Hi there, I am writing data to Cassandra by thrift client (not hector) and wonder what happen if the coordinator node fails. The same question applies for bulk loader which uses gossip protocol instead of thrift protocol. In my understanding, the HintedHandoff only takes care of the replica node fails. Thanks. -- Regards, Jiaan
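For reference, the CQL3 atomic batch syntax Aaron refers to looks roughly like the sketch below (table and column names are illustrative). In 1.2, batches issued through CQL are logged (atomic) by default; UNLOGGED opts out:

```sql
-- Logged (atomic) batch: the coordinator writes the batch to a batchlog
-- on other nodes first, so the mutations are replayed if it dies mid-write.
BEGIN BATCH
  INSERT INTO user_actions (user_id, action) VALUES (1, 'login');
  INSERT INTO actions_by_type (action, user_id) VALUES ('login', 1);
APPLY BATCH;

-- When atomicity across the writes is not needed, skip the batchlog:
-- BEGIN UNLOGGED BATCH ... APPLY BATCH;
```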
Re: block size
Have you seen this? http://www.datastax.com/dev/blog/cassandra-file-system-design Regards, Shahab On Thu, Jun 20, 2013 at 3:17 PM, Kanwar Sangha kan...@mavenir.com wrote: Hi – What is the block size for Cassandra ? is it taken from the OS defaults ?
Re: block size
Ok. Though the closest that I can find is this (Aaron Morton's great blog): http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ I would also like to know the answer, as I haven't come across 'block size' as a core concept (or a concept to be considered while developing with) Cassandra, unlike Hadoop. Regards, Shahab On Thu, Jun 20, 2013 at 3:38 PM, Kanwar Sangha kan...@mavenir.com wrote: Yes. Is that not specific to Hadoop with CFS? I want to know: if I have data in a column of size 500KB, how many IOPS are needed to read it? (assuming we have key cache enabled) From: Shahab Yunus [mailto:shahab.yu...@gmail.com] Sent: 20 June 2013 14:32 To: user@cassandra.apache.org Subject: Re: block size Have you seen this? http://www.datastax.com/dev/blog/cassandra-file-system-design Regards, Shahab On Thu, Jun 20, 2013 at 3:17 PM, Kanwar Sangha kan...@mavenir.com wrote: Hi – What is the block size for Cassandra? Is it taken from the OS defaults?
Re: Dropped mutation messages
Hello Arthur, What do you mean by "The queries need to be lightened"? Thanks, Shahab On Tue, Jun 18, 2013 at 8:47 PM, Arthur Zubarev arthur.zuba...@aol.com wrote: Cem hi, as per http://wiki.apache.org/cassandra/FAQ#dropped_messages Internode messages which are received by a node but do not get processed within rpc_timeout are dropped rather than processed, as the coordinator node will no longer be waiting for a response. If the coordinator node does not receive Consistency Level responses before the rpc_timeout, it will return a TimedOutException to the client. If the coordinator receives Consistency Level responses, it will return success to the client. For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair. For READ messages this means a read request may not have completed. Load shedding is part of the Cassandra architecture; if this is a persistent issue it is generally a sign of an overloaded node or cluster. By the way, I am on C* 1.2.4 too in dev mode; after having my node filled with 400 GB I started getting RPC timeouts on large data retrievals, so in short, you may need to revise how you query. The queries need to be lightened. /Arthur From: cem cayiro...@gmail.com Sent: Tuesday, June 18, 2013 1:12 PM To: user@cassandra.apache.org Subject: Dropped mutation messages Hi All, I have a cluster of 5 nodes with C* 1.2.4. Each node has 4 disks of 1 TB each. I see a lot of dropped messages after it stores 400 GB per disk (1.6 TB per node). The recommendation was 500 GB max per node before 1.2. Datastax says that we can store terabytes of data per node with 1.2. http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning Do I need to enable anything to leverage this in 1.2? Do you have any other advice? What should be the path to investigate this? Thanks in advance! Best Regards, Cem.
Re: Unit Testing Cassandra
Thanks Stephen for your reply and explanation. My bad that I mixed those up and wasn't clear enough. Yes, I have 2 different requests/questions. 1) One is for unit testing. 2) The second (which I am more interested in) is for performance (stress/load) testing. Let us keep integration aside for now. I do see some stuff out there but wanted to know recommendations from the community given their experience. Regards, Shahab On Wed, Jun 19, 2013 at 3:15 AM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: Unit testing means testing in isolation the smallest part. Unit tests should not take more than a few milliseconds to set up and verify their assertions. As such, if your code is not factored well for testing, you would typically use mocking (either by hand, or with mocking libraries) to mock out the bits not under test. Extensive use of mocks is usually a smell of code that is not well designed *for testing*. If you intend to test components integrated together... that is integration testing. If you intend to test performance of the whole or significant parts of the whole... that is performance testing. When searching for the above, you will not get much luck if you are looking for them in the context of unit testing, as those things are *outside the scope of unit testing*. On Wednesday, 19 June 2013, Shahab Yunus wrote: Hello, Can anyone suggest good/popular unit test tools/frameworks/utilities out there for unit testing Cassandra stores? I am looking for testing from a performance/load and monitoring perspective. I am using 1.2. Thanks a lot. Regards, Shahab -- Sent from my phone
Re: Unit Testing Cassandra
Thanks Edward, Ben and Dean for the pointers. Yes, I am using Java and these sound promising for unit testing, at least. Regards, Shahab On Wed, Jun 19, 2013 at 9:58 AM, Edward Capriolo edlinuxg...@gmail.com wrote: You really do not need much; in Java you can use the embedded server. Hector wraps a simple class around this called EmbeddedServerHelper. On Wednesday, June 19, 2013, Ben Boule ben_bo...@rapid7.com wrote: Hi Shahab, Cassandra-Unit has been helpful for us for running unit tests without requiring a real Cassandra instance to be running. We only use this to test our DAO code which interacts with the Cassandra client. It basically starts up an embedded instance of Cassandra and fools your client/driver into using it. It uses a non-standard port and you just need to make sure you can set the port as a parameter in your client code. https://github.com/jsevellec/cassandra-unit One important thing is to either clear out the keyspace in between tests or carefully separate your data so different tests don't collide with each other in the embedded database. Setup/tear-down time is pretty reasonable. Ben From: Shahab Yunus [shahab.yu...@gmail.com] Sent: Wednesday, June 19, 2013 8:46 AM To: user@cassandra.apache.org Subject: Re: Unit Testing Cassandra Thanks Stephen for your reply and explanation. My bad that I mixed those up and wasn't clear enough. Yes, I have 2 different requests/questions. 1) One is for unit testing. 2) The second (which I am more interested in) is for performance (stress/load) testing. Let us keep integration aside for now. I do see some stuff out there but wanted to know recommendations from the community given their experience. Regards, Shahab On Wed, Jun 19, 2013 at 3:15 AM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: Unit testing means testing in isolation the smallest part. Unit tests should not take more than a few milliseconds to set up and verify their assertions.
As such, if your code is not factored well for testing, you would typically use mocking (either by hand, or with mocking libraries) to mock out the bits not under test. Extensive use of mocks is usually a smell of code that is not well designed *for testing* If you intend to test components integrated together... That is integration testing. If you intend to test performance of the whole or significant parts of the whole... That is performance testing. When searching for the above, you will not get much luck if you are looking for them in the context of unit testing as those things are *outside the scope of unit testing On Wednesday, 19 June 2013, Shahab Yunus wrote: Hello, Can anyone suggest a good/popular Unit Test tools/frameworks/utilities out there for unit testing Cassandra stores? I am looking for testing from performance/load and monitoring perspective. I am using 1.2. Thanks a lot. Regards, Shahab -- Sent from my phone This electronic message contains information which may be confidential or privileged. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of the contents of this information is prohibited. If you have received this electronic transmission in error, please notify us by e-mail at ( postmas...@rapid7.com) immediately.
Unit Testing Cassandra
Hello, Can anyone suggest a good/popular Unit Test tools/frameworks/utilities out there for unit testing Cassandra stores? I am looking for testing from performance/load and monitoring perspective. I am using 1.2. Thanks a lot. Regards, Shahab
Re: Dynamic Columns Question Cassandra 1.2.5, Datastax Java Driver 1.0
Dynamic columns are not supported in CQL3. We just had a discussion a day or two ago about this where Eric Stevens explained it. Please see this: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CQL-3-returning-duplicate-keys-td7588181.html Regards, Shahab On Thu, Jun 6, 2013 at 9:02 AM, Joe Greenawalt joe.greenaw...@gmail.com wrote: Hi, I'm having some problems figuring out how to append a dynamic column to a column family using the Datastax Java driver 1.0 and CQL3 on Cassandra 1.2.5. Below is what I'm trying: cqlsh:simplex create table user (firstname text primary key, lastname text); cqlsh:simplex insert into user (firstname, lastname) values ('joe','shmoe'); cqlsh:simplex select * from user; firstname | lastname ---+-- joe | shmoe cqlsh:simplex insert into user (firstname, lastname, middlename) values ('joe','shmoe','lester'); Bad Request: Unknown identifier middlename cqlsh:simplex insert into user (firstname, lastname, middlename) values ('john','shmoe','lester'); Bad Request: Unknown identifier middlename I'm assuming you can do this based on previous Thrift-based clients like pycassa, and also by reading this: The Cassandra data model is a dynamic schema, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application. here: http://www.datastax.com/docs/1.2/ddl/index Is it a limitation of CQL3 and its connection vs. thrift? Or more likely am I just doing something wrong? Thanks, Joe
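A hedged sketch of the CQL3 collections alternative discussed in the linked thread, reusing Joe's user table (the `extras` column name is illustrative, not from the original posts): in 1.2, a map column can carry the ad-hoc attributes that would have been dynamic columns.

```sql
CREATE TABLE user (
  firstname text PRIMARY KEY,
  lastname text,
  extras map<text, text>   -- ad-hoc attributes live here instead of dynamic columns
);

INSERT INTO user (firstname, lastname, extras)
VALUES ('joe', 'shmoe', {'middlename': 'lester'});

-- Add another attribute later without changing the schema:
UPDATE user SET extras['nickname'] = 'jj' WHERE firstname = 'joe';
```

Note that a collection is read in its entirety to access any single element, so it suits small sets of attributes rather than unbounded ones.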
Re: CQL 3 returning duplicate keys
Thanks Eric. Yeah, I was asking about the second limitation (about dynamic columns) and you have explained it well along with pointers to read further. Regards, Shahab On Wed, Jun 5, 2013 at 8:18 AM, Eric Stevens migh...@gmail.com wrote: I mentioned a few limitations, so I'm not sure which you refer to. As for not being able to access a CQL3 column family via traditional approaches, beyond the example I gave above (where cassandra-cli claims it does not recognize the column family), here is an article that mentions it: http://www.datastax.com/dev/blog/whats-new-in-cql-3-0 As for not being able to insert dynamic columns, here is what happens if you try: cqlsh:test insert into test3(key,c1,c2,newcol) values ('a3','a3c1','a3c2','a3newcol'); Bad Request: Unknown identifier newcol This is probably alarming, but don't fret, there is an alternative to dynamic columns, and that's the new support for CQL3 collections (see http://www.datastax.com/dev/blog/cql3_collections). You have access to sets, lists, and maps, as column types, which can be very useful. Do note that you should be careful to limit the size of a given collection because collections are read in their entirety in order to access a single element of the collection (see that article for more details). Also, the traditional Thrift / column family approach is not deprecated, CQL3 is just an alternative (and noncompatible) approach. If you have a data model that's working for you, stick with Thrift / CQL2. See Mixing static and dynamic at http://www.datastax.com/dev/blog/thrift-to-cql3 As for standard column families representing as one row per key/column pair, you can read more about that here: http://www.datastax.com/dev/blog/thrift-to-cql3 - this is also in the Mixing static and dynamic section, a little farther down. On Tue, Jun 4, 2013 at 3:00 PM, Shahab Yunus shahab.yu...@gmail.comwrote: Thanks Eric for the detailed explanation but can you point to a source or document for this restriction in CQL3 tables? 
Doesn't it take away the main feature of the NoSQL store? Or am I missing something obvious here? Regards, Shahab On Tue, Jun 4, 2013 at 2:12 PM, Eric Stevens migh...@gmail.com wrote: If this is a standard column family, not a CQL3 table, then using CQL3 will not give you the results you expect. From cassandra-cli, let's set up some test data: [default@unknown] create keyspace test; [default@unknown] use test; [default@test] create column family test; [default@test] set test['a1']['c1'] = 'a1c1'; [default@test] set test['a1']['c2'] = 'a1c2'; [default@test] set test['a2']['c1'] = 'a2c1'; [default@test] set test['a2']['c2'] = 'a2c2'; Two rows with two columns each, right? Not as far as CQL3 is concerned: cqlsh use test; cqlsh:test select * from test; key | column1 | value -+-+ a2 |0xc1 | 0xa2c1 a2 |0xc2 | 0xa2c2 a1 |0xc1 | 0xa1c1 a1 |0xc2 | 0xa1c2 Basically for CQL3, without the additional metadata and enforcement that is established by having created the column family as a CQL3 table, CQL will treat each key/column pair as a separate row for CQL purposes. This is most likely at least in part due to the fact that CQL3 tables *cannot have arbitrary columns* like standard column families can. It wouldn't know what columns are available for display. This also exposes some of the underlying structure behind CQL3 tables. CQL 3 is not reverse compatible with CQL 2 for most things, so if you can, migrate your data to a CQL3 table. The equivalent structure in CQL3 tables: cqlsh:test create table test3 (key text PRIMARY KEY, c1 text, c2 text); cqlsh:test INSERT INTO test3(key, c1, c2) VALUES ('a1', 'a1c1', 'a1c2'); cqlsh:test INSERT INTO test3(key, c1, c2) VALUES ('a2', 'a2c1', 'a2c2'); cqlsh:test select * from test3; key | c1 | c2 -+--+-- a2 | a2c1 | a2c2 a1 | a1c1 | a1c2 This comes with many important restrictions, one of which as mentioned is that you cannot have arbitrary columns in a CQL3 table, just like you cannot in a traditional relational database.
Likewise you cannot use traditional approaches to populating data into a CQL3 table: [default@test] get test3['a1']; test3 not found in current keyspace. [default@test] set test3['a3']['c1'] = 'a3c1'; test3 not found in current keyspace. [default@test] describe test3; WARNING: CQL3 tables are intentionally omitted from 'describe' output. On Tue, Jun 4, 2013 at 12:56 PM, ekaqu something ekaqu1...@gmail.com wrote: I run a 1.1 cluster and am currently testing out a 1.2 cluster. I have noticed that with 1.2 it switched to CQL3, which is acting differently than I would expect. When I do select key from \cf\; I get many many duplicate keys. When I did the same with CQL 2 I only got the keys defined. This seems to also be the case for count(*): in cql2 it would return the number of keys I have, in 3 it returns way more than I really have.
Re: Multiple JBOD data directory
Though I am a newbie, I just had a thought regarding your question 'How will it handle requests for data which is unavailable?': wouldn't the data in that case be served from other nodes where it has been replicated? Regards, Shahab On Wed, Jun 5, 2013 at 5:32 AM, Christopher Wirt chris.w...@struq.com wrote: Hello, We're thinking about using multiple data directories, each with its own disk, and are currently testing this against a RAID0 config. I've seen that there is failure handling with multiple JBOD. e.g. We have two data directories mounted to separate drives /disk1 /disk2 One of the drives fails. Will Cassandra continue to work? How will it handle requests for data which is unavailable? If I want to add an additional drive, what is the best way to go about redistributing the data? Thanks, Chris
Re: CQL 3 returning duplicate keys
Thanks Eric for the detailed explanation, but can you point to a source or document for this restriction in CQL3 tables? Doesn't it take away the main feature of the NoSQL store? Or am I missing something obvious here? Regards, Shahab On Tue, Jun 4, 2013 at 2:12 PM, Eric Stevens migh...@gmail.com wrote: If this is a standard column family, not a CQL3 table, then using CQL3 will not give you the results you expect. From cassandra-cli, let's set up some test data: [default@unknown] create keyspace test; [default@unknown] use test; [default@test] create column family test; [default@test] set test['a1']['c1'] = 'a1c1'; [default@test] set test['a1']['c2'] = 'a1c2'; [default@test] set test['a2']['c1'] = 'a2c1'; [default@test] set test['a2']['c2'] = 'a2c2'; Two rows with two columns each, right? Not as far as CQL3 is concerned: cqlsh use test; cqlsh:test select * from test; key | column1 | value -+-+ a2 |0xc1 | 0xa2c1 a2 |0xc2 | 0xa2c2 a1 |0xc1 | 0xa1c1 a1 |0xc2 | 0xa1c2 Basically for CQL3, without the additional metadata and enforcement that is established by having created the column family as a CQL3 table, CQL will treat each key/column pair as a separate row for CQL purposes. This is most likely at least in part due to the fact that CQL3 tables *cannot have arbitrary columns* like standard column families can. It wouldn't know what columns are available for display. This also exposes some of the underlying structure behind CQL3 tables. CQL 3 is not reverse compatible with CQL 2 for most things, so if you can, migrate your data to a CQL3 table.
The equivalent structure in CQL3 tables cqlsh:test create table test3 (key text PRIMARY KEY, c1 text, c2 text); cqlsh:test INSERT INTO test3(key, c1, c2) VALUES ('a1', 'a1c1', 'a1c2'); cqlsh:test INSERT INTO test3(key, c1, c2) VALUES ('a2', 'a2c1', 'a2c2'); cqlsh:test select * from test3; key | c1 | c2 -+--+-- a2 | a2c1 | a2c2 a1 | a1c1 | a1c2 This comes with many important restrictions, one of which as mentioned is that you cannot have arbitrary columns in a CQL3 table, just like you cannot in a traditional relational database. Likewise you cannot use traditional approaches to populating data into a CQL3 table: [default@test] get test3['a1']; test3 not found in current keyspace. [default@test] set test3['a3']['c1'] = 'a3c1'; test3 not found in current keyspace. [default@test] describe test3; WARNING: CQL3 tables are intentionally omitted from 'describe' output. On Tue, Jun 4, 2013 at 12:56 PM, ekaqu something ekaqu1...@gmail.comwrote: I run a 1.1 cluster and currently testing out a 1.2 cluster. I have noticed that with 1.2 it switched to CQL3 which is acting differently than I would expect. When I do select key from \cf\; I get many many duplicate keys. When I did the same with CQL 2 I only get the keys defined. This seems to also be the case for count(*), in cql2 it would return the number of keys i have, in 3 it returns way more than i really have. $ cqlsh `hostname` EOF use keyspace; select count(*) from cf; EOF count --- 1 Default LIMIT of 1 was used. Specify your own LIMIT clause to get more results. $ cqlsh `hostname` -3 EOF use keyspace; select count(*) from cf; EOF count --- 1 Default LIMIT of 1 was used. Specify your own LIMIT clause to get more results. $ cqlsh `hostname` -2 EOF use keyspace; select count(*) from cf; EOF count --- 1934 1934 rows have really been inserted. Is there something up with cql3 or is there something else going on? Thanks for your time reading this email.