Re: Announcing Mutagen

2013-05-17 Thread Blair Zajac

On 5/16/13 10:22 PM, Todd Fast wrote:

Mutagen Cassandra is a framework providing schema versioning and
mutation for Apache Cassandra. It is similar to Flyway for SQL databases.

https://github.com/toddfast/mutagen-cassandra

Mutagen is a lightweight framework for applying versioned changes (known
as mutations) to a resource, in this case a Cassandra schema. Mutagen
takes into account the resource's existing state and only applies
changes that haven't yet been applied.
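The core idea can be sketched in a few lines of Python. This is an illustration of the general versioned-mutation pattern, not Mutagen's actual API; the CQL strings and names below are made up for the example:

```python
# Sketch of the versioned-mutation pattern (not Mutagen's actual API).
# `mutations` maps a version number to the change for that version;
# `applied` records which versions the resource has already seen.

def apply_pending(mutations, applied, execute):
    """Run, in order, only the mutations whose version is not yet applied."""
    for version in sorted(mutations):
        if version in applied:
            continue  # this change already ran against the schema
        execute(mutations[version])
        applied.add(version)
    return applied

# Example: version 1 was applied earlier, so only version 2 runs.
ran = []
applied = apply_pending(
    {1: "CREATE TABLE users (...)", 2: "ALTER TABLE users ADD email"},
    applied={1},
    execute=ran.append,
)
```

The point is simply that the framework consults the resource's recorded state before applying anything, which is what makes repeated runs safe.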


Hi Todd,

Looking at your code, I see you have the ColumnPrefixDistributedRowLock 
commented out.  Could it be that the mutation is taking longer than a 
second to run?  Are the failures only happening when testing simultaneous 
updates?  Maybe the locks aren't being cleaned up?


Funny timing, I'm working on porting Scala Migrations [1] to Cassandra 
and have a working implementation.  It's not as fancy as Scala 
Migrations (it doesn't scan a package for migration subclasses and it 
currently doesn't do rollbacks) but it gets the basics done.  Hoping to 
release code in the near future.


Differences from Mutagen:

1) Mutations are written only in Scala.
2) Since it's a new project, it uses a Java Driver session instead of an 
Astyanax connection, since I only intend to use CQL3 tables.


Blair

[1] http://code.google.com/p/scala-migrations/


Re: how to access data only on specific node

2013-05-17 Thread Sergey Naumov
Oh, I finally understand. As I read records one by one, they aren't
necessarily read from a single node, so if I got 965 records out of 1000,
some of them could have been read from other nodes which have all 1000 records.

And about range scans - as far as I understand, a range scan can be done
only with the Order Preserving Partitioner, not with the Random Partitioner...
It would be cool to have a consistency level of LOCAL to examine the content
of a local node for test purposes.
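For what it's worth, under the Random Partitioner you can at least compute which node a key maps to, since the key's token is just its MD5 digest taken as a non-negative 128-bit integer. A rough sketch (the ring layout in the test is invented; real tokens come from `nodetool ring`):

```python
import hashlib

def rp_token(key: bytes) -> int:
    """RandomPartitioner token: absolute value of the key's MD5 digest
    interpreted as a signed 128-bit big-endian integer."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owning_node(token, ring):
    """ring: list of (node_token, node_name). A key belongs to the first
    node (in token order) whose token is >= the key's token, wrapping
    around to the lowest-token node past the end of the ring."""
    for node_token, name in sorted(ring):
        if token <= node_token:
            return name
    return min(ring)[1]  # wrapped past the highest token
```

Comparing `rp_token(key)` against the node tokens shown by `nodetool ring` tells you which node is the primary replica for that key.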


2013/5/17 aaron morton aa...@thelastpickle.com

 Are you using a multi get or a range slice ?

 Read Repair does not run for range slice queries.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 15/05/2013, at 6:51 PM, Sergey Naumov sknau...@gmail.com wrote:

 see that RR works, but sometimes the number of records read degrades.

 RR is enabled on a random 10% of requests, see the read_repair_chance
 setting for the CF.

 OK, but I forgot to mention the main thing - each node in my config is a
 standalone datacenter and the distribution is DC1:1, DC2:1, DC3:1. So when I
 try to read 1000 records with consistency ONE multiple times while
 connected to a node that has just been turned on, I get the following counts
 of records read (approximately): 120 220 310 390  950 960 965 !! 955 !!
 970 ... If all the other nodes contain 1000 records and read repair has
 already delivered 965 records to the local DC (and so to the local node),
 why do I sometimes see the total number of records read degrade?



 2013/5/15 aaron morton aa...@thelastpickle.com

 see that RR works, but sometimes the number of records read degrades.

 RR is enabled on a random 10% of requests, see the read_repair_chance
 setting for the CF.

  If so, then the question is: how to perform local reads to examine
 content of specific node?

 You can check which nodes are replicas for a key using
 nodetool getendpoints

 If you want to read all the rows on a particular node you need to use a
 range scan and limit it by the token ranges assigned to the node.
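In CQL3 the shape of such a node-limited scan looks roughly like the statement built below. This is only a sketch: the table and column names are placeholders, and the node's actual token ranges would come from `nodetool ring`:

```python
def range_scan_cql(table, start_token, end_token, key_col="key"):
    """Build a CQL3 range-scan statement restricted to one token range.
    Token ranges are half-open: (start_token, end_token]."""
    return (
        "SELECT * FROM {t} WHERE token({k}) > {s} AND token({k}) <= {e}"
        .format(t=table, k=key_col, s=start_token, e=end_token)
    )

# Example with made-up RandomPartitioner tokens for one node's range:
stmt = range_scan_cql("users", 0, 85070591730234615865843651857942052864)
```

Running one such statement per token range owned by the node approximates "read only what this node stores", subject to replication.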

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 14/05/2013, at 10:29 PM, Sergey Naumov sknau...@gmail.com wrote:

 Hello.

 I'm playing with a demo Cassandra cluster and decided to test read repair
 + hinted handoff.

 One node of the cluster was deliberately put down, and on the other nodes I
 inserted some records (say 1000). HH is off on all nodes.
 Then I turned the node on, connected to it with cql (locally, so to
 localhost) and performed 1000 reads by row key (with consistency ONE). I
 see that RR works, but sometimes the number of records read degrades.
 Is it because consistency ONE and local reads are not the same thing? If so,
 then the question is: how do I perform local reads to examine the content of
 a specific node?

 Thanks in advance,
 Sergey Naumov.







Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-17 Thread Sergey Naumov
If I understand you correctly, GossipingPropertyFileSnitch is useful for
manipulating nodes within a single DC, but to add a new DC without
having to restart every node in all DCs (because seeds are specified in
cassandra.yaml and I need to restart a node after adding a new seed
from the newly created DC), I anyway have to use cassandra-topology.properties
and edit it on every node of the cluster.

By the way, is it necessary to specify seeds if I use PropertyFileSnitch
and there is info in cassandra-topology.properties about all nodes of the
cluster?


2013/5/17 aaron morton aa...@thelastpickle.com

 You should configure the seeds as recommended regardless of the snitch
 used.

 You need to update the yaml file to start using the
 GossipingPropertyFileSnitch, but after that it reads the
 cassandra-rackdc.properties file to get information about the node. It
 uses the information in gossip to get information about the other
 nodes in the cluster.

 If there is no info in gossip about a remote node, because say it has not
 been upgraded, it will fall back to using cassandra-topology.properties.

 Hope that helps.

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 15/05/2013, at 8:10 PM, Sergey Naumov sknau...@gmail.com wrote:

  As far as I understand, GossipingPropertyFileSnitch is supposed to provide
 more flexibility in node addition/removal. But what about the addition of a
 DC? In the datastax documentation (
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes#add-dc) it
 is said that cassandra-topology.properties can be updated without a restart
 for PropertyFileSnitch. But here (
 http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) it is
 said that you MUST include at least one node from EACH data center, that it
 is a best practice to have more than one seed node per data center, and that
 the seed list should be the same for each node. At first glance it seems
 that PropertyFileSnitch will get the necessary info from
 cassandra-topology.properties, but for GossipingPropertyFileSnitch
 modification of cassandra.yaml and a restart of all nodes in all DCs will be
 required. Could somebody clarify this topic?
 
  Thanks in advance,
  Sergey Naumov.




Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-17 Thread Igor
I see no reason to restart all nodes. You can continue to use a seed from 
the first DC - the seed is used for loading the ring configuration 
(locations, token ranges, etc.), not data.


On 05/17/2013 10:34 AM, Sergey Naumov wrote:
If I understand you correctly, GossipingPropertyFileSnitch is useful 
for manipulations with nodes within a single DC, but to add a new DC 
without having to restart every node in all DCs (because seeds are 
specified in cassandra.yaml and I need to restart a node after 
addition of a new seed from newly created DC), I anyway have to use 
cassandra-topology.properties and edit it on every node of a cluster.


By the way, is it necessary to specify seeds if I use 
PropertyFileSnitch and there is info in cassandra-topology.properties 
about all nodes of a cluster?




Yes, it is. Cassandra needs seed(s), because the topology properties have no 
info about token ranges.










Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-17 Thread Sergey Naumov
But I've read in some sources (for example
http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) that the
seed list MUST include at least one seed from each DC and that seed lists
should be the same for each node.

Or is it fine if nodes from the new DC have all seeds specified and nodes from
the old DCs have all seeds specified except those from the new DC? In that
interpretation the rules would have to be modified a bit:
1. Nodes in the same DC should have identical seed lists.
2. In at least one DC, nodes MUST have seeds from all other DCs in their seed
lists.
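For reference, the seed list lives under seed_provider in cassandra.yaml, one fragment per node; something like the example below (the addresses are invented for illustration):

```yaml
# cassandra.yaml (fragment) -- example addresses only
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # e.g. one or two seeds per DC, same list on every node
      - seeds: "10.1.0.1,10.2.0.1,10.3.0.1"
```

Since this file is only read at startup, adding a new seed address is what forces the restart being discussed here.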


2013/5/17 Igor i...@4friends.od.ua

  I see no reason to restart all nodes. You can continue to use seed from
 first DC - seed used for loading ring configuration(locations, token
 ranges, etc), not data.







Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-17 Thread Igor

On 05/17/2013 11:19 AM, Sergey Naumov wrote:
But I've read in some sources (for example 
http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) 
that seed list MUST include at least one seed from each DC and seed 
lists should be the same for each node.


Or is it fine if nodes from the new DC have all seeds specified and nodes 
from the old DCs have all seeds specified except seeds from the new DC? In 
such an interpretation the rules have to be modified a bit:


I have never had problems adding new nodes or a new DC while having a single 
seed per cluster, in one old DC.



1. Nodes from the same DC should have identical seeds lists.
2. At least at one DC nodes MUST have in its seed lists seeds from all 
other DCs.












update does not apply to any replica if consistency = ALL and one replica is down

2013-05-17 Thread Sergey Naumov
As described here (
http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/),
if the consistency level can't be met, updates are applied anyway on the
functional replicas, and they can be propagated later to the other replicas
using repair mechanisms or by issuing the same request again, as update
operations are idempotent in Cassandra.

But... on my configuration (Cassandra 1.2.4, python CQL 1.0.4, DC1 - 3
nodes, DC2 - 3 nodes, DC3 - 1 node, RF={DC1:3, DC2:2, DC3:1}, Random
Partitioner, GossipingPropertyFileSnitch, one node in DC1 is deliberately
down - and, as RF for DC1 is 3, this down node is a replica node for 100%
of records),  when I try to insert one record with a consistency level of
ALL, the insert does not appear on any replica (-s30 - is a series of
UUID1s: 001e--1000--x (30 decimal is 1e in hex); -n1 means
that we will insert/update a single record with the first id from this series -
001e--1000--):
*write with consistency ALL:*
cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cALL
Traceback (most recent call last):
  File ./aux/fastinsert.py, line 54, in insert
curs.execute(cmd, consistency_level=p.conlvl)
OperationalError: Unable to complete request: one or more nodes were
unavailable.
Last record UUID is 001e--1000--*

*
about 10 seconds passed...
*
read with consistency ONE:*
cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cONE
Total records read: *0*
Last record UUID is 001e--1000--
*read with consistency QUORUM:*
cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
Total records read: *0*
Last record UUID is 001e--1000--
*write with consistency QUORUM:*
cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cQUORUM
Last record UUID is 001e--1000--
*read with consistency QUORUM:*
cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
Total records read: *1*
Last record UUID is 001e--1000--

Is it a new feature of Cassandra that it does not perform a write to any
replica if consistency can't be satisfied? If so, is it true for
all cases, for example when the returned error is OperationalError: Request
did not complete within rpc_timeout?

Thanks in advance,
Sergey Naumov.


Re: Announcing Mutagen

2013-05-17 Thread Todd Fast
Hi Blair--

Thanks for digging into the code. I did indeed experiment with longer
timeouts and the result was that trying to obtain the lock hung for
whatever amount of time I set the timeout for. I am not an expert on
Astyanax and haven't debugged my use of that recipe yet; I don't even know
if I've configured it correctly. Perhaps you have some guidance?

(Funny you mention your own migration framework--Mutagen is the second one
I've done for Cassandra. The first one, a plugin for Mokol, also had schema
rollbacks and some other features, but was only command-line.)


On Thu, May 16, 2013 at 11:06 PM, Blair Zajac bl...@orcaware.com wrote:

 On 5/16/13 10:22 PM, Todd Fast wrote:

 Mutagen Cassandra is a framework providing schema versioning and
 mutation for Apache Cassandra. It is similar to Flyway for SQL databases.

 https://github.com/toddfast/mutagen-cassandra

 Mutagen is a lightweight framework for applying versioned changes (known
 as mutations) to a resource, in this case a Cassandra schema. Mutagen
 takes into account the resource's existing state and only applies
 changes that haven't yet been applied.


 Hi Todd,

 Looking at your code and you have the ColumnPrefixDistributedRowLock
 commented out.  Could it be that the mutation is taking longer than a
 second to run?  Are they only happening during testing simultaneous
 updates?  Maybe they aren't being cleaned up?

 Funny timing, I'm working on porting Scala Migrations [1] to Cassandra and
 have a working implementation.  It's not as fancy as Scala Migrations (it
 doesn't scan a package for migration subclasses and it currently doesn't do
 rollbacks) but it gets the basics done.  Hoping to release code in the near
 future.

 Differences from Mutagen:

 1) Mutations are written only in Scala.
 2) Since it's a new project, it uses a Java Driver session instead of an
 Astyanax connection, since I only intend to use CQL3 tables.

 Blair

 [1] http://code.google.com/p/scala-migrations/



Re: Announcing Mutagen

2013-05-17 Thread Edward Capriolo
Now that comparators can be changed, I am wondering whether every
column, row key, and value in C* should be a dynamic composite, so that
everything can evolve.


On Fri, May 17, 2013 at 5:35 AM, Todd Fast t...@toddfast.com wrote:

 Hi Blair--

 Thanks for digging into the code. I did indeed experiment with longer
 timeouts and the result was that trying to obtain the lock hung for
 whatever amount of time I set the timeout for. I am not an expert on
 Astyanax and haven't debugged my use of that recipe yet; I don't even know
 if I've configured it correctly. Perhaps you have some guidance?

 (Funny you mention your own migration framework--Mutagen is the second one
 I've done for Cassandra. The first one, a plugin for Mokol, also had schema
 rollbacks and some other features, but was only command-line.)







C language - cassandra

2013-05-17 Thread Apostolis Xekoukoulotakis
Hello, new here. What are my options for using Cassandra from a program
written in C?

A)
Thrift has no documentation, so it will take me time to understand.
Thrift also doesn't have a balancing pool that asks different nodes every
time, which is a big problem.

B)
Should I use the Hector (Java) client and then send the data to my program
with my own protocol?
That seems like a lot of unnecessary work.

Any other suggestions?


-- 


Sincerely yours,

 Apostolis Xekoukoulotakis


Logging Cassandra queries

2013-05-17 Thread Tomàs Núnez
Hi!

For quite some time I've been seeing unexpected load averages on the Cassandra
servers. I suspect there are lots of uncontrolled queries to the Cassandra
servers causing this load, but the developers say that there are none and that
the load is due to Cassandra's internal processes.

Trying to get to the bottom of it, I've been looking into completed ReadStage
and MutationStage counts through JMX, and the numbers seem to confirm my
theory, but I'd like to go one step further and, if possible, list all the
queries from the webservers to the Cassandra cluster (just one node would be
enough).

I've been playing with Cassandra log levels, and I can see when a Read or a
Write is done, but it would be better if I knew the CF of the query.
For my tests I've put log4j.rootLogger=DEBUG,stdout,R in the log4j server
properties, while writing and reading a test CF, and I can't see its name
anywhere.

For the tests I'm using Cassandra 0.8.4 (yes, still), as on my production
servers, and also 1.0.11. Maybe this changed in 1.1? Maybe I'm doing
something wrong? Any hints?

And... could I be more precise when enabling logging? Because right now,
with *log4j.rootLogger=DEBUG,stdout,R* I'm getting a lot of information I
won't ever use, and I'd like to enable just what I need to see gets and
sets.
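One way to narrow it down is to leave the root logger at INFO and raise only the classes that log reads and writes; in the 0.8/1.0 era the Thrift request entry point was org.apache.cassandra.thrift.CassandraServer. Treat the exact logger names below as assumptions to verify against your version's source:

```properties
# log4j-server.properties (fragment) -- logger names are assumptions to verify
log4j.rootLogger=INFO,stdout,R
# Thrift request entry point (get/get_slice/batch_mutate debug lines)
log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG
```

Per-logger levels like this keep the log volume down compared with a DEBUG root logger, which is the problem described above.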

Thanks in advance,
Tomàs


Re: C language - cassandra

2013-05-17 Thread Mina Naguib

Hi Apostolis

I'm the author of libcassie, a C library for cassandra that wraps the C++ 
libcassandra library.  

It's in use in production where I work; however, it has not received much 
traction elsewhere as far as I know.  You can get it here:
https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7

It has not been updated for a while (for example, no CQL support, no pooling 
support).  I've been waiting for either the Thrift C (glib) interface to 
mature, or the thriftless CQL binary protocol to mature, before putting effort 
into updating/rewriting it.  It might however satisfy your needs with its 
current functionality.



On 2013-05-17, at 10:42 AM, Apostolis Xekoukoulotakis xekou...@gmail.com 
wrote:

 Hello, new here. What are my options for using Cassandra from a program 
 written in c?
 
 A)
 Thrift has no documentation, so it will take me time to understand.
 Thrift also doesn't have a balancing pool, asking different nodes every time, 
 which is a big problem.
 
 B)
 Should I use the hector (java) client and then send the data to my program 
 with my own protocol? 
 Seems like a lot of unnecessary work.
 
 Any other suggestions?
 
 
 -- 
 
 Sincerely yours, 
  Apostolis Xekoukoulotakis



Re: best practices on EC2 question

2013-05-17 Thread aaron morton
  b) do people skip backups altogether except for huge outages and just let 
 rebooted server instances come up empty to repopulate via C*?
This one. 
Bootstrapping a new node into the cluster has a small impact on the existing 
nodes, and the new nodes have all the data they need when they finish the 
process.

Cheers
  
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/05/2013, at 3:17 AM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:

 On May 16, 2013, at 17:05 , Brian Tarbox tar...@cabotresearch.com wrote:
 
 An alternative that we had explored for a while was to do a two stage backup:
 1) copy a C* snapshot from the ephemeral drive to an EBS drive
 2) do an EBS snapshot to S3.
 
 The idea being that EBS is quite reliable, S3 is still the emergency backup 
 and copying back from EBS to ephemeral is likely much faster than the 15 
 MB/sec we get from S3.
 
 Yup, this is what we do.  We use rsync with --bwlimit=4000 to copy the 
 snapshots from the eph drive to EBS; this is intentionally very low so that 
 the backup process does not eat our I/O.  This is on m1.xlarge 
 instances; YMMV so measure :).  EBS drives are then snapshot with 
 ec2-consistent-snapshot and old snapshots are expired using 
 ec2-expire-snapshots (I believe these scripts are from Alestic).
 
 /Janne
 



Re: update does not apply to any replica if consistency = ALL and one replica is down

2013-05-17 Thread Bryan Talbot
I think you're conflating may with must.  That article says that
updates may still be applied to some replicas when there is a failure and
I believe that still is the case.  However, if the coordinator knows that
the CL can't be met before even attempting the write, I don't think it will
attempt the write.

-Bryan
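That distinction can be mimicked in a toy model of the coordinator: the availability check happens before any replica is touched, so CL=ALL with a known-down replica writes nowhere, while QUORUM proceeds. This is a deliberate simplification of the real write path, not driver code:

```python
class Unavailable(Exception):
    """Raised before any write is attempted, like UnavailableException."""

def required_acks(consistency, rf):
    """Acks needed for a few common levels (simplified, single-DC view)."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[consistency]

def coordinate_write(consistency, rf, live_replicas, send):
    """If too few replicas are up, fail fast without writing anywhere;
    otherwise send the mutation to every live replica."""
    needed = required_acks(consistency, rf)
    if len(live_replicas) < needed:
        raise Unavailable("need %d replicas, %d up" % (needed, len(live_replicas)))
    for node in live_replicas:
        send(node)

written = []
try:
    coordinate_write("ALL", 3, ["n1", "n2"], written.append)  # one replica down
except Unavailable:
    pass
# written is still empty here: nothing was applied anywhere.
coordinate_write("QUORUM", 3, ["n1", "n2"], written.append)
```

This matches the transcript above: the ALL write returns "one or more nodes were unavailable" and no replica ever sees it, while the QUORUM write lands on the live replicas.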



On Fri, May 17, 2013 at 1:48 AM, Sergey Naumov sknau...@gmail.com wrote:

 As described here (
 http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/),
 if consistency level couldn't be met, updates are applied anyway on
 functional replicas, and they could be propagated later to other replicas
 using repair mechanisms or by issuing the same request later, as update
 operations are idempotent in Cassandra.




Re: C language - cassandra

2013-05-17 Thread Apostolis Xekoukoulotakis
Thanks Mina for your work.

One other option could be to use pycassa and link the code with my C
program, but I have no experience with Python at all. Maybe this would be
better, since pycassa seems to have a strong community.


2013/5/17 Mina Naguib mina.nag...@adgear.com


 Hi Apostolis

 I'm the author of libcassie, a C library for cassandra that wraps the C++
 libcassandra library.

 It's in use in production where I work, however it has not received much
 traction elsewhere as far as I know.  You can get it here:
 https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7

 It has not been updated for a while (for example no CQL support, no
 pooling support).  I've been waiting for either the thrift C-glibc
 interface to mature, or the thriftless-CQL-binary protocol to mature,
 before putting effort into updating/rewriting it.  It might however satisfy
 your needs with its current functionality.








-- 


Sincerely yours,

 Apostolis Xekoukoulotakis


Re: best practices on EC2 question

2013-05-17 Thread Robert Coli
On Fri, May 17, 2013 at 11:13 AM, aaron morton aa...@thelastpickle.com wrote:
 Bootstrapping a new node into the cluster has a small impact on the existing
 nodes, and the new nodes have all the data they need when they finish the
 process.

Sorry for the pedantry, but bootstrapping from existing replicas
cannot guarantee that the new nodes have all the data they need when
they finish the process. There is a non-zero chance that the failed
node contained the single under-replicated copy of a given datum. In
practice if your RF is >= 2, you are unlikely to experience this type
of data loss. But restore-a-backup-then-repair protects you against
this unlikely case.

=Rob


Re: pycassa failures in large batch cycling

2013-05-17 Thread aaron morton
IMHO you are going to have more success breaking up your work load to work with 
the current settings. 

The buffers created by thrift are going to eat up the server side memory. They 
grow dynamically but persist for the life of the connection. 

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/05/2013, at 3:09 PM, John R. Frank j...@mit.edu wrote:

 On Tue, 14 May 2013, aaron morton wrote:
 
  After several cycles, pycassa starts getting connection failures.
 Do you have the error stack? Are they TimedOutExceptions, socket timeouts,
 or something else?
 
 
 I figured out the problem here and made this ticket in jira:
 
   https://issues.apache.org/jira/browse/CASSANDRA-5575
 
 
 Summary: the Thrift interfaces to Cassandra are simply not able to load large 
 batches without putting the client into an infinite retry loop.
 
 Seems that the only robust solutions involve either features added to Thrift 
 and all Cassandra clients, or a new interface mechanism.
 
 jrf



Re: how to access data only on specific node

2013-05-17 Thread aaron morton
 And about range scan - as far as I understand, range scan could be done only 
 with Order Preserving Partitioner, but not with Random Partitioner.
Range scan can be used with any partitioner. 
If you use it with the RP the rows come back in token order, which looks random.
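As a sketch of such a token-bounded range scan (pure Python; the CQL3 `token()` function is real, but the query strings here are only generated and printed, and nothing talks to a live cluster):

```python
# Sketch: split the RandomPartitioner token range (0 .. 2**127 - 1) into
# slices and emit one token-bounded CQL3 query per slice. Rows returned
# by such a scan are in token order, not key order.

RP_MIN_TOKEN = 0
RP_MAX_TOKEN = 2 ** 127 - 1

def token_slices(start, end, parts):
    """Split the token interval (start, end] into `parts` contiguous sub-intervals."""
    width = (end - start) // parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def slice_queries(cf, lo, hi, parts=4):
    # token() lets a range scan target a token interval under any partitioner.
    return ["SELECT key FROM %s WHERE token(key) > %d AND token(key) <= %d"
            % (cf, a, b) for a, b in token_slices(lo, hi, parts)]

for q in slice_queries("users", RP_MIN_TOKEN, RP_MAX_TOKEN):
    print(q)
```

To examine only one node, substitute that node's assigned token range (e.g. from nodetool ring) for the full-ring bounds.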

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/05/2013, at 7:19 PM, Sergey Naumov sknau...@gmail.com wrote:

 Oh, I finally understand. As I read records one by one they aren't 
 necessarily read from a single node, so if I got 965 records out of 1000, 
 some of them could be read from other nodes which have all of 1000 records.
 
 And about range scan - as far as I understand, range scan could be done only 
 with Order Preserving Partitioner, but not with Random Partitioner... It 
 would be cool to have consistency level of LOCAL to examine content of a 
 local node for test purposes.
 
 
 2013/5/17 aaron morton aa...@thelastpickle.com
 Are you using a multi get or a range slice ? 
 
 Read Repair does not run for range slice queries. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 15/05/2013, at 6:51 PM, Sergey Naumov sknau...@gmail.com wrote:
 
 see that RR works, but sometimes the number of records read degrades. 
 RR is enabled on a random 10% of requests, see the read_repair_chance 
 setting for the CF. 
 
 OK, but I forgot to mention the main thing - each node in my config is a 
 standalone datacenter and distribution is DC1:1, DC2:1, DC3:1. So when I try 
 to read 1000 records with consistency ONE multiple times while connected to 
 node that just have been turned on, I got the following count of records 
 read (approximately): 120 220 310 390  950 960 965 !! 955 !! 970 ... If 
 all other nodes contain 1000 records and read repair already delivered 965 
 records to local DC (and so - local node), why sometimes I see degradation 
 of total records read?
 
 
 
 2013/5/15 aaron morton aa...@thelastpickle.com
 see that RR works, but sometimes the number of records read degrades. 
 RR is enabled on a random 10% of requests, see the read_repair_chance 
 setting for the CF. 
 
  If so, then the question is: how to perform local reads to examine content 
 of specific node?
 You can check which nodes are replicas for a key using nodetool getendpoints
 
 If you want to read all the rows for a particular row you need to use a 
 range scan and limit it by the token ranges assigned to the node. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 14/05/2013, at 10:29 PM, Sergey Naumov sknau...@gmail.com wrote:
 
 Hello.
 
 I'm playing with a demo Cassandra cluster and decided to test read repair + 
 hinted handoff. 
 
 One node of a cluster was put down deliberately, and on the other nodes I 
 inserted some records (say 1000). HH is off on all nodes.
 Then I turned on the node, connected to it with cql (locally, so to 
 localhost) and performed 1000 reads by row key (with consistency ONE). I 
 see that RR works, but sometimes the number of records read degrades. 
 Is it because consistency ONE and local reads are not the same thing? If so, 
 then the question is: how to perform local reads to examine the content of a 
 specific node?
 
 Thanks in advance,
 Sergey Naumov.
 
 
 
 



[BLOG] : Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine

2013-05-17 Thread Brian O'Neill
FWIW, we were able to integrate Druid and Cassandra.

It's only a PoC right now, but it seems like a powerful combination:
http://brianoneill.blogspot.com/2013/05/cassandra-as-deep-storage-mechanism-for.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-17 Thread aaron morton
 But I've read in some sources (for example 
 http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) that seed 
 list MUST include at least one seed from each DC and seed lists should be the 
 same for each node.

That article is about creating a new cluster. To add a DC to an existing 
cluster, do this:

* set the seed list in the new DC to include seeds from both DCs
* later, update the seed list in the old DC to include seeds from both DCs
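As a sketch, the first step is a plain cassandra.yaml change on the new-DC nodes (the addresses below are invented for illustration; 10.1.x.x is the old DC, 10.2.x.x the new one):

```yaml
# cassandra.yaml on a node in the new DC: list seeds from both DCs from
# the start. Old-DC nodes get 10.2.0.1 added to their lists later.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.1.0.1,10.1.0.2,10.2.0.1"
```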

Adding a new DC will normally not happen as often as adding nodes. 

Using the GossipingPropertyFileSnitch means you do not have to update all 
nodes when adding a new one. 

Cheers


-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/05/2013, at 8:42 PM, Igor i...@4friends.od.ua wrote:

 On 05/17/2013 11:19 AM, Sergey Naumov wrote:
 But I've read in some sources (for example 
 http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) that seed 
 list MUST include at least one seed from each DC and seed lists should be 
 the same for each node.
 
 Or it is fine if nodes from new DC have all seeds specified and nodes from 
 old DCs have all seeds specified except seeds from new DC? In such 
 interpretation rules have to be a bit modified:
 
 I never have problems adding new nodes or a new DC while having a single seed 
 per cluster, located in one old DC.
 
 1. Nodes from the same DC should have identical seeds lists.
 2. At least at one DC nodes MUST have in its seed lists seeds from all other 
 DCs.
 
 
 2013/5/17 Igor i...@4friends.od.ua
 I see no reason to restart all nodes. You can continue to use seed from 
 first DC - seed used for loading ring configuration(locations, token ranges, 
 etc), not data. 
 
 On 05/17/2013 10:34 AM, Sergey Naumov wrote:
 If I understand you correctly, GossipingPropertyFileSnitch is useful for 
 manipulations with nodes within a single DC, but to add a new DC without 
 having to restart every node in all DCs (because seeds are specified in 
 cassandra.yaml and I need to restart a node after addition of a new seed 
 from newly created DC), I anyway have to use cassandra-topology.properties 
 and edit it on every node of a cluster.
 
 By the way, is it necessary to specify seeds if I use PropertyFileSnitch 
 and there is info in cassandra-topology.properties about all nodes of a 
 cluster?
 
 
 Yes, it is. Cassandra needs seed(s), because the topology properties have no 
 info about token ranges.
 
 
 
 2013/5/17 aaron morton aa...@thelastpickle.com
 You should configure the seeds as recommended regardless of the snitch used.
 
 You need to update the yaml file to start using the 
 GossipingPropertyFileSnitch but after that it reads the 
 cassandra-rackdc.properties file to get information about the node. It 
 uses the information in gossip to get information about the other 
 nodes in the cluster.
 
 If there is no info in gossip about a remote node, because say it has not 
 been upgraded, it will fall back to using cassandra-topology.properties.
 
 Hope that helps.
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 15/05/2013, at 8:10 PM, Sergey Naumov sknau...@gmail.com wrote:
 
  As far as I understand, GossipingPropertyFileSnitch supposed to provide 
  more flexibility in nodes addition/removal. But what about addition of a 
  DC? In datastax documentation 
  (http://www.datastax.com/docs/1.2/operations/add_replace_nodes#add-dc) it 
  is said that cassandra-topology.properties could be updated without 
  restart for PropertyFileSnitch. But here 
  (http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) it is 
  said that you MUST include at least one node from EACH data center. It 
  is a best practice to have more than one seed node per data center and 
  the seed list should be the same for each node. At the first glance it 
  seems that PropertyFileSnitch will get necessary info from 
  cassandra-topology.properties, but for GossipingPropertyFileSnitch 
  modification of cassandra.yaml and restart of all nodes in all DCs will 
  be required. Could somebody clarify this topic?
 
  Thanks in advance,
  Sergey Naumov.
 
 
 
 
 



Re: Logging Cassandra queries

2013-05-17 Thread aaron morton
 And... could I be more precise when enabling logging? Because right now, with 
 log4j.rootLogger=DEBUG,stdout,R I'm getting a lot of information I won't use 
 ever, and I'd like to enable just what I need to see gets and sets…

see the example at the bottom of this file about setting the log level for a 
single class 
https://github.com/apache/cassandra/blob/trunk/conf/log4j-server.properties

You probably want to set it for the org.apache.cassandra.thrift.CassandraServer 
class. But I cannot remember what the logging is like in 0.8. 
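As a sketch, the per-class override in conf/log4j-server.properties would look like this (the class name matches later 1.x layouts; verify it against your 0.8 tree before relying on it):

```properties
# Keep the root logger quiet and raise only the Thrift request path to
# DEBUG, so gets/sets are logged without the rest of the DEBUG firehose.
log4j.rootLogger=INFO,stdout,R
log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG
```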

Cassandra gets faster in the later versions, which normally means doing less 
work. Upgrading to 1.1 would be the first step I would take in improving 
performance.  

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/05/2013, at 4:00 AM, Tomàs Núnez tomas.nu...@groupalia.com wrote:

 Hi!
 
 For quite time I've been having some unexpected loadavg in the cassandra 
 servers. I suspect there are lots of uncontrolled queries to the cassandra 
 servers causing this load, but the developers say that there are none, and 
 the load is due to cassandra internal processes. 
 
 Trying to get to the bottom, I've been looking into completed ReadStage and 
 MutationStage through JMX, and the numbers seem to confirm my theory, but I'd 
 like to go one step forward and, if possible, list all the queries from the 
 webservers to the cassandra cluster (just one node would be enough). 
 
 I've been playing with cassandra loglevels, and I can see when a Read or a 
 Write is done, but it would be better if I could knew the CF of the query. 
 For my tests I've put log4j.rootLogger=DEBUG,stdout,R in 
 log4j-server.properties, then written and read a test CF, and I can't 
 see the name of it anywhere.
 
 For the test I'm using Cassandra 0.8.4 (yes, still), as my production 
 servers, and also 1.0.11. Maybe this changes in 1.1? Maybe I'm doing 
 something wrong? Any hint?
 
 And... could I be more precise when enabling logging? Because right now, with 
 log4j.rootLogger=DEBUG,stdout,R I'm getting a lot of information I won't use 
 ever, and I'd like to enable just what I need to see gets and sets
 
 Thanks in advance, 
 Tomàs
 



Re: update does not apply to any replica if consistency = ALL and one replica is down

2013-05-17 Thread aaron morton
  one node in DC1 is deliberately down - and, as RF for DC1 is 3, this down 
 node is a replica node for 100% of records),  when I try to insert one record 
 with consistency level of ALL, this insert does not appear on any replica 
This insert will fail to start and the client will get an UnavailableException. 

You are asking for ALL replicas to be available but have disabled one. It's 
easier to use QUORUM for writes and QUORUM for reads. 
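A simplified sketch (not Cassandra's actual implementation) of the replica arithmetic shows why the write is rejected up front at ALL but succeeds at QUORUM with one replica down:

```python
# Sketch of the availability check a coordinator performs before a write
# (simplified, single-DC view; not Cassandra's real code).

def replicas_required(cl, rf):
    # Replicas that must be alive for the consistency level to be met.
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def write_is_available(cl, rf, live_replicas):
    # When too few replicas are alive, an UnavailableException is raised
    # up front and the mutation is sent to no replica at all.
    return live_replicas >= replicas_required(cl, rf)

rf = 3  # e.g. the DC1 replication factor with one node down
print(write_is_available("ALL", rf, live_replicas=2))     # False -> UnavailableException
print(write_is_available("QUORUM", rf, live_replicas=2))  # True  -> write proceeds
```

This matches the behaviour observed: the CL=ALL insert left nothing on any replica, while the CL=QUORUM insert succeeded. An rpc_timeout, by contrast, happens after the write has started, so partial application is still possible in that case.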

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/05/2013, at 6:30 AM, Bryan Talbot btal...@aeriagames.com wrote:

 I think you're conflating "may" with "must".  That article says that updates 
 may still be applied to some replicas when there is a failure and I believe 
 that still is the case.  However, if the coordinator knows that the CL can't 
 be met before even attempting the write, I don't think it will attempt the 
 write.
 
 -Bryan
 
 
 
 On Fri, May 17, 2013 at 1:48 AM, Sergey Naumov sknau...@gmail.com wrote:
 As described here 
 (http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/),
  if consistency level couldn't be met, updates are applied anyway on 
 functional replicas, and they could be propagated later to other replicas 
 using repair mechanisms or by issuing the same request later, as update 
 operations are idempotent in Cassandra.
 
 But... on my configuration (Cassandra 1.2.4, python CQL 1.0.4, DC1 - 3 nodes, 
 DC2 - 3 nodes, DC3 - 1 node, RF={DC1:3, DC2:2, DC3:1}, Random Partitioner, 
 GossipingPropertyFileSnitch, one node in DC1 is deliberately down - and, as 
 RF for DC1 is 3, this down node is a replica node for 100% of records),  when 
 I try to insert one record with consistency level of ALL, this insert does 
 not appear on any replica (-s30 - is a serial of UUID1: 
 001e--1000--x (30 is 1e in hex); -n1 means that we 
 will insert/update a single record with first id from this series - 
 001e--1000--):
 write with consistency ALL:
 cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cALL
 Traceback (most recent call last):
   File ./aux/fastinsert.py, line 54, in insert
 curs.execute(cmd, consistency_level=p.conlvl)
 OperationalError: Unable to complete request: one or more nodes were 
 unavailable.
 Last record UUID is 001e--1000--
 
 about 10 seconds passed...
 
 read with consistency ONE:
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cONE
 Total records read: 0
 Last record UUID is 001e--1000--
 read with consistency QUORUM:
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
 Total records read: 0
 Last record UUID is 001e--1000--
 write with consistency QUORUM:
 cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cQUORUM
 Last record UUID is 001e--1000--
 read with consistency QUORUM:
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
 Total records read: 1
 Last record UUID is 001e--1000--
 
 Is it a new feature of Cassandra that it does not perform a write to any 
 replica if consistency couldn't be satisfied? If so, then is it true for all 
 cases, for example when returned error is OperationalError: Request did not 
 complete within rpc_timeout?
 
 Thanks in advance,
 Sergey Naumov.
 



Re: C language - cassandra

2013-05-17 Thread aaron morton
Mina, 
Could you update this page with your client library ? 
https://wiki.apache.org/cassandra/ClientOptions

Thanks
Aaron

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/05/2013, at 6:00 AM, Mina Naguib mina.nag...@adgear.com wrote:

 
 Hi Apostolis
 
 I'm the author of libcassie, a C library for cassandra that wraps the C++ 
 libcassandra library.  
 
 It's in use in production where I work, however it has not received much 
 traction elsewhere as far as I know.  You can get it here:
 https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7
 
 It has not been updated for a while (for example no CQL support, no pooling 
 support).  I've been waiting for either the thrift C-glib interface to 
 mature, or the thriftless-CQL-binary protocol to mature, before putting 
 effort into updating/rewriting it.  It might however satisfy your needs with 
 its current functionality.
 
 
 
 On 2013-05-17, at 10:42 AM, Apostolis Xekoukoulotakis xekou...@gmail.com 
 wrote:
 
 Hello, new here. What are my options for using Cassandra from a program 
 written in C?
 
 A)
 Thrift has no documentation, so it will take me time to understand.
 Thrift also doesn't have a balancing pool, asking different nodes every time, 
 which is a big problem.
 
 B)
 Should I use the hector (java) client and then send the data to my program 
 with my own protocol? 
 Seems a lot of unnecessary work.
 
 Any other suggestions?
 
 
 -- 
 
 Sincerely yours, 
  Apostolis Xekoukoulotakis
 



Re: best practices on EC2 question

2013-05-17 Thread aaron morton
I was considering that when bootstrapping starts the nodes receive writes, so 
that when the process is complete they have both the data from the streaming 
process and all writes from the time they started, and a repair is not 
needed. Compare bootstrapping a node from a backup, where a (non -pr) repair 
is needed on the node to achieve consistency. In that sense the node has all 
its data when the bootstrap has finished. 

If there is data that is replicated to a single node there is always a risk of 
data loss. The data could have been written in the time between the last backup 
and the node failing. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/05/2013, at 6:32 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, May 17, 2013 at 11:13 AM, aaron morton aa...@thelastpickle.com 
 wrote:
 Bootstrapping a new node into the cluster has a small impact on the existing
 nodes, and the new nodes have all the data they need when they finish the
 process.
 
 Sorry for the pedantry, but bootstrapping from existing replicas
 cannot guarantee that the new nodes have all the data they need when
 they finish the process. There is a non-zero chance that the failed
 node contained the single under-replicated copy of a given datum. In
 practice if your RF is >= 2, you are unlikely to experience this type
 of data loss. But restore-a-backup-then-repair protects you against
 this unlikely case.
 
 =Rob



Re: C++ Thrift client

2013-05-17 Thread Víctor Hugo Oliveira Molinar
Aaron, whenever I get a GCInspector event log, will it mean that I'm
having a GC pause?



*Atenciosamente,*
*Víctor Hugo Molinar - *@vhmolinar http://twitter.com/#!/vhmolinar


On Thu, May 16, 2013 at 8:53 PM, aaron morton aa...@thelastpickle.com wrote:

 (Assuming you have enabled tcp_nodelay on the client socket)

 Check the server side latency, using nodetool cfstats or nodetool
 cfhistograms.

 Check the logs for messages from the GCInspector about ParNew pauses.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 16/05/2013, at 12:58 PM, Bill Hastings bllhasti...@gmail.com wrote:

 Hi All

 I am doing very small inserts into Cassandra in the range of say 64
 bytes. I use a C++ Thrift client and seem consistently get latencies
 anywhere between 35-45 ms. Could some one please advise as to what
 might be happening?

 thanks





Re: pycassa failures in large batch cycling

2013-05-17 Thread John R. Frank
IMHO you are going to have more success breaking up your work load to 
work with the current settings.  The buffers created by thrift are going 
to eat up the server side memory. They grow dynamically but persist for 
the life of the connection. 


Amen to that.  Already refactoring our workload to minimize record sizes.

Smaller fields mean more of them, so batched inserts are even more useful 
than many unbatched inserts.


IMO there is still a serious bug: even with smaller individual records, it 
is trivially easy to put too many small records into a batch_mutate. 
Right now, clients like pycassa, and I imagine others, are forced into an 
infinite retry loop under the hood because the thrift exception is 
indistinguishable from the server crashing --- the application layer has 
no recourse.
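As a sketch of an application-side mitigation (not part of pycassa's API; the retried operation and the exception types are supplied by the caller), retries can at least be bounded so the failure eventually surfaces instead of looping forever:

```python
import time

# Sketch: bound retries around a flaky operation (e.g. a batch_mutate
# call) with exponential backoff, then re-raise so the application layer
# gets its recourse back.

def bounded_retry(op, retries=3, delay=0.01, retry_on=(Exception,)):
    last = None
    for attempt in range(retries):
        try:
            return op()
        except retry_on as exc:
            last = exc
            time.sleep(delay * (2 ** attempt))  # back off before retrying
    raise last  # surface the final failure instead of retrying forever

# Demo with a stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

print(bounded_retry(flaky))  # -> ok, after two failed attempts
```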


I'd love to see a work around that still has the benefit of grouping 
together many inserts.
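One workaround sketch (plain Python; sending each batch through the client library is left out) is to split the row stream into bounded batches before anything reaches batch_mutate, capping both row count and approximate payload size:

```python
# Sketch: chunk a stream of (key, columns) rows into batches capped by
# row count and approximate byte size, so no single batch_mutate call
# grows unbounded. Sizes are rough string-length estimates, not exact
# Thrift frame sizes.

def chunked_batches(rows, max_rows=100, max_bytes=512 * 1024):
    batch, size = [], 0
    for key, cols in rows:
        row_size = len(key) + sum(len(k) + len(v) for k, v in cols.items())
        if batch and (len(batch) >= max_rows or size + row_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append((key, cols))
        size += row_size
    if batch:
        yield batch  # flush the final partial batch

rows = [("key%04d" % i, {"col": "x" * 100}) for i in range(250)]
batches = list(chunked_batches(rows, max_rows=100))
print([len(b) for b in batches])  # -> [100, 100, 50]
```

This keeps most of the grouping benefit: inserts still go out in batches, just never in one batch large enough to trip the server-side limits described in CASSANDRA-5575.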



John

Re: C++ Thrift client

2013-05-17 Thread Sorin Manolache

On 2013-05-16 02:58, Bill Hastings wrote:

Hi All

I am doing very small inserts into Cassandra in the range of say 64
bytes. I use a C++ Thrift client and seem consistently get latencies
anywhere between 35-45 ms. Could some one please advise as to what
might be happening?


Sniff the network traffic in order to check whether you use the same 
connection or you open a new connection for each new insert.


Also check if the client does a set_keyspace (or use keyspace) before 
every insert. That would be wasteful too.


In the worst case, the client would perform an authentication too.

Inspect timestamps of the network packets in the capture file in order 
to determine which part takes too long: the connection phase? The 
authentication? The interval between sending the request and getting the 
response?


I do something similar (C++ Thrift, small inserts of roughly the same 
size as you) and I get response times of 100ms for the first request 
when opening the connection, authenticating, and setting the keyspace. 
But subsequent requests on the same connection have response times in 
the range of 8-11ms.


Sorin



Re: C language - cassandra

2013-05-17 Thread Sorin Manolache

On 2013-05-17 16:42, Apostolis Xekoukoulotakis wrote:

Hello, new here. What are my options for using Cassandra from a program
written in C?

A)
Thrift has no documentation, so it will take me time to understand.
Thrift also doesn't have a balancing pool, asking different nodes every
time, which is a big problem.


Thrift has a sort of documentation. Check interface/cassandra.thrift in 
cassandra's source files. The file contains quite thorough comments for 
each method and data structure. Once you've read this file, it is quite 
easy to browse through the Cassandra.h and cassandra_types.h that are 
generated from cassandra.thrift by the thrift compiler.


Sending requests is quite straightforward. Setting up a connection is 
more verbose and, imo, relatively complex.


About pools, you're right. I guess you'll have to write your own.



B)
Should I use the hector (java) client and then send the data to my
program with my own protocol?
Seems a lot of unnecessary work.

Any other suggestions?


I would go for Thrift. After a day or two of digging you'll have it working.

Sorin




--


Sincerely yours,

  Apostolis Xekoukoulotakis





Re:

2013-05-17 Thread Robert Coli
On Thu, May 16, 2013 at 8:49 PM,  almeida...@yahoo.com wrote:

 hi [attack_url]

Is there anyone taking care of removing these attack spammers from
this list? This is the second such mail in two days.

=Rob