Re: Exploring Simply Queueing

2014-10-06 Thread Jan Algermissen
Chris,

thanks for taking a look.

On 06 Oct 2014, at 04:44, Chris Lohfink clohf...@blackbirdit.com wrote:

 It appears you are aware of the tombstones affect that leads people to label 
 this an anti-pattern.  Without due or any time based value being part of 
 the partition key means you will still get a lot of buildup.  You only have 1 
 partition per shard which just linearly decreases the tombstones.  That isn't 
 likely to be enough to really help in a situation of high queue throughput, 
 especially with the default of 4 shards. 

Yes, dealing with the tombstone effect is the whole point. The workloads I 
have to deal with are not really high throughput; it is unlikely we’ll ever 
reach multiple messages per second. The emphasis is also more on coordinating 
producer and consumer than on high-volume capacity problems.

Your comment seems to suggest including larger time frames (e.g. the due-hour) 
in the partition keys and using the current time to select the active partitions 
(e.g. the shards of the current hour). Once an hour has passed, the corresponding 
shards will never be touched again.

Am I understanding this correctly?
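
Something like this is what I picture, I suppose (a sketch only, table and 
column names made up; the static ’lock’ column is the one from my examples):

CREATE TABLE queue (
    due_hour timestamp,      -- coarse time bucket, part of the partition key
    shard    int,            -- 0..3 with the default of 4 shards
    lock     text static,    -- per-shard lock, see below
    msg_id   timeuuid,
    payload  text,
    PRIMARY KEY ((due_hour, shard), msg_id)
);

-- consumers only ever read the shards of the current hour, so partitions
-- of past hours (and their tombstones) are never touched again:
SELECT msg_id, payload FROM queue
WHERE due_hour = '2014-10-06 14:00:00' AND shard = 0;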

 
 You may want to consider switching to LCS from the default STCS since 
 re-writing to same partitions a lot. It will still use STCS in L0 so in high 
 write/delete scenarios, with low enough gc_grace, when it never gets higher 
 then L1 it will be sameish write throughput. In scenarios where you get more 
 LCS will shine I suspect by reducing number of obsolete tombstones.  Would be 
 hard to identify difference in small tests I think.

Thanks, I’ll try to explore the various effects

 
 Whats the plan to prevent two consumers from reading same message off of a 
 queue?  You mention in docs you will address it at a later point in time but 
 its kinda a biggy.  Big lock & batch reads like astyanax recipe?

I have included a static column per shard to act as a lock (the ’lock’ column 
in the examples) in combination with conditional updates.

I must admit, I have not quite understood what Netflix is doing in terms of 
coordination - but since performance isn’t our concern, CAS should do fine, I 
guess(?)
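
To make that concrete, the locking I have in mind is roughly this (sketch 
only, against the table sketched above; the consumer id is made up):

-- acquire the shard lock; succeeds only if nobody currently holds it
UPDATE queue USING TTL 30
SET lock = 'consumer-1'
WHERE due_hour = '2014-10-06 14:00:00' AND shard = 0
IF lock = null;

-- release it once the batch has been processed
UPDATE queue SET lock = null
WHERE due_hour = '2014-10-06 14:00:00' AND shard = 0
IF lock = 'consumer-1';

The TTL on the lock means a crashed consumer does not hold it forever.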

Thanks again,

Jan


 
 ---
 Chris Lohfink
 
 
 On Oct 5, 2014, at 6:03 PM, Jan Algermissen jan.algermis...@nordsc.com 
 wrote:
 
 Hi,
 
 I have put together some thoughts on realizing simple queues with Cassandra.
 
 https://github.com/algermissen/cassandra-ruby-queue
 
 The design is inspired by (the much more sophisticated) Netfilx approach[1] 
 but very reduced.
 
 Given that I am still a C* newbie, I’d be very glad to hear some thoughts on 
 the design path I took.
 
 Jan
 
 [1] https://github.com/Netflix/astyanax/wiki/Message-Queue
 



Re: Increasing size of Batch of prepared statements

2014-10-06 Thread shahab
Thanks Jens for the comment. Actually I am using the Cassandra Stress Tool, and
it is the tool that inserts such large statements.

But do you mean that inserting columns with a large size (let's say a text
of 20-30 KB) is potentially problematic in Cassandra? What shall I do if I
want columns with large values?

best,
/Shahab

On Sun, Oct 5, 2014 at 6:03 PM, Jens Rantil jens.ran...@tink.se wrote:

 Shabab,
 If you are hitting this limit because you are inserting a lot of (CQL)
 rows in a single batch I suggest you split the statement up in multiple
 smaller batches. Generally, large inserts like this will not perform very
 well.

 Cheers,
 Jens

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Oct 3, 2014 at 6:47 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am getting the following warning in the cassandra log:
  BatchStatement.java:258 - Batch of prepared statements for [mydb.mycf]
 is of size 3272725, exceeding specified threshold of 5120 by 3267605.

 Apparently it relates to the (default) size of prepared  insert statement
 . Is there any way to change the default value?

 thanks
 /Shahab





Re: Exploring Simply Queueing

2014-10-06 Thread Shane Hansen
Sorry if I'm hijacking the conversation, but why in the world would you want
to implement a queue on top of Cassandra? It seems like using a proper
queuing service
would make your life a lot easier.

That being said, there might be a better way to play to the strengths of
C*. Ideally everything you do
is append only with few deletes or updates. So an interesting way to
implement a queue might be
to do one insert to put the job in the queue and another insert to mark the
job as done or in process
or whatever. This would also give you the benefit of being able to replay
the state of the queue.
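
Sketching what I mean (table and values made up, untested):

CREATE TABLE queue_events (
    queue_name text,
    event_time timeuuid,
    job_id     uuid,
    event      text,          -- 'enqueued', 'started', 'done', ...
    PRIMARY KEY (queue_name, event_time)
);

-- enqueueing and completing a job are both plain inserts, no deletes:
INSERT INTO queue_events (queue_name, event_time, job_id, event)
VALUES ('q1', now(), 123e4567-e89b-12d3-a456-426655440000, 'enqueued');

INSERT INTO queue_events (queue_name, event_time, job_id, event)
VALUES ('q1', now(), 123e4567-e89b-12d3-a456-426655440000, 'done');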


On Mon, Oct 6, 2014 at 12:57 AM, Jan Algermissen jan.algermis...@nordsc.com
 wrote:

 Chris,

 thanks for taking a look.

 On 06 Oct 2014, at 04:44, Chris Lohfink clohf...@blackbirdit.com wrote:

  It appears you are aware of the tombstones affect that leads people to
 label this an anti-pattern.  Without due or any time based value being
 part of the partition key means you will still get a lot of buildup.  You
 only have 1 partition per shard which just linearly decreases the
 tombstones.  That isn't likely to be enough to really help in a situation
 of high queue throughput, especially with the default of 4 shards.

 Yes, dealing with the tombstones effect is the whole point. The work loads
 I have to deal with are not really high throughput, it is unlikely we’ll
 ever reach multiple messages per second.The emphasis is also more on
 coordinating producer and consumer than on high volume capacity problems.

 Your comment seems to suggest to include larger time frames (e.g. the
 due-hour) in the partition keys and use the current time to select the
 active partitions (e.g. the shards of the hour). Once an hour has passed,
 the corresponding shards will never be touched again.

 Am I understanding this correctly?

 
  You may want to consider switching to LCS from the default STCS since
 re-writing to same partitions a lot. It will still use STCS in L0 so in
 high write/delete scenarios, with low enough gc_grace, when it never gets
 higher then L1 it will be sameish write throughput. In scenarios where you
 get more LCS will shine I suspect by reducing number of obsolete
 tombstones.  Would be hard to identify difference in small tests I think.

 Thanks, I’ll try to explore the various effects

 
  Whats the plan to prevent two consumers from reading same message off of
 a queue?  You mention in docs you will address it at a later point in time
 but its kinda a biggy.  Big lock  batch reads like astyanax recipe?

 I have included a static column per shard to act as a lock (the ’lock’
 column in the examples) in combination with conditional updates.

 I must admit, I have not quite understood what Netfix is doing in terms of
 coordination - but since performance isn’t our concern, CAS should do fine,
 I guess(?)

 Thanks again,

 Jan


 
  ---
  Chris Lohfink
 
 
  On Oct 5, 2014, at 6:03 PM, Jan Algermissen jan.algermis...@nordsc.com
 wrote:
 
  Hi,
 
  I have put together some thoughts on realizing simple queues with
 Cassandra.
 
  https://github.com/algermissen/cassandra-ruby-queue
 
  The design is inspired by (the much more sophisticated) Netfilx
 approach[1] but very reduced.
 
  Given that I am still a C* newbie, I’d be very glad to hear some
 thoughts on the design path I took.
 
  Jan
 
  [1] https://github.com/Netflix/astyanax/wiki/Message-Queue
 




Re: CQL query throws TombstoneOverwhelmingException against a LeveledCompactionStrategy table

2014-10-06 Thread dlu66061
BTW, I am using Cassandra 2.0.6.

Is this the same as CASSANDRA-6654 (Droppable tombstones are not being
removed from LCS table despite being above 20%,
https://issues.apache.org/jira/browse/CASSANDRA-6654)? I checked my table
in JConsole and the droppable tombstone ratio is over 60%.

If it is of the same cause, does that mean I should switch to
SizeTieredCompactionStrategy?





Re: Exploring Simply Queueing

2014-10-06 Thread Minh Do
Hi Jan,

Both Chris and Shane say what I believe is the correct thinking.

Just to let you know: if you base your implementation on Netflix's queue
recipe, there are many issues with it.

In general, we don't advise people to use that recipe, so I suggest you
save your time by not going down that same route again.


Minh


On Mon, Oct 6, 2014 at 7:34 AM, Shane Hansen shanemhan...@gmail.com wrote:

 Sorry if I'm hijacking the conversation, but why in the world would you
 want
 to implement a queue on top of Cassandra? It seems like using a proper
 queuing service
 would make your life a lot easier.

 That being said, there might be a better way to play to the strengths of
 C*. Ideally everything you do
 is append only with few deletes or updates. So an interesting way to
 implement a queue might be
 to do one insert to put the job in the queue and another insert to mark
 the job as done or in process
 or whatever. This would also give you the benefit of being able to replay
 the state of the queue.


 On Mon, Oct 6, 2014 at 12:57 AM, Jan Algermissen 
 jan.algermis...@nordsc.com wrote:

 Chris,

 thanks for taking a look.

 On 06 Oct 2014, at 04:44, Chris Lohfink clohf...@blackbirdit.com wrote:

  It appears you are aware of the tombstones affect that leads people to
 label this an anti-pattern.  Without due or any time based value being
 part of the partition key means you will still get a lot of buildup.  You
 only have 1 partition per shard which just linearly decreases the
 tombstones.  That isn't likely to be enough to really help in a situation
 of high queue throughput, especially with the default of 4 shards.

 Yes, dealing with the tombstones effect is the whole point. The work
 loads I have to deal with are not really high throughput, it is unlikely
 we’ll ever reach multiple messages per second.The emphasis is also more on
 coordinating producer and consumer than on high volume capacity problems.

 Your comment seems to suggest to include larger time frames (e.g. the
 due-hour) in the partition keys and use the current time to select the
 active partitions (e.g. the shards of the hour). Once an hour has passed,
 the corresponding shards will never be touched again.

 Am I understanding this correctly?

 
  You may want to consider switching to LCS from the default STCS since
 re-writing to same partitions a lot. It will still use STCS in L0 so in
 high write/delete scenarios, with low enough gc_grace, when it never gets
 higher then L1 it will be sameish write throughput. In scenarios where you
 get more LCS will shine I suspect by reducing number of obsolete
 tombstones.  Would be hard to identify difference in small tests I think.

 Thanks, I’ll try to explore the various effects

 
  Whats the plan to prevent two consumers from reading same message off
 of a queue?  You mention in docs you will address it at a later point in
 time but its kinda a biggy.  Big lock  batch reads like astyanax recipe?

 I have included a static column per shard to act as a lock (the ’lock’
 column in the examples) in combination with conditional updates.

 I must admit, I have not quite understood what Netfix is doing in terms
 of coordination - but since performance isn’t our concern, CAS should do
 fine, I guess(?)

 Thanks again,

 Jan


 
  ---
  Chris Lohfink
 
 
  On Oct 5, 2014, at 6:03 PM, Jan Algermissen jan.algermis...@nordsc.com
 wrote:
 
  Hi,
 
  I have put together some thoughts on realizing simple queues with
 Cassandra.
 
  https://github.com/algermissen/cassandra-ruby-queue
 
  The design is inspired by (the much more sophisticated) Netfilx
 approach[1] but very reduced.
 
  Given that I am still a C* newbie, I’d be very glad to hear some
 thoughts on the design path I took.
 
  Jan
 
  [1] https://github.com/Netflix/astyanax/wiki/Message-Queue
 





Re: Exploring Simply Queueing

2014-10-06 Thread Robert Coli
On Mon, Oct 6, 2014 at 8:30 AM, Minh Do m...@netflix.com wrote:

 Just let you know if you base your implementation on Netflix's queue
 recipe, there are many issues with it.

 In general, we don't advise people to use that recipe so I suggest you to
 save your time by not going that same route again.


I +1 people who are saying that this is not a strong case for Cassandra. I
also agree that if you want to do this, you should consider other
approaches.

However, depending on the nature of the queue (low amount of total volume,
etc.), things like this can work just fine in practice:

https://engineering.eventbrite.com/replayable-pubsub-queues-with-cassandra-and-zookeeper/

In theory they can also be designed such that history is not infinite,
which mitigates the buildup of old queue state.

=Rob
http://twitter.com/rcolidba


ConnectionException while trying to connect with Astyanax over Java driver

2014-10-06 Thread Ruchir Jha
All,

I am trying to use the new Astyanax over Java driver to connect to
Cassandra version 1.2.12.

The following settings are turned on in cassandra.yaml:

start_rpc: true
native_transport_port: 9042
start_native_transport: true

*Code to connect:*

final Supplier<List<Host>> hostSupplier = new Supplier<List<Host>>() {

    @Override
    public List<Host> get()
    {
        List<Host> hosts = new ArrayList<Host>();
        for (String hostPort :
                StringUtil.getSetFromDelimitedString(seedHosts, ","))
        {
            String[] pair = hostPort.split(":");
            Host host = new Host(pair[0],
                    Integer.valueOf(pair[1]).intValue());
            host.setRack("rack1");
            hosts.add(host);
        }
        return hosts;
    }
};

// get keyspace
AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
        .forCluster(clusterName)
        .forKeyspace(keyspace)
        .withHostSupplier(hostSupplier)
        .withAstyanaxConfiguration(
                new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)
                        .setDiscoveryDelayInSeconds(6)
                        .setCqlVersion("3.0.0")
                        .setTargetCassandraVersion("1.2.12")
        )
        .withConnectionPoolConfiguration(
                new JavaDriverConfigBuilder().withPort(9042)
                        .build())
        .buildKeyspace(CqlFamilyFactory.getInstance());

context.start();

*Exception in Cassandra Server logs:*

 WARN [New I/O server boss #1 ([id: 0x6815d6c5, /0.0.0.0:9042])] 2014-10-06
11:11:37,826 Slf4JLogger.java (line 82) Failed to accept a connection.
java.lang.NoSuchMethodError:
org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IZ)V
at
org.apache.cassandra.transport.Frame$Decoder.<init>(Frame.java:147)
at
org.apache.cassandra.transport.Server$PipelineFactory.getPipeline(Server.java:232)
at
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.registerAcceptedChannel(NioServerSocketPipelineSink.java:276)
at
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:246)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)


I also tried using the Java driver 2.1.1, but I see the
NoHostAvailableException, and I feel the underlying reason is the same as
when connecting with the Astyanax Java driver.


Re: IN versus multiple asynchronous queries

2014-10-06 Thread Robert Wille
As far as latency is concerned, it seems like it wouldn't matter very much whether 
the coordinator has to wait for all the responses to come back or the client 
waits for all the responses to come back. I’ve got the same latency either way.

I would assume that 50 coordinations is more expensive than one coordination 
that does 50 times the work, but that’s probably insignificant when compared to 
the actual fetching of the data from the SSTables.

I do see the point about putting stress on coordinator memory. In general, the 
documents will be very small, but there will occasionally be some rather large 
ones, potentially several megabytes in size. Definitely better to not make the 
coordinator hold on to that memory while it waits for other requests to come 
back.
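
For reference, the two access patterns being compared look roughly like this 
(table and values made up):

-- one statement: the coordinator assembles all 50 partitions before replying
SELECT doc_id, body FROM documents WHERE doc_id IN (1, 2, 3, 4, 5);  -- ... up to 50 values

-- versus 50 separate statements issued asynchronously by the client,
-- each answered as soon as its own partition has been read:
SELECT doc_id, body FROM documents WHERE doc_id = 1;
SELECT doc_id, body FROM documents WHERE doc_id = 2;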

Robert

On Oct 4, 2014, at 8:34 AM, DuyHai Doan 
doanduy...@gmail.commailto:doanduy...@gmail.com wrote:

Definitely 50 concurrent queries, possibly in async mode.

If you're using the IN clause with 50 values, the coordinator will block, 
waiting for 50 partitions to be fetched from different nodes (worst case = 50 
nodes) before responding to client. In addition to the very  high latency, 
you'll put the stress on the coordinator memory.



On Sat, Oct 4, 2014 at 3:09 PM, Robert Wille 
rwi...@fold3.commailto:rwi...@fold3.com wrote:
I have a table of small documents (less than 1K) that are often accessed 
together as a group. The group size is always less than 50. Which produces less 
load on the server, one query using an IN clause to get all 50 back together, 
or 50 concurrent queries? Which one is fastest?

Thanks

Robert





assertion error on joining

2014-10-06 Thread Kais Ahmed
Hi all,

I'm a bit stuck: I want to expand my C* 2.0.6 cluster, but I encountered an
error on
the new node.

ERROR [FlushWriter:2] 2014-10-06 16:15:35,147 CassandraDaemon.java (line
199) Exception in thread Thread[FlushWriter:2,5,main]
java.lang.AssertionError: 394920
at
org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:342)
at
org.apache.cassandra.db.ColumnIndex$Builder.maybeWriteRowHeader(ColumnIndex.java:201)
at
org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:188)
at
org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:133)
at
org.apache.cassandra.io.sstable.SSTableWriter.rawAppend(SSTableWriter.java:202)
at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:187)
...

This assertion is here :

public static void writeWithShortLength(ByteBuffer buffer, DataOutput out) throws IOException
{
    int length = buffer.remaining();
    --> assert 0 <= length && length <= FBUtilities.MAX_UNSIGNED_SHORT : length;
    out.writeShort(length);
    write(buffer, out); // writing data bytes to output source
}

But I don't know what I can do to complete the bootstrap.

Thanks,


Re: ConnectionException while trying to connect with Astyanax over Java driver

2014-10-06 Thread DuyHai Doan
java.lang.NoSuchMethodError - Jar dependency issue probably. Did you try
to create an issue on the Astyanax GitHub repo?

On Mon, Oct 6, 2014 at 6:01 PM, Ruchir Jha ruchir@gmail.com wrote:

 All,

 I am trying to use the new astyanax over java driver to connect to
 cassandra version 1.2.12,

 Following settings are turned on in cassandra.yaml:

 start_rpc: true
 native_transport_port: 9042
 start_native_transport: true

 *Code to connect:*

 final SupplierListHost hostSupplier = new SupplierListHost() {

 @Override
 public ListHost get()
 {
 ListHost hosts = new ArrayList();
 for(String hostPort :
 StringUtil.getSetFromDelimitedString(seedHosts, ,))
 {
 String[] pair = hostPort.split(:);
 Host host = new Host(pair[0],
 Integer.valueOf(pair[1]).intValue());
 host.setRack(rack1);
 hosts.add(host);
 }
 return hosts;
 }
 };

 // get keyspace
 AstyanaxContextKeyspace context = new AstyanaxContext.Builder()
 .forCluster(clusterName)
 .forKeyspace(keyspace)
 .withHostSupplier(hostSupplier)
 .withAstyanaxConfiguration(
 new AstyanaxConfigurationImpl()

 .setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)

 .setDiscoveryDelayInSeconds(6).setCqlVersion(3.0.0).setTargetCassandraVersion(1.2.12)
 )
 .withConnectionPoolConfiguration(
 new *JavaDriverConfigBuilder*().withPort(9042)
 .build())
 .buildKeyspace(CqlFamilyFactory.getInstance());

 context.start();

 *Exception in Cassandra Server logs:*

  WARN [New I/O server boss #1 ([id: 0x6815d6c5, /0.0.0.0:9042])]
 2014-10-06 11:11:37,826 Slf4JLogger.java (line 82) Failed to accept a
 connection.
 java.lang.NoSuchMethodError:
 org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.init(IZ)V
 at
 org.apache.cassandra.transport.Frame$Decoder.init(Frame.java:147)
 at
 org.apache.cassandra.transport.Server$PipelineFactory.getPipeline(Server.java:232)
 at
 org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.registerAcceptedChannel(NioServerSocketPipelineSink.java:276)
 at
 org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:246)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)


 I also tried using the Java Driver 2.1.1, but I see the
 NoHostAvailableException, and I feel the underlying reason is the same as
 during connecting with astyanax java driver.




Re: IN versus multiple asynchronous queries

2014-10-06 Thread DuyHai Doan
"Definitely better to not make the coordinator hold on to that memory while
it waits for other requests to come back" -- you get it. When loading big
documents, you risk starving the heap quickly, triggering long GC cycles on
the coordinator, etc.

On Mon, Oct 6, 2014 at 6:22 PM, Robert Wille rwi...@fold3.com wrote:

  As far as latency is concerned, it seems like it wouldn't matter very
 much if the coordinator has to wait for all the responses to come back, or
 the client waits for all the responses to come back. I’ve got the same
 latency either way.

  I would assume that 50 coordinations is more expensive than one
 coordination that does 50 times the work, but that’s probably insignificant
 when compared to the actual fetching of the data from the SSTables.

  I do see the point about putting stress on coordinator memory. In
 general, the documents will be very small, but there will occasionally be
 some rather large ones, potentially several megabytes in size. Definitely
 better to not make the coordinator hold on to that memory while it waits
 for other requests to come back.

  Robert

  On Oct 4, 2014, at 8:34 AM, DuyHai Doan doanduy...@gmail.com wrote:

  Definitely 50 concurrent queries, possibly in async mode.

  If you're using the IN clause with 50 values, the coordinator will block,
 waiting for 50 partitions to be fetched from different nodes (worst case =
 50 nodes) before responding to client. In addition to the very  high
 latency, you'll put the stress on the coordinator memory.



 On Sat, Oct 4, 2014 at 3:09 PM, Robert Wille rwi...@fold3.com wrote:

 I have a table of small documents (less than 1K) that are often accessed
 together as a group. The group size is always less than 50. Which produces
 less load on the server, one query using an IN clause to get all 50 back
 together, or 50 concurrent queries? Which one is fastest?

 Thanks

 Robert






RE: Cassandra Data Model design

2014-10-06 Thread Rahul Gupta
You need to rethink your data model for the client_data table.
Unlike an RDBMS, Cassandra relies heavily on the primary key for filtering data.

In fact, filtering on any column other than the primary key is not recommended
in Cassandra.
This means that how you design your primary key is critical.

There are two options in this case:


1.   Use both client_name and is_valid together as the partition key (row key)

2.   Use client_name as the partition key and is_valid as a clustering column, or in 
other words, make a composite primary key from client_name and is_valid (see the sketch below)

Cassandra Data Model Rule: You need to know your query patterns before you 
create a table.
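
For illustration, option 2 could look something like this (a sketch only; 
whether it fits depends on whether client_name is known at read time):

CREATE TABLE client_data (
  client_name text,            -- partition key
  is_valid int,                -- clustering column
  client_id text,
  creation_date timestamp,
  last_modified_date timestamp,
  PRIMARY KEY (client_name, is_valid)
);

-- served entirely by the primary key, no secondary index or filtering needed:
SELECT client_name, client_id FROM client_data
WHERE client_name = 'some client' AND is_valid = 1;

If the query really has to return all valid clients without knowing their 
names, then is_valid (possibly combined with a bucket) would have to be part 
of the partition key instead.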

Rahul Gupta

From: Check Peck [mailto:comptechge...@gmail.com]
Sent: Wednesday, September 17, 2014 4:01 PM
To: user
Subject: Cassandra Data Model design

I have recently started working with Cassandra. We have cassandra cluster which 
is using DSE 4.0 version and has VNODES enabled. We have a tables like this -

Below is my first table -

CREATE TABLE customers (
  customer_id int PRIMARY KEY,
  last_modified_date timeuuid,
  customer_value text
)

Read query pattern is like this on above table as of now since we need to get 
everything from above table and load it into our application memory every x 
minutes.

select customer_id, customer_value from datakeyspace.customers;

We have second table like this -

CREATE TABLE client_data (
  client_name text PRIMARY KEY,
  client_id text,
  creation_date timestamp,
  is_valid int,
  last_modified_date timestamp
)

Right now in the above table, we have 500 records and all those records has 
is_valid column value set as 1. And the read query pattern is like this on 
above table as of now since we need to get everything from above table and load 
it into our application memory every x minutes so the below query will return 
me all 500 records since everything has is_valid set to 1.

select client_name, client_id from  datakeyspace.client_data where 
is_valid=1;

Since our cluster is VNODES enabled so my above query pattern is not efficient 
at all and it is taking lot of time to get the data from Cassandra. We are 
reading from these table with consistency level QUORUM.

Is there any possibility of improving our data model?

Any suggestions will be greatly appreciated.






Re: ConnectionException while trying to connect with Astyanax over Java driver

2014-10-06 Thread Ruchir Jha
That exception is on the cassandra server and not on the client.

On Mon, Oct 6, 2014 at 2:10 PM, DuyHai Doan doanduy...@gmail.com wrote:

 java.lang.NoSuchMethodError - Jar dependency issue probably. Did you try
 to create an issue on the Astyanax github repo ?

 On Mon, Oct 6, 2014 at 6:01 PM, Ruchir Jha ruchir@gmail.com wrote:

 All,

 I am trying to use the new astyanax over java driver to connect to
 cassandra version 1.2.12,

 Following settings are turned on in cassandra.yaml:

 start_rpc: true
 native_transport_port: 9042
 start_native_transport: true

 *Code to connect:*

 final SupplierListHost hostSupplier = new SupplierListHost() {

 @Override
 public ListHost get()
 {
 ListHost hosts = new ArrayList();
 for(String hostPort :
 StringUtil.getSetFromDelimitedString(seedHosts, ,))
 {
 String[] pair = hostPort.split(:);
 Host host = new Host(pair[0],
 Integer.valueOf(pair[1]).intValue());
 host.setRack(rack1);
 hosts.add(host);
 }
 return hosts;
 }
 };

 // get keyspace
 AstyanaxContextKeyspace context = new AstyanaxContext.Builder()
 .forCluster(clusterName)
 .forKeyspace(keyspace)
 .withHostSupplier(hostSupplier)
 .withAstyanaxConfiguration(
 new AstyanaxConfigurationImpl()

 .setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)

 .setDiscoveryDelayInSeconds(6).setCqlVersion(3.0.0).setTargetCassandraVersion(1.2.12)
 )
 .withConnectionPoolConfiguration(
 new *JavaDriverConfigBuilder*().withPort(9042)
 .build())
 .buildKeyspace(CqlFamilyFactory.getInstance());

 context.start();

 *Exception in Cassandra Server logs:*

  WARN [New I/O server boss #1 ([id: 0x6815d6c5, /0.0.0.0:9042])]
 2014-10-06 11:11:37,826 Slf4JLogger.java (line 82) Failed to accept a
 connection.
 java.lang.NoSuchMethodError:
 org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.init(IZ)V
 at
 org.apache.cassandra.transport.Frame$Decoder.init(Frame.java:147)
 at
 org.apache.cassandra.transport.Server$PipelineFactory.getPipeline(Server.java:232)
 at
 org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.registerAcceptedChannel(NioServerSocketPipelineSink.java:276)
 at
 org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:246)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)


 I also tried using the Java Driver 2.1.1, but I see the
 NoHostAvailableException, and I feel the underlying reason is the same as
 during connecting with astyanax java driver.





Re: Exploring Simply Queueing

2014-10-06 Thread Jan Algermissen
Shane,

On 06 Oct 2014, at 16:34, Shane Hansen shanemhan...@gmail.com wrote:

 Sorry if I'm hijacking the conversation, but why in the world would you want
 to implement a queue on top of Cassandra? It seems like using a proper 
 queuing service
 would make your life a lot easier.

Agreed - however, the use case simply does not justify the additional 
operations.

 
 That being said, there might be a better way to play to the strengths of C*. 
 Ideally everything you do
 is append only with few deletes or updates. So an interesting way to 
 implement a queue might be
 to do one insert to put the job in the queue and another insert to mark the 
 job as done or in process
 or whatever. This would also give you the benefit of being able to replay the 
 state of the queue.

Thanks, I’ll try that, too.

Jan


 
 
 On Mon, Oct 6, 2014 at 12:57 AM, Jan Algermissen jan.algermis...@nordsc.com 
 wrote:
 Chris,
 
 thanks for taking a look.
 
 On 06 Oct 2014, at 04:44, Chris Lohfink clohf...@blackbirdit.com wrote:
 
  It appears you are aware of the tombstones affect that leads people to 
  label this an anti-pattern.  Without due or any time based value being 
  part of the partition key means you will still get a lot of buildup.  You 
  only have 1 partition per shard which just linearly decreases the 
  tombstones.  That isn't likely to be enough to really help in a situation 
  of high queue throughput, especially with the default of 4 shards.
 
 Yes, dealing with the tombstones effect is the whole point. The work loads I 
 have to deal with are not really high throughput, it is unlikely we’ll ever 
 reach multiple messages per second.The emphasis is also more on coordinating 
 producer and consumer than on high volume capacity problems.
 
 Your comment seems to suggest to include larger time frames (e.g. the 
 due-hour) in the partition keys and use the current time to select the active 
 partitions (e.g. the shards of the hour). Once an hour has passed, the 
 corresponding shards will never be touched again.
 
 Am I understanding this correctly?
 
 
  You may want to consider switching to LCS from the default STCS since 
  re-writing to same partitions a lot. It will still use STCS in L0 so in 
  high write/delete scenarios, with low enough gc_grace, when it never gets 
  higher then L1 it will be sameish write throughput. In scenarios where you 
  get more LCS will shine I suspect by reducing number of obsolete 
  tombstones.  Would be hard to identify difference in small tests I think.
 
 Thanks, I’ll try to explore the various effects
 
 
  Whats the plan to prevent two consumers from reading same message off of a 
  queue?  You mention in docs you will address it at a later point in time 
  but its kinda a biggy.  Big lock  batch reads like astyanax recipe?
 
 I have included a static column per shard to act as a lock (the ’lock’ column 
 in the examples) in combination with conditional updates.
 
 I must admit, I have not quite understood what Netfix is doing in terms of 
 coordination - but since performance isn’t our concern, CAS should do fine, I 
 guess(?)
 
 Thanks again,
 
 Jan
 
 
 
  ---
  Chris Lohfink
 
 
  On Oct 5, 2014, at 6:03 PM, Jan Algermissen jan.algermis...@nordsc.com 
  wrote:
 
  Hi,
 
  I have put together some thoughts on realizing simple queues with 
  Cassandra.
 
  https://github.com/algermissen/cassandra-ruby-queue
 
  The design is inspired by (the much more sophisticated) Netfilx 
  approach[1] but very reduced.
 
  Given that I am still a C* newbie, I’d be very glad to hear some thoughts 
  on the design path I took.
 
  Jan
 
  [1] https://github.com/Netflix/astyanax/wiki/Message-Queue
 
 
 



Re: Exploring Simply Queueing

2014-10-06 Thread Jan Algermissen
Robert,

On 06 Oct 2014, at 17:50, Robert Coli rc...@eventbrite.com wrote:

 In theory they can also be designed such that history is not infinite, which 
 mitigates the buildup of old queue state.
 

Hmm, I was under the impression that issues with old queue state disappear 
after gc_grace_seconds and that the goal is primarily to keep the rows ‘short’ 
enough to achieve a tombstone read-performance impact that one can live with 
in a given use case.

Is that understanding wrong?

Jan




Re: Exploring Simply Queueing

2014-10-06 Thread Ranjib Dey
I want to answer the first question, why one might use Cassandra as a queuing
solution:
 - it's the only open-source distributed persistence layer (i.e. no SPOF)
that you can run over a WAN and that provides LAN/WAN-specific quorum controls.
I know it's suboptimal, as the deletions impose additional
compaction/repair penalties, but there is no other solution I am aware of.
Most AMQP solutions are broker-based and clustering is a pain, while things
like Riak only support WAN-based clusters in their commercial offering. I
would love to know about other alternatives.

And thanks for sharing the Ruby-based priority queue prototype; it helps
people like me (sysadmin :-) ) explore these concepts better.

cheers
ranjib

On Mon, Oct 6, 2014 at 1:35 PM, Jan Algermissen jan.algermis...@nordsc.com
wrote:

 Shane,

 On 06 Oct 2014, at 16:34, Shane Hansen shanemhan...@gmail.com wrote:

  Sorry if I'm hijacking the conversation, but why in the world would you
 want
  to implement a queue on top of Cassandra? It seems like using a proper
 queuing service
  would make your life a lot easier.

 Agreed - however, the use case simply does not justify the additional
 operations.

 
  That being said, there might be a better way to play to the strengths of
 C*. Ideally everything you do
  is append only with few deletes or updates. So an interesting way to
 implement a queue might be
  to do one insert to put the job in the queue and another insert to mark
 the job as done or in process
  or whatever. This would also give you the benefit of being able to
 replay the state of the queue.

 Thanks, I’ll try that, too.

 Jan


 
 
  On Mon, Oct 6, 2014 at 12:57 AM, Jan Algermissen 
 jan.algermis...@nordsc.com wrote:
  Chris,
 
  thanks for taking a look.
 
  On 06 Oct 2014, at 04:44, Chris Lohfink clohf...@blackbirdit.com
 wrote:
 
   It appears you are aware of the tombstones affect that leads people to
 label this an anti-pattern.  Without due or any time based value being
 part of the partition key means you will still get a lot of buildup.  You
 only have 1 partition per shard which just linearly decreases the
 tombstones.  That isn't likely to be enough to really help in a situation
 of high queue throughput, especially with the default of 4 shards.
 
  Yes, dealing with the tombstones effect is the whole point. The work
 loads I have to deal with are not really high throughput, it is unlikely
 we’ll ever reach multiple messages per second.The emphasis is also more on
 coordinating producer and consumer than on high volume capacity problems.
 
  Your comment seems to suggest to include larger time frames (e.g. the
 due-hour) in the partition keys and use the current time to select the
 active partitions (e.g. the shards of the hour). Once an hour has passed,
 the corresponding shards will never be touched again.
 
  Am I understanding this correctly?
 
  
   You may want to consider switching to LCS from the default STCS since
 re-writing to same partitions a lot. It will still use STCS in L0 so in
 high write/delete scenarios, with low enough gc_grace, when it never gets
 higher then L1 it will be sameish write throughput. In scenarios where you
 get more LCS will shine I suspect by reducing number of obsolete
 tombstones.  Would be hard to identify difference in small tests I think.
 
  Thanks, I’ll try to explore the various effects
 
  
   Whats the plan to prevent two consumers from reading same message off
 of a queue?  You mention in docs you will address it at a later point in
 time but its kinda a biggy.  Big lock  batch reads like astyanax recipe?
 
  I have included a static column per shard to act as a lock (the ’lock’
 column in the examples) in combination with conditional updates.
 
  I must admit, I have not quite understood what Netfix is doing in terms
 of coordination - but since performance isn’t our concern, CAS should do
 fine, I guess(?)
 
  Thanks again,
 
  Jan
 
 
  
   ---
   Chris Lohfink
  
  
   On Oct 5, 2014, at 6:03 PM, Jan Algermissen 
 jan.algermis...@nordsc.com wrote:
  
   Hi,
  
   I have put together some thoughts on realizing simple queues with
 Cassandra.
  
   https://github.com/algermissen/cassandra-ruby-queue
  
   The design is inspired by (the much more sophisticated) Netfilx
 approach[1] but very reduced.
  
   Given that I am still a C* newbie, I’d be very glad to hear some
 thoughts on the design path I took.
  
   Jan
  
   [1] https://github.com/Netflix/astyanax/wiki/Message-Queue
  
 
 




Bitmaps

2014-10-06 Thread Eduardo Cusa
Hi guys, what data type do you recommend for storing bitmaps?
I am planning to store bitmaps of length 90,000,000 and then query by key.

Example:

key : 22_ES
bitmap : 10101101010111010101011



Thanks
Eduardo


Re: Exploring Simply Queueing

2014-10-06 Thread Robert Coli
On Mon, Oct 6, 2014 at 1:40 PM, Jan Algermissen jan.algermis...@nordsc.com
wrote:

 Hmm, I was under the impression that issues with old queue state disappear
 after gc_grace_seconds and that the goal primarily is to keep the rows
 ‘short’ enough to achieve a tombstones read performance impact that one can
 live with in a given use case.


The design I pasted a link to does not include specifics regarding
pruning old history. Yes, you can just delete it, if your system design
doesn't require replay from the start.

=Rob


Re: Bitmaps

2014-10-06 Thread Russell Bradberry
I highly recommend against storing data structures like this in C*. That
really isn't its sweet spot. For instance, if you were to use the blob
type, which will give you the smallest size, you are still looking at a cell
size of (90,000,000 / 8 / 1024) ≈ 10,986 KB, or over 10 MB, which is
prohibitively large.

Additionally, there is no way to modify the bitmap in place; you would have
to read the entire structure out and write it back in.

You could store one bit per cell, but that would essentially defeat the
purpose of the bitmap's compact size.

On Mon, Oct 6, 2014 at 4:46 PM, Eduardo Cusa 
eduardo.c...@usmediaconsulting.com wrote:

 Hi Guys, what data type recommend to store bitmaps?
 I am planning to store maps of 90,000,000 length and then query by key.

 Example:

 key : 22_ES
 bitmap : 10101101010111010101011



 Thanks
 Eduardo





Re: Indexes Fragmentation

2014-10-06 Thread Robert Coli
On Fri, Oct 3, 2014 at 6:03 PM, Arthur Zubarev arthur.zuba...@aol.com
wrote:

 I now see I had misspelled the word tall for toll, anyways, if I
 understood correctly, your reply implies there is no impact whatsoever and
 there is no need to defrug indexes of the frequently changing columns.


Cases with lots of secondary indexes which have a lot of churn are not
well suited for a database with immutable datafiles which wants to be
accessed by Primary Key.

The fragmentation is really bad, because the data files are immutable and
you have a lot of churn. Probably don't do it?

=Rob


Re: Bitmaps

2014-10-06 Thread DuyHai Doan
Isn't there a video from Ooyala at some past Cassandra Summit demonstrating
usage of Cassandra for text search using trigrams? AFAIK they were storing a
kind of bitmap to perform OR & AND operations on trigrams

On Mon, Oct 6, 2014 at 10:53 PM, Russell Bradberry rbradbe...@gmail.com
wrote:

 I highly recommend against storing data structures like this in C*. That
 really isn't it's sweet spot.  For instance, if you were to use the blob
 type which will give you the smallest size, you are still looking at a cell
 size of (90,000,000/8/1024) = 10,986 or over 10MB in size, which is
 prohibitively large.

 Additionally, there is no way to modify the bitmap in place, you would
 have to read the entire structure out and write it back in.

 You could store one bit per cell, but that would essentially defeat the
 purpose of the bitmap's compact size.

 On Mon, Oct 6, 2014 at 4:46 PM, Eduardo Cusa 
 eduardo.c...@usmediaconsulting.com wrote:

 Hi Guys, what data type recommend to store bitmaps?
 I am planning to store maps of 90,000,000 length and then query by key.

 Example:

 key : 22_ES
 bitmap : 10101101010111010101011



 Thanks
 Eduardo






Re: Bitmaps

2014-10-06 Thread graham sanderson
You certainly have plenty of freedom to trade off size vs access granularity 
using multiple blobs. It really depends on how mutable the data is, how you 
intend to read it, whether it is highly sparse and/or highly dense (in which 
cases you perhaps don’t need to store every bit), etc.
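
For example, something along these lines (sizes and names purely illustrative):

CREATE TABLE bitmaps (
    key   text,
    chunk int,       -- e.g. bit_index / (65536 * 8) for 64 KB chunks
    bits  blob,
    PRIMARY KEY (key, chunk)
);

-- read back only the chunk(s) covering the bit range you need:
SELECT bits FROM bitmaps WHERE key = '22_ES' AND chunk = 171;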

On Oct 6, 2014, at 3:56 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Isn't there a video of Ooyala at some past Cassandra Summit demonstrating 
 usage of Cassandra for text search using Trigram ? AFAIK they were storing 
 kind of bitmap to perform OR  AND operations on trigram
 
 On Mon, Oct 6, 2014 at 10:53 PM, Russell Bradberry rbradbe...@gmail.com 
 wrote:
 I highly recommend against storing data structures like this in C*. That 
 really isn't it's sweet spot.  For instance, if you were to use the blob type 
 which will give you the smallest size, you are still looking at a cell size 
 of (90,000,000/8/1024) = 10,986 or over 10MB in size, which is prohibitively 
 large.
 
 Additionally, there is no way to modify the bitmap in place, you would have 
 to read the entire structure out and write it back in.
 
 You could store one bit per cell, but that would essentially defeat the 
 purpose of the bitmap's compact size. 
 
 On Mon, Oct 6, 2014 at 4:46 PM, Eduardo Cusa 
 eduardo.c...@usmediaconsulting.com wrote:
 Hi Guys, what data type recommend to store bitmaps?
 I am planning to store maps of 90,000,000 length and then query by key.
 
 Example:
 
 key : 22_ES
 bitmap : 10101101010111010101011
 
 
 
 Thanks 
 Eduardo
 
 
 
 





Re: Bitmaps

2014-10-06 Thread Peter Sanford
On Mon, Oct 6, 2014 at 1:56 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Isn't there a video of Ooyala at some past Cassandra Summit demonstrating
 usage of Cassandra for text search using Trigram ? AFAIK they were storing
 kind of bitmap to perform OR  AND operations on trigram


That sounds like the talk Matt Stump gave at the 2013 SF Summit.

Video:  https://www.youtube.com/watch?v=E92u4FXGiAM
Slides: http://www.slideshare.net/planetcassandra/1-matt-stump


Re: Bitmaps

2014-10-06 Thread DuyHai Doan
Yes this one, not Ooyala sorry. Very inventive usage of C* indeed. Thanks
for the links

On Mon, Oct 6, 2014 at 11:01 PM, Peter Sanford psanf...@retailnext.net
wrote:

 On Mon, Oct 6, 2014 at 1:56 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Isn't there a video of Ooyala at some past Cassandra Summit demonstrating
 usage of Cassandra for text search using Trigram ? AFAIK they were storing
 kind of bitmap to perform OR  AND operations on trigram


 That sounds like the talk Matt Stump gave at the 2013 SF Summit.

 Video:  https://www.youtube.com/watch?v=E92u4FXGiAM
 Slides: http://www.slideshare.net/planetcassandra/1-matt-stump



Dynamic schema modification an anti-pattern?

2014-10-06 Thread Todd Fast
There is a team at my work building an entity-attribute-value (EAV) store
using Cassandra. There is a column family, called Entity, where the
partition key is the UUID of the entity, and the columns are the attribute
names with their values. Each entity will contain hundreds to thousands of
attributes, out of a list of up to potentially ten thousand known attribute
names.

However, instead of using wide rows with dynamic columns (and serializing
type info with the value), they are trying to use a static column family
and to modify the schema dynamically as new named attributes are created.

(I believe one of the main drivers of this approach is to use collection
columns for certain attributes, and perhaps to preserve type metadata for a
given attribute.)
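
To make the contrast concrete (sketched by me, not their actual schema):

-- wide-row / dynamic-column style: one fixed schema, attribute names as data
CREATE TABLE entity (
    entity_id  uuid,
    attr_name  text,
    attr_value text,        -- value serialized, type info carried alongside
    attr_type  text,
    PRIMARY KEY (entity_id, attr_name)
);

-- the statically-typed style being proposed instead: every new attribute name
-- becomes a schema change on a table that accumulates thousands of columns
ALTER TABLE entity_static ADD some_new_attribute text;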

This approach goes against everything I've seen and done in Cassandra, and
is generally an anti-pattern for most persistence stores, but I want to
gather feedback before taking the next step with the team.

Do others consider this approach an anti-pattern, and if so, what are the
practical downsides?

For one, this means that the Entity schema would contain the superset of
all columns for all rows. What is the impact of having thousands of column
names in the schema? And what are the implications of modifying the schema
dynamically on a decent sized cluster (5 nodes now, growing to 10s later)
under load?

Thanks,
Todd