Re: Cassandra Writes Duplicated/Concatenated List Data

2017-08-16 Thread Sagar Jambhulkar
What is your query to fetch rows? Can you share pk1, pk2, time for the sample
rows you pasted?

On 17-Aug-2017 2:20 AM, "Nathan McLean"  wrote:

> Hello All,
>
> I have a Cassandra cluster with a table similar to the following:
>
> ```
> CREATE TABLE table (
> pk1 text,
> pk2 int,
> time timestamp,
> ...
> probability list,
> PRIMARY KEY ((pk1, pk2), time)
> ) WITH CLUSTERING ORDER BY (time DESC)
> ```
>
> Python processes write to this table using the DataStax python Cassandra
> driver package. I am occasionally seeing rows written to the table where
> the "probability" column list is the same list, duplicated and concatenated.
>
> e.g.
>
> probability
> ---
> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>  3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>
> The code that writes to Cassandra uses "INSERT" statements and validates
> that "probability" lists must always approximately sum to 1.0, so it does
> not seem possible that the python code that writes to Cassandra has a bug
> which is generating this data. The code may occasionally write to the same
> row multiple times.
>
> It appears that there may be a bug in either Cassandra or the python
> driver package which results in this list column being written to and
> appended to with the same data.
>
> Similar invalid data was also generated by a PySpark data migration script
> (using the DataStax spark Cassandra connector) that copied this list data
> to a new table.
>
> Here are the versions of libraries we are using:
>
> Cassandra version 3.6
> Spark version 1.6.0-hadoop2.6
> Python Cassandra driver 3.7.1
> (https://github.com/datastax/python-driver)
>
> Any help/insight into this problem would be greatly appreciated.
>
> Regards,
>
> Nathan
>


Re: Cassandra Writes Duplicated/Concatenated List Data

2017-08-16 Thread Christophe Schmitz
Hi Nathan,


> The code may occasionally write to the same row multiple times.
Can you run a test using IF NOT EXISTS in your inserts to see if that makes
a difference? It shouldn't, but I don't see what else the problem might be at
the moment.


-- 


*Christophe Schmitz*
*Director of Consulting EMEA*


Re: Full table scan with cassandra

2017-08-16 Thread Dor Laor
Hi Alex,

You probably didn't get the parallelism right. A serial scan has
a parallelism of one. If the parallelism isn't large enough, performance will be
slow.
If the parallelism is too large, Cassandra and the disk will thrash and have too
many context switches.

So you need to find your cluster's sweet spot. We documented the procedure
to do it in this blog:
http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and the results are here:
http://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/
The algorithm should translate to Cassandra, but you'll have to use
different rules of thumb.
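The mechanics in Cassandra terms: split the full token ring into many sub-ranges and have the client (or each Spark task) run one range query per sub-range concurrently. A sketch of the per-range query shape, assuming a Murmur3 ring and a table my_table with partition key pk (both names hypothetical):

```
SELECT * FROM my_table
WHERE token(pk) > -9223372036854775808 AND token(pk) <= -4611686018427387904;

-- ... one such statement per sub-range, issued in parallel,
-- until the ranges cover the whole ring ...
```

The number of sub-ranges in flight at once is the parallelism knob described above.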

Best,
Dor


On Wed, Aug 16, 2017 at 9:50 AM, Alex Kotelnikov <
alex.kotelni...@diginetica.com> wrote:

> Hey,
>
> we are trying Cassandra as an alternative for storing a huge stream of data
> coming from our customers.
>
> Storing works quite well, and I started to validate how retrieval performs. We
> have two types of retrieval: fetching specific records and bulk retrieval for
> general analysis.
> Fetching a single record works like a charm, but it is not so with bulk fetches.
>
> With a moderately small table of ~2 million records (~10 GB of raw data) I
> observed very slow operation (using token(partition key) ranges). It takes
> minutes to perform a full retrieval. We tried a couple of configurations
> using virtual machines and real hardware, and overall it looks like it is not
> possible to retrieve all table data in a reasonable time (by reasonable I mean
> that, since we have a 1 Gbit network, 10 GB can be transferred in a couple of
> minutes from one server to another, so with 10+ Cassandra servers and 10+
> Spark executors the total time should be even smaller).
>
> I tried the DataStax Spark connector. I also wrote a simple test case using the
> DataStax Java driver and saw that a fetch of 10k records takes ~10s, so I
> assume that a "sequential" scan will take 200x more time, i.e. ~30 minutes.
>
> Maybe we are totally wrong in trying to use Cassandra this way?
>
> --
>
> Best Regards,
>
>
> *Alexander Kotelnikov*
>
> *Team Lead*
>
> DIGINETICA
> Retail Technology Company
>
> m: +7.921.915.06.28
>
> *www.diginetica.com*
>


Re: Full table scan with cassandra

2017-08-16 Thread Ben Bromhead
Apache Cassandra is not great in terms of performance at the moment for
batch analytics workloads that require a full table scan. I would look at
FiloDB for all the benefits and familiarity of Cassandra with better
streaming and analytics performance: https://github.com/filodb/FiloDB

There are also some outstanding tickets around improving bulk reads in
Cassandra (see https://issues.apache.org/jira/browse/CASSANDRA-9259 for the
full gory details), but the work appears to have been abandoned by the initial
set of contributors.

On Wed, 16 Aug 2017 at 09:51 Alex Kotelnikov 
wrote:

> Hey,
>
> we are trying Cassandra as an alternative for storing a huge stream of data
> coming from our customers.
>
> Storing works quite well, and I started to validate how retrieval performs. We
> have two types of retrieval: fetching specific records and bulk retrieval for
> general analysis.
> Fetching a single record works like a charm, but it is not so with bulk fetches.
>
> With a moderately small table of ~2 million records (~10 GB of raw data) I
> observed very slow operation (using token(partition key) ranges). It takes
> minutes to perform a full retrieval. We tried a couple of configurations
> using virtual machines and real hardware, and overall it looks like it is not
> possible to retrieve all table data in a reasonable time (by reasonable I mean
> that, since we have a 1 Gbit network, 10 GB can be transferred in a couple of
> minutes from one server to another, so with 10+ Cassandra servers and 10+
> Spark executors the total time should be even smaller).
>
> I tried the DataStax Spark connector. I also wrote a simple test case using the
> DataStax Java driver and saw that a fetch of 10k records takes ~10s, so I
> assume that a "sequential" scan will take 200x more time, i.e. ~30 minutes.
>
> Maybe we are totally wrong in trying to use Cassandra this way?
>
> --
>
> Best Regards,
>
>
> *Alexander Kotelnikov*
>
> *Team Lead*
>
> DIGINETICA
> Retail Technology Company
>
> m: +7.921.915.06.28
>
> *www.diginetica.com*
>
-- 
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


Cassandra Writes Duplicated/Concatenated List Data

2017-08-16 Thread Nathan McLean
Hello All,

I have a Cassandra cluster with a table similar to the following:

```
CREATE TABLE table (
pk1 text,
pk2 int,
time timestamp,
...
probability list,
PRIMARY KEY ((pk1, pk2), time)
) WITH CLUSTERING ORDER BY (time DESC)
```

Python processes write to this table using the DataStax python Cassandra
driver package. I am occasionally seeing rows written to the table where
the "probability" column list is the same list, duplicated and concatenated.

e.g.

probability
---
[3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
3.7058e-13, 9.2127e-09, 0.000141, 0.999859]

The code that writes to Cassandra uses "INSERT" statements and validates
that "probability" lists must always approximately sum to 1.0, so it does
not seem possible that the python code that writes to Cassandra has a bug
which is generating this data. The code may occasionally write to the same
row multiple times.
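
For context, a rough sketch of the two CQL write shapes at play here (probability_table stands in for the real table name and the values are made up): a plain INSERT replaces the whole list, whereas an append-style UPDATE concatenates onto it.

```
-- What the writer code uses: overwrites the probability list for the row.
INSERT INTO probability_table (pk1, pk2, time, probability)
VALUES ('some-key', 1, '2017-08-16 00:00:00+0000', [0.000141, 0.999859]);

-- An append; if something like this ran twice, the list would be concatenated.
UPDATE probability_table
SET probability = probability + [0.000141, 0.999859]
WHERE pk1 = 'some-key' AND pk2 = 1 AND time = '2017-08-16 00:00:00+0000';
```

The code only ever issues the first form, which is why the concatenated rows are surprising.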

It appears that there may be a bug in either Cassandra or the python driver
package which results in this list column being written to and appended to
with the same data.

Similar invalid data was also generated by a PySpark data migration script
(using the DataStax spark Cassandra connector) that copied this list data
to a new table.

Here are the versions of libraries we are using:

Cassandra version 3.6
Spark version 1.6.0-hadoop2.6
Python Cassandra driver 3.7.1
(https://github.com/datastax/python-driver)

Any help/insight into this problem would be greatly appreciated.

Regards,

Nathan


Re: live dsc upgrade from 2.0 to 2.1 behind the scenes

2017-08-16 Thread Park Wu
Thank you, Erick, for the advice. It's a great help. - Park

On Tuesday, August 15, 2017 3:42 AM, Erick Ramirez  
wrote:
 

 1) You should not perform any streaming operations (repair, bootstrap, 
decommission) in the middle of an upgrade. Note that an upgrade is not complete 
until you have completed upgradesstables on all nodes in the cluster.
2) No streaming involved with writes so it's not an issue.
3) It doesn't matter whichever way you do it. My personal experience is that 
it's best to do it as you go but YMMV.
4) It depends on a lot of factors including type of disks (e.g. SSDs vs HDDs), 
data model, access patterns, cluster load, etc. The only way you'll be able to 
estimate it is by running your own tests.
5) There is no "max" time but it is preferable that you complete the upgrades 
in the shortest amount of time. Until you have completed upgradesstables on all 
nodes, there is a performance hit with reading older generations of sstables. 
I'm sure you're about to ask "how much perf hit?" and the answer is "test it".
6) It is not advisable to perform schema changes in mixed mode -- the schema 
version on upgraded nodes is different, and there will be a mismatch until all 
nodes are upgraded.
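As a quick way to see the mismatch, a sketch of how to check the schema version each node reports (run on each node, e.g. via cqlsh):

```
SELECT schema_version FROM system.local;
SELECT peer, schema_version FROM system.peers;
```

nodetool describecluster reports the same information, grouped by schema version.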
Good luck!
On Mon, Aug 14, 2017 at 12:14 PM, Park Wu  wrote:

Hi, folks: I am planning to upgrade our production from DSC 2.0.16 to 2.1.18 
for 2 DCs (20 nodes each, 600 GB per node). A few questions:

1) What happens during a rolling upgrade? Let's say we only upgrade one node to 
the new version; before upgrading sstables, will the data coming in stay on that 
node and not be able to stream to other nodes?
2) What if I have very active writes? How much data can the node hold until it 
sees other nodes with the new version so it can stream?
3) Should I upgrade sstables when all nodes in one DC are upgraded, or wait 
until both DCs are upgraded?
4) Any idea or experience of how long it will take to upgrade sstables for 
600 GB of data on each node?
5) What is the max time I can take for the rolling upgrade on each DC?
6) I was doing a test with a 3-node cluster, one node on 2.1.18, the other two 
on 2.0.16. I got a warning on the node with the newer version when I tried to 
create a keyspace and insert some sample data:

"Warning: schema version mismatch detected, which might be caused by DOWN 
nodes; if this is not the case, check the schema versions of your nodes in 
system.local and system.peers. OperationTimedOut: errors={}, last_host=xxx"

But the data was upserted successfully, even though it is not visible on the 
other nodes. Any suggestion?

Great thanks for any help or comments!

- Park

Re: Migrate from DSE (Datastax) to Apache Cassandra

2017-08-16 Thread Felipe Esteves
Ioannis,
As some people have already said, there are one or two keyspaces that use
EverywhereStrategy; dse_system is one of them, if I'm not wrong.
You must remember to change them to a community replication strategy or the
migration will fail.
-- 



Full table scan with cassandra

2017-08-16 Thread Alex Kotelnikov
Hey,

we are trying Cassandra as an alternative for storing a huge stream of data
coming from our customers.

Storing works quite well, and I started to validate how retrieval performs. We
have two types of retrieval: fetching specific records and bulk retrieval for
general analysis.
Fetching a single record works like a charm, but it is not so with bulk fetches.

With a moderately small table of ~2 million records (~10 GB of raw data) I
observed very slow operation (using token(partition key) ranges). It takes
minutes to perform a full retrieval. We tried a couple of configurations
using virtual machines and real hardware, and overall it looks like it is not
possible to retrieve all table data in a reasonable time (by reasonable I mean
that, since we have a 1 Gbit network, 10 GB can be transferred in a couple of
minutes from one server to another, so with 10+ Cassandra servers and 10+
Spark executors the total time should be even smaller).

I tried the DataStax Spark connector. I also wrote a simple test case using the
DataStax Java driver and saw that a fetch of 10k records takes ~10s, so I
assume that a "sequential" scan will take 200x more time, i.e. ~30 minutes.

Maybe we are totally wrong in trying to use Cassandra this way?

-- 

Best Regards,


*Alexander Kotelnikov*

*Team Lead*

DIGINETICA
Retail Technology Company

m: +7.921.915.06.28

*www.diginetica.com*


Re: Migrate from DSE (Datastax) to Apache Cassandra

2017-08-16 Thread Ioannis Zafiropoulos
We use NetworkTopologyStrategy as the replication strategy.

The only DSE-specific features we use (left untouched at their defaults) are:
authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager

So I hope that by changing these to the COSS-recommended ones before the
migration, DSE will be able to switch to them on its own (?),
and then I can do the final transition to the tarball installation.
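
For reference, a sketch of the corresponding settings in cassandra.yaml on the COSS side (assuming password-based authentication is wanted):

```
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
role_manager: CassandraRoleManager
```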

Thank you all for your answers!

On Tue, Aug 15, 2017 at 10:42 PM, Jon Haddad 
wrote:

> I agree with Jeff, it’s not necessary to launch a new cluster for this
> operation.
>
> On Aug 15, 2017, at 7:39 PM, Jeff Jirsa  wrote:
>
> Or just alter the keyspace replication strategy and remove the DSE-specific
> strategies in favor of NetworkTopologyStrategy
>
>
> --
> Jeff Jirsa
>
>
> On Aug 15, 2017, at 7:26 PM, Erick Ramirez  wrote:
>
> Ioannis, it's not a straightforward process to migrate from DSE to COSS.
> There are some parts of DSE which are not recognised by COSS, e.g.
> EverywhereStrategy for replication, which is known only to DSE.
>
> You are better off standing up a new COSS 3.11 cluster and restoring the app
> keyspaces to the new cluster. Cheers!
>
> On Wed, Aug 16, 2017 at 6:33 AM, Ioannis Zafiropoulos 
> wrote:
>
>> Hi all,
>>
>> We have set up a new DSE 5.1.2 cluster (with Cassandra 3.11.0.1758) and we
>> want to migrate it to Apache Cassandra 3.11.0 without losing schema or
>> data.
>>
>> Anybody, has done it before?
>>
>> Obviously we are going to test this, but it would be nice to hear if
>> somebody else has gone through with the procedure.
>>
>> Thank you!
>>
>
>
>


RE: Attempted to write commit log entry for unrecognized table

2017-08-16 Thread Myron A. Semack
Restarting the Cassandra service resolved this issue.  Thanks for your advice!

Sincerely,
Myron A. Semack


From: kurt greaves [mailto:k...@instaclustr.com]
Sent: Tuesday, August 15, 2017 6:10 PM
To: User 
Subject: Re: Attempted to write commit log entry for unrecognized table

What does nodetool describecluster show?
A stab in the dark, but you could try nodetool resetlocalschema or a rolling 
restart of the cluster if it's a schema issue.