Re: Unexplainably large reported partition sizes

2016-03-07 Thread Nate McCall
>
>
> Rob, can you remember which bug/jira this was? I have not been able to
> find it.
> I'm using 2.1.9.
>
>
https://issues.apache.org/jira/browse/CASSANDRA-7953

Rob may have a different one in mind, but I've seen something similar from
this issue. Fixed in 2.1.12.


-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Unexplainably large reported partition sizes

2016-03-07 Thread Tom van den Berge
Hi Bryan,


> Do you use any collections on this column family? We've had issues in the
> past with unexpectedly large partitions reported on data models with
> collections, which can also generate tons of tombstones on UPDATE (
> https://issues.apache.org/jira/browse/CASSANDRA-10547)
>

I was bitten by this one some time ago, too. I stopped using
collections because of it. The table in question doesn't use them either.

Thanks for the suggestion anyway!
Tom


Re: Unexplainably large reported partition sizes

2016-03-07 Thread Tom van den Berge
Hi Rob,

The reason I didn't dump the table with sstable2json is that I didn't think
of it ;) I just used it, and it looks very much like the "avalanche of
tombstones" bug you are describing!

I took one of the three sstables containing the key, and it resulted in a
4.75 million-line JSON file, of which 4.73 million lines contain a
tombstone ("t")!
The timestamps of the tombstones I've checked were all many months old, so
obviously compaction failed to clean them up. I can also see many, many
identical tombstoned rows.
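
A rough sketch of that check, in case it helps anyone else - the sstable
path and partition key here are hypothetical, and the exact -k key format
depends on your key type:

    # dump a single partition from one sstable (2.1-era tool)
    sstable2json /var/lib/cassandra/data/my_ks/my_table/my_ks-my_table-ka-1234-Data.db \
        -k mykey > partition.json

    # count the lines carrying the tombstone marker ("t")
    grep -c '"t"' partition.json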

Rob, can you remember which bug/jira this was? I have not been able to find
it.
I'm using 2.1.9.

Thanks a lot for pointing me in this direction!
Tom


Re: Unexplainably large reported partition sizes

2016-03-07 Thread Bryan Cheng
Hi Tom,

Do you use any collections on this column family? We've had issues in the
past with unexpectedly large partitions reported on data models with
collections, which can also generate tons of tombstones on UPDATE (
https://issues.apache.org/jira/browse/CASSANDRA-10547)
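
For anyone hitting this: overwriting a whole collection in an UPDATE writes
a range tombstone before inserting the new elements, while appending does
not. A minimal illustration (hypothetical table):

    CREATE TABLE user_tags (id int PRIMARY KEY, tags set<text>);

    -- full overwrite: tombstones the old set, then writes the new one
    UPDATE user_tags SET tags = {'a', 'b'} WHERE id = 1;

    -- append: adds elements without tombstoning the existing contents
    UPDATE user_tags SET tags = tags + {'c'} WHERE id = 1;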

--Bryan


On Mon, Mar 7, 2016 at 11:23 AM, Robert Coli  wrote:

> On Sat, Mar 5, 2016 at 9:16 AM, Tom van den Berge 
> wrote:
>
>> I don't think compression can be the cause of the difference, because of
>> two reasons:
>>
>
> Your two reasons seem legitimate.
>
> Though you say you do not frequently do DELETE and so it shouldn't be due
> to tombstones, there are semi-recent versions of Cassandra which create a
> runaway avalanche of tombstones that double every time they are compacted.
> What version are you running?
>
> Also, is there some reason you are not just dumping the table with
> sstable2json and inspecting the contents of the row in question?
>
> =Rob
>
>
>
>


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-07 Thread Ben Bromhead
+1 for
http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/



We also run Cassandra on t2.mediums for our developer clusters. You can
force Cassandra to do most "memory" things on disk instead (on-disk
compaction passes, flushing immediately to disk) and throttle client
connections. In fact, on the t2 series memory is not the biggest concern;
the CPU credit system is.
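
For reference, these are the kinds of knobs involved (2.1-era settings; the
values below are illustrative assumptions for a tiny node, not tested
recommendations):

    # cassandra-env.sh: shrink the JVM heap
    MAX_HEAP_SIZE="1G"
    HEAP_NEWSIZE="256M"

    # cassandra.yaml: trade memory for disk, and throttle clients
    memtable_heap_space_in_mb: 64       # small memtables, flushed early
    concurrent_compactors: 1
    compaction_throughput_mb_per_sec: 8
    key_cache_size_in_mb: 0             # read from disk instead of cache
    native_transport_max_threads: 32    # throttle client connections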

On Mon, 7 Mar 2016 at 11:53 Robert Coli  wrote:

> On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky 
> wrote:
>
>> Please review the minimum hardware requirements as clearly documented:
>>
>> http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html
>>
>
> That is a document for Datastax Cassandra, not Apache Cassandra. It's
> wonderful that Datastax provides docs, but Datastax Cassandra is a superset
> of Apache Cassandra. Presuming that the requirements of one are exactly
> equivalent to the requirements of the other is not necessarily reasonable.
>
> Please adjust your hardware usage to at least meet the clearly documented
>> minimum requirements. If you continue to encounter problems once you have
>> corrected your configuration error, please resubmit the details with
>> updated hardware configuration details.
>>
>
> Disagree. OP specifically stated that they knew this was not a recommended
> practice. It does not seem unlikely that they are constrained to use this
> hardware for reasons outside of their control.
>
>
>> Just to be clear, development on less than 4 GB is not supported and
>> production on less than 8 GB is not supported. Those are not suggestions or
>> guidelines or recommendations, they are absolute requirements.
>>
>
> What does "supported" mean here? That Datastax will not provide support if
> you do not follow the above recommendations? Because it certainly is
> "supported" in the sense of "it can be made to work" ... ?
>
> The premise of a minimum RAM level seems meaningless without context. How
> much data are you serving from your 2GB RAM node? What is the rate of
> client requests?
>
> To be clear, I don't recommend trying to run production Cassandra with
> under 8GB of RAM on your node, but "absolute requirement" is a serious
> overstatement.
>
>
> http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/
>
> Has some good discussion of how to run Cassandra in a low memory
> environment. Maybe someone should tell John that his 64MB of JVM heap for a
> test node is 62x too small to be "supported"? :D
>
> =Rob
>
> --
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-07 Thread Robert Coli
On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky 
wrote:

> Please review the minimum hardware requirements as clearly documented:
>
> http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html
>

That is a document for Datastax Cassandra, not Apache Cassandra. It's
wonderful that Datastax provides docs, but Datastax Cassandra is a superset
of Apache Cassandra. Presuming that the requirements of one are exactly
equivalent to the requirements of the other is not necessarily reasonable.

Please adjust your hardware usage to at least meet the clearly documented
> minimum requirements. If you continue to encounter problems once you have
> corrected your configuration error, please resubmit the details with
> updated hardware configuration details.
>

Disagree. OP specifically stated that they knew this was not a recommended
practice. It does not seem unlikely that they are constrained to use this
hardware for reasons outside of their control.


> Just to be clear, development on less than 4 GB is not supported and
> production on less than 8 GB is not supported. Those are not suggestions or
> guidelines or recommendations, they are absolute requirements.
>

What does "supported" mean here? That Datastax will not provide support if
you do not follow the above recommendations? Because it certainly is
"supported" in the sense of "it can be made to work" ... ?

The premise of a minimum RAM level seems meaningless without context. How
much data are you serving from your 2GB RAM node? What is the rate of
client requests?

To be clear, I don't recommend trying to run production Cassandra with
under 8GB of RAM on your node, but "absolute requirement" is a serious
overstatement.

http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/

Has some good discussion of how to run Cassandra in a low memory
environment. Maybe someone should tell John that his 64MB of JVM heap for a
test node is 62x too small to be "supported"? :D

=Rob


Re: Unexplainably large reported partition sizes

2016-03-07 Thread Robert Coli
On Sat, Mar 5, 2016 at 9:16 AM, Tom van den Berge  wrote:

> I don't think compression can be the cause of the difference, because of
> two reasons:
>

Your two reasons seem legitimate.

Though you say you do not frequently do DELETE and so it shouldn't be due
to tombstones, there are semi-recent versions of Cassandra which create a
runaway avalanche of tombstones that double every time they are compacted.
What version are you running?

Also, is there some reason you are not just dumping the table with
sstable2json and inspecting the contents of the row in question?

=Rob


Re: moving keyspaces to another disk while Cassandra is running

2016-03-07 Thread Robert Coli
On Mon, Mar 7, 2016 at 2:57 AM, Krzysztof Księżyk 
wrote:

> I can see in lsof output that even if a keyspace
> is not queried, Cassandra keeps its files open, so I guess it's not safe to
> hot-swap, but I'd like to make sure.
>

It is not safe for exactly this reason. Just restart your nodes.

Were I doing this process, I would:

1) do initial rsync
2) stop node
3) do rsync again, with --delete for files which are no longer in the
source. This is very important, or you risk resurrecting SSTables which
have already been compacted away, which can be PERMANENTLY FATAL TO THE
CONSISTENCY OF ALL INVOLVED DATA.
4) start node
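
A minimal sketch of that sequence, folding in the symlink swap from the
original question (paths and service name are hypothetical):

    # 1) initial rsync while the node is still running
    rsync -a /var/lib/cassandra/data/old_ks/ /mnt/hdd/cassandra/old_ks/

    # 2) stop the node (drain first so the commitlog is flushed)
    nodetool drain && sudo service cassandra stop

    # 3) rsync again; --delete drops files compacted away since step 1
    rsync -a --delete /var/lib/cassandra/data/old_ks/ /mnt/hdd/cassandra/old_ks/
    mv /var/lib/cassandra/data/old_ks /var/lib/cassandra/data/old_ks.bak
    ln -s /mnt/hdd/cassandra/old_ks /var/lib/cassandra/data/old_ks

    # 4) start the node
    sudo service cassandra start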

=Rob


Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-07 Thread Bhuvan Rawal
Thanks for the correction, Jon. (At most 2000 queries *per cluster* for
serving 100 searches.)

On Mon, Mar 7, 2016 at 11:47 PM, Jonathan Haddad  wrote:

> If you're doing 100 searches a second, each machine will be serving at
> most 100 requests per second, not 2000.
>
> On Mon, Mar 7, 2016 at 10:13 AM Bhuvan Rawal  wrote:
>
>> Well, that's certainly true; there are these points worth discussing here:
>>
>> 1. Scatter-gather queries - Especially if the cluster size is large. Say
>> we have a 20-node cluster, and we are searching 100 times a second. Then
>> effectively the coordinator would be hitting each node 2000 times (20*100).
>> That factor will only increase as the number of nodes goes higher. I'm sure
>> having a centralized index alleviates that problem.
>> 2. High Cardinality (for columns like email / phone number)
>> 3. Low Cardinality (Boolean column or any column with a limited set of
>> available options).
>>
>> SASI seems to be a good solution for LIKE queries; this doc looks
>> really promising. But wouldn't it be better to tackle the use cases of
>> search differently from data storage ones, from a design standpoint?
>>
>> On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky 
>> wrote:
>>
>>> I don't have any direct personal experience with Stratio. It will all
>>> depend on your queries and your data cardinality - some queries are fine
>>> with secondary indexes while others are quite poor. Ditto for Lucene and
>>> Solr.
>>>
>>> It is also worth noting that the new SASI feature of Cassandra supports
>>> keyword and prefix/suffix search. But it doesn't support multi-column ad
>>> hoc queries, which is what people tend to use Lucene and Solr for. So,
>>> again, it all depends on your queries and your data cardinality.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal 
>>> wrote:
>>>
 Yes Jack, we are rolling out with Stratio right now; we will assess the
 performance benefit it yields and can go for ElasticSearch/Solr later.

 As per your experience how does Stratio perform vis-a-vis Secondary
 Indexes?

 On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky <
 jack.krupan...@gmail.com> wrote:

> You haven't been clear about how you intend to add Solr. You can also
> use Stratio or Stargate for basic Lucene search if you don't need full
> Solr support and want to stick to open source rather than go with DSE
> Search for Solr.
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal 
> wrote:
>
>> Thanks Sean and Nirmallaya.
>>
>> @Jack, we are going with DSC right now and plan to use Spark and
>> later Solr over the analytics DC. The use case is to keep OLAP and OLTP
>> workloads separated and not intertwine them, whether that is achieved by
>> creating a new DC or a new cluster altogether. From Nirmallaya's and
>> Sean's answers I understand that it is easily achievable by creating a
>> separate DC; the app client will need to be made DC-aware so that it does
>> not use a coordinator in DC3. The same goes for the Spark configuration:
>> it should read from the 3rd DC. Correct me if I'm wrong.
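
(A sketch of the DC-aware client side with the DataStax Java driver - the
contact point and DC names are hypothetical; the Spark side would similarly
be pinned to the analytics DC via the connector's
spark.cassandra.connection.local_dc setting:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    public class DcAwareClient {
        public static void main(String[] args) {
            // Route application queries to the OLTP DC only, so analytics
            // nodes in DC3 never act as coordinators for this client.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")          // hypothetical seed
                    .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                            .withLocalDc("DC_OLTP")       // hypothetical DC name
                            .build())
                    .build();
            System.out.println(cluster.getMetadata().getClusterName());
            cluster.close();
        }
    }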
>>
>> On Mar 4, 2016 7:55 PM, "Jack Krupansky" 
>> wrote:
>> >
>> > DataStax Enterprise (DSE) should be fine for three or even four
>> data centers in the same cluster. Or are you talking about some custom 
>> Solr
>> implementation?
>> >
>> > -- Jack Krupansky
>> >
>> > On Fri, Mar 4, 2016 at 9:21 AM, 
>> wrote:
>> >>
>> >> Sure. Just add a new DC. Alter your keyspaces with a new
>> replication factor for that DC. Run repairs on the new DC to get the data
>> streamed. Then make sure your clients only connect to the DC(s) that they
>> need.
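
(The keyspace alteration Sean describes would look roughly like this -
keyspace name, DC names and replication factors are hypothetical, assuming
NetworkTopologyStrategy:)

    -- existing DC keeps RF 3; the new analytics DC gets a smaller RF 2
    ALTER KEYSPACE my_ks WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'DC_OLTP': '3',
      'DC_ANALYTICS': '2'
    };
    -- then stream existing data to the new DC, e.g. by running repairs as
    -- Sean suggests, or "nodetool rebuild -- DC_OLTP" on each new-DC node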
>> >>
>> >>
>> >>
>> >> Separation of workloads is one of the key powers of a Cassandra
>> cluster.
>> >>
>> >>
>> >>
>> >> You may want to look at different configurations for the analytics
>> cluster – smaller replication factor, more memory per node, more disk per
>> node, perhaps fewer vnodes. Others may chime in with their experience.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Sean Durity
>> >>
>> >>
>> >>
>> >> From: Bhuvan Rawal [mailto:bhu1ra...@gmail.com]
>> >> Sent: Friday, March 04, 2016 3:27 AM
>> >> To: user@cassandra.apache.org
>> >> Subject: How to create an additional cluster in Cassandra
>> exclusively for Analytics Purpose
>> >>
>> >>
>> >>
>> >> Hi,
>> >>
>> >>
>> >>
>> >> We would like to create an additional C* data center for batch
>> processing using spark on CFS. We 

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-07 Thread Jonathan Haddad
If you're doing 100 searches a second, each machine will be serving at
most 100 requests per second, not 2000.

On Mon, Mar 7, 2016 at 10:13 AM Bhuvan Rawal  wrote:

> Well, that's certainly true; there are these points worth discussing here:
>
> 1. Scatter-gather queries - Especially if the cluster size is large. Say
> we have a 20-node cluster, and we are searching 100 times a second. Then
> effectively the coordinator would be hitting each node 2000 times (20*100).
> That factor will only increase as the number of nodes goes higher. I'm sure
> having a centralized index alleviates that problem.
> 2. High Cardinality (for columns like email / phone number)
> 3. Low Cardinality (Boolean column or any column with a limited set of
> available options).
>
> SASI seems to be a good solution for LIKE queries; this doc looks really
> promising. But wouldn't it be better to tackle the use cases of search
> differently from data storage ones, from a design standpoint?
>
> On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky 
> wrote:
>
>> I don't have any direct personal experience with Stratio. It will all
>> depend on your queries and your data cardinality - some queries are fine
>> with secondary indexes while others are quite poor. Ditto for Lucene and
>> Solr.
>>
>> It is also worth noting that the new SASI feature of Cassandra supports
>> keyword and prefix/suffix search. But it doesn't support multi-column ad
>> hoc queries, which is what people tend to use Lucene and Solr for. So,
>> again, it all depends on your queries and your data cardinality.
>>
>> -- Jack Krupansky
>>
>> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal  wrote:
>>
>>> Yes Jack, we are rolling out with Stratio right now; we will assess the
>>> performance benefit it yields and can go for ElasticSearch/Solr later.
>>>
>>> As per your experience how does Stratio perform vis-a-vis Secondary
>>> Indexes?
>>>
>>> On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>>
 You haven't been clear about how you intend to add Solr. You can also
 use Stratio or Stargate for basic Lucene search if you don't need full
 Solr support and want to stick to open source rather than go with DSE
 Search for Solr.

 -- Jack Krupansky

 On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal 
 wrote:

> Thanks Sean and Nirmallaya.
>
> @Jack, we are going with DSC right now and plan to use Spark and later
> Solr over the analytics DC. The use case is to keep OLAP and OLTP
> workloads separated and not intertwine them, whether that is achieved by
> creating a new DC or a new cluster altogether. From Nirmallaya's and
> Sean's answers I understand that it is easily achievable by creating a
> separate DC; the app client will need to be made DC-aware so that it does
> not use a coordinator in DC3. The same goes for the Spark configuration:
> it should read from the 3rd DC. Correct me if I'm wrong.
>
> On Mar 4, 2016 7:55 PM, "Jack Krupansky" 
> wrote:
> >
> > DataStax Enterprise (DSE) should be fine for three or even four data
> centers in the same cluster. Or are you talking about some custom Solr
> implementation?
> >
> > -- Jack Krupansky
> >
> > On Fri, Mar 4, 2016 at 9:21 AM,  wrote:
> >>
> >> Sure. Just add a new DC. Alter your keyspaces with a new
> replication factor for that DC. Run repairs on the new DC to get the data
> streamed. Then make sure your clients only connect to the DC(s) that they
> need.
> >>
> >>
> >>
> >> Separation of workloads is one of the key powers of a Cassandra
> cluster.
> >>
> >>
> >>
> >> You may want to look at different configurations for the analytics
> cluster – smaller replication factor, more memory per node, more disk per
> node, perhaps fewer vnodes. Others may chime in with their experience.
> >>
> >>
> >>
> >>
> >>
> >> Sean Durity
> >>
> >>
> >>
> >> From: Bhuvan Rawal [mailto:bhu1ra...@gmail.com]
> >> Sent: Friday, March 04, 2016 3:27 AM
> >> To: user@cassandra.apache.org
> >> Subject: How to create an additional cluster in Cassandra
> exclusively for Analytics Purpose
> >>
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >> We would like to create an additional C* data center for batch
> processing using spark on CFS. We would like to limit this DC exclusively
> for Spark operations and would like to continue the Application Servers to
> continue fetching data from OLTP.
> >>
> >>
> >>
> >> Is there any way to configure the same?
> >>
> >>
> >>
> >>
> >> ​
> >>
> >> Regards,
> >>

Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

2016-03-07 Thread Bhuvan Rawal
Well, that's certainly true; there are these points worth discussing here:

1. Scatter-gather queries - Especially if the cluster size is large. Say we
have a 20-node cluster, and we are searching 100 times a second. Then
effectively the coordinator would be hitting each node 2000 times (20*100).
That factor will only increase as the number of nodes goes higher. I'm sure
having a centralized index alleviates that problem.
2. High Cardinality (for columns like email / phone number)
3. Low Cardinality (Boolean column or any column with a limited set of
available options).

SASI seems to be a good solution for LIKE queries; this doc looks really
promising. But wouldn't it be better to tackle the use cases of search
differently from data storage ones, from a design standpoint?

On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky 
wrote:

> I don't have any direct personal experience with Stratio. It will all
> depend on your queries and your data cardinality - some queries are fine
> with secondary indexes while others are quite poor. Ditto for Lucene and
> Solr.
>
> It is also worth noting that the new SASI feature of Cassandra supports
> keyword and prefix/suffix search. But it doesn't support multi-column ad
> hoc queries, which is what people tend to use Lucene and Solr for. So,
> again, it all depends on your queries and your data cardinality.
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal  wrote:
>
>> Yes Jack, we are rolling out with Stratio right now; we will assess the
>> performance benefit it yields and can go for ElasticSearch/Solr later.
>>
>> As per your experience how does Stratio perform vis-a-vis Secondary
>> Indexes?
>>
>> On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky > > wrote:
>>
>>> You haven't been clear about how you intend to add Solr. You can also
>>> use Stratio or Stargate for basic Lucene search if you don't need full
>>> Solr support and want to stick to open source rather than go with DSE
>>> Search for Solr.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal 
>>> wrote:
>>>
 Thanks Sean and Nirmallaya.

 @Jack, we are going with DSC right now and plan to use Spark and later
 Solr over the analytics DC. The use case is to keep OLAP and OLTP
 workloads separated and not intertwine them, whether that is achieved by
 creating a new DC or a new cluster altogether. From Nirmallaya's and Sean's
 answers I understand that it is easily achievable by creating a separate
 DC; the app client will need to be made DC-aware so that it does not use a
 coordinator in DC3. The same goes for the Spark configuration: it should
 read from the 3rd DC. Correct me if I'm wrong.

 On Mar 4, 2016 7:55 PM, "Jack Krupansky" 
 wrote:
 >
 > DataStax Enterprise (DSE) should be fine for three or even four data
 centers in the same cluster. Or are you talking about some custom Solr
 implementation?
 >
 > -- Jack Krupansky
 >
 > On Fri, Mar 4, 2016 at 9:21 AM,  wrote:
 >>
 >> Sure. Just add a new DC. Alter your keyspaces with a new replication
 factor for that DC. Run repairs on the new DC to get the data streamed.
 Then make sure your clients only connect to the DC(s) that they need.
 >>
 >>
 >>
 >> Separation of workloads is one of the key powers of a Cassandra
 cluster.
 >>
 >>
 >>
 >> You may want to look at different configurations for the analytics
 cluster – smaller replication factor, more memory per node, more disk per
 node, perhaps fewer vnodes. Others may chime in with their experience.
 >>
 >>
 >>
 >>
 >>
 >> Sean Durity
 >>
 >>
 >>
 >> From: Bhuvan Rawal [mailto:bhu1ra...@gmail.com]
 >> Sent: Friday, March 04, 2016 3:27 AM
 >> To: user@cassandra.apache.org
 >> Subject: How to create an additional cluster in Cassandra
 exclusively for Analytics Purpose
 >>
 >>
 >>
 >> Hi,
 >>
 >>
 >>
 >> We would like to create an additional C* data center for batch
 processing using spark on CFS. We would like to limit this DC exclusively
 for Spark operations and would like to continue the Application Servers to
 continue fetching data from OLTP.
 >>
 >>
 >>
 >> Is there any way to configure the same?
 >>
 >>
 >>
 >>
 >> ​
 >>
 >> Regards,
 >>
 >> Bhuvan
 >>
 >>
 >> 
 >>
 >> The information in this Internet Email is confidential and may be
 legally privileged. It is intended solely for the addressee. Access to this
 Email by anyone else is unauthorized. If you are not the intended
 recipient, any disclosure, copying, 

Query regarding filter and where in spark on cassandra

2016-03-07 Thread Siddharth Verma
Hi,
While working with Spark running on top of Cassandra, I wanted to do some
filtering on data.
It can be done either on the server side (a where clause when the
cassandraTable query is written) or on the client side (a filter
transformation on the RDD).
Which of them is preferred, keeping performance and time in mind?

I am using the Spark Java connector.


References:
1. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md
   (see the description of filtering to understand the limitations of the
   where method)
2. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
   "To filter rows, you can use the filter transformation provided by Spark
   ... To avoid this overhead, CassandraRDD offers the where method, which
   lets you pass arbitrary CQL condition(s) to filter the row set on the
   server."
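
For concreteness, here is a sketch of the two options with the Java API
(keyspace, table and column names are hypothetical; where() only accepts
predicates Cassandra itself can serve, such as clustering or indexed
columns):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import com.datastax.spark.connector.japi.CassandraRow;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FilterVsWhere {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("filter-vs-where")
                    .set("spark.cassandra.connection.host", "127.0.0.1");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Server side: the predicate is pushed down to Cassandra as CQL,
            // so only matching rows are read and shipped to Spark.
            JavaRDD<CassandraRow> serverSide =
                    javaFunctions(sc).cassandraTable("my_ks", "events")
                            .where("day = ?", "2016-03-07");

            // Client side: every row of the table is read into Spark first,
            // then non-matching rows are dropped in the Spark workers.
            JavaRDD<CassandraRow> clientSide =
                    javaFunctions(sc).cassandraTable("my_ks", "events")
                            .filter(row -> "2016-03-07".equals(row.getString("day")));

            System.out.println(serverSide.count() + " / " + clientSide.count());
            sc.stop();
        }
    }

As a rule of thumb, when the predicate can be expressed in CQL, the
server-side where() avoids reading the whole table and is usually the
faster choice; the client-side filter() is the fallback for everything else.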

Thanks and Regards

Siddharth Verma

Software Engineer

CA2125, 2nd Floor, ASF Centre-A, Jwala Mill Road,
Udyog Vihar Phase - IV, Gurgaon-122016, INDIA



moving keyspaces to another disk while Cassandra is running

2016-03-07 Thread Krzysztof Księżyk
Hi,

I have a small Cassandra cluster running on boxes with a 256GB SSD and a 2TB
HDD. Originally the SSD was for the system and commit log, and the HDD for
data. Unfortunately, because of the nature of the queries, performance was not
satisfactory, and to improve it the data were moved to the SSD as well. Now
the problem is that the SSD is too small to hold all the data. As one keyspace
is created per month, my idea was to move historical keyspaces to the HDD and
make a symlink. These historical keyspaces are not queried often, so it
shouldn't affect performance much. I've written a simple script that rsyncs
the data, stops Cassandra, makes the symlink and starts the Cassandra node
again, but my question is whether there would be a problem if I did the swap
hot - without stopping the Cassandra daemon. So... rsync, rename the current
keyspace folder in the data dir, make a symlink to the new location on the
HDD. One good thing is that once a keyspace is fully fed with data, it no
longer changes. I can see in lsof output that even if a keyspace is not
queried, Cassandra keeps its files open, so I guess it's not safe to hot-swap,
but I'd like to make sure.

Kind regards -
Krzysztof Ksiezyk