Re: question on maximum disk seeks

2017-03-20 Thread j.kesten
Hi,

You're right – one seek with a hit in the partition key cache and two if not.

That's the theory – but two things to mention:

First, you need two seeks per sstable, not per entire read. So if your data is 
spread over multiple sstables on disk, you obviously need more than two reads. 
Think of often-updated partition keys – in combination with memory pressure 
you can easily end up with many sstables (OK, they will be compacted at some 
point in the future).

Second, there could be fragmentation on disk which leads to seeks during 
sequential reads. 

Jan

Sent from my Windows 10 Phone

From: preetika tyagi
Sent: Monday, March 20, 2017 21:18
To: user@cassandra.apache.org
Subject: question on maximum disk seeks



I'm trying to understand the maximum number of disk seeks required in a read 
operation in Cassandra. I looked at several online articles including this one: 
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is 
for reading the partition index and another is to read the actual data from the 
compressed partition. The index of the data in compressed partitions is 
obtained from the compression offset tables (which is stored in memory). Am I 
on the right track here? Will there ever be a case when more than 1 disk seek 
is required to read the data?
Thanks,
Preetika




Re: How can I scale my read rate?

2017-03-20 Thread Alain Rastoul

On 20/03/2017 22:05, Michael Wojcikiewicz wrote:

Not sure if someone has suggested this, but I believe it's not
sufficient to simply add nodes to a cluster to increase read
performance: you also need to alter the ReplicationFactor of the
keyspace to a larger value as your cluster gets larger.

ie. data is available from more nodes in the cluster for each query.

Yes, good point: in case of cluster growth, there would be more replicas 
to handle the same key ranges.

And also readjust token ranges:
https://cassandra.apache.org/doc/latest/operating/topo_changes.html

SG, can you give some information (or share your code) about how you 
generate your data and how you read it?


--
best,
Alain



Scrubbing corrupted SStable.

2017-03-20 Thread Pranay akula
I am trying to scrub a column family using nodetool scrub. Is it going to
create snapshots only for the sstables which are corrupted, or for all the
sstables it is going to scrub? And to remove the snapshots created, is running
nodetool clearsnapshot enough, or do I need to manually delete the pre-scrub
data from snapshots of that column family?

I can see a significant increase in data size after starting the scrub.




Thanks
Pranay.


Re: Consistency Level vs. Retry Policy when no local nodes are available

2017-03-20 Thread Ben Slater
I think the general assumption is that DC failover happens at the client
app level rather than the Cassandra level, due to the potentially very
significant difference in request latency if you move from an app-local DC
to a remote DC. The preferred pattern for most people is that the app fails
in the failed DC and some load balancer above the app redirects traffic to a
different DC.

The other factor is that the fail-back scenario from a failed DC and
LOCAL_* consistencies is potentially complex. Do you want to immediately
start using the new DC when it becomes available (with missing data) or
wait until it catches up on writes (and how do you know when that has
happened)?

Note also that QUORUM is a clear majority of replicas across both DCs. Some
people run 3 DCs with RF 3 in each and QUORUM to maintain strong
consistency across DCs even with a DC failure.
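
For what it's worth, if you do decide to let the driver itself fall back to a
remote DC despite the caveats above, a minimal sketch with the Datastax Java
driver 3.x could look like this (contact point and DC name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class RemoteDcFallback {
    public static void main(String[] args) {
        // Allow up to 2 hosts per remote DC to be tried once all local hosts are down.
        // allowRemoteDCsForLocalConsistencyLevel() additionally permits LOCAL_* consistency
        // levels on those remote hosts - use with care, given the latency difference above.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")              // placeholder contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("DC1")       // placeholder local DC name
                                .withUsedHostsPerRemoteDc(2)
                                .allowRemoteDCsForLocalConsistencyLevel()
                                .build()))
                .build();
        System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
        cluster.close();
    }
}

This is also the likely explanation for the "no host was tried"
NoHostAvailableException further down the thread: with the default builder
settings the DC-aware policy keeps no remote hosts in its query plan, so when
the local DC is unreachable there is nothing left to try and the retry policy
is never invoked.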

Cheers
Ben

On Tue, 21 Mar 2017 at 10:00 Shannon Carey  wrote:

Specifically, this puts us in an awkward position because LOCAL_QUORUM is
desirable so that we don't have unnecessary cross-DC traffic from the
client by default, but we can't use it because it will cause complete
failure if the local DC goes down. And we can't use QUORUM because it would
fail if there's not a quorum in either DC (as would happen if one DC goes
down). So it seems like we are forced to use a lesser consistency such as
ONE or TWO.

-Shannon

From: Shannon Carey 
Date: Monday, March 20, 2017 at 5:25 PM
To: "user@cassandra.apache.org" 
Subject: Consistency Level vs. Retry Policy when no local nodes are
available

I am running DSE 5.0, and I have a Java client using the Datastax 3.0.0
client library.

The client is configured to use a DCAwareRoundRobinPolicy wrapped in a
TokenAwarePolicy. Nothing special.

When I run my query, I set a custom retry policy.

I am testing cross-DC failover. I have disabled connectivity to the "local"
DC (relative to my client) in order to perform the test. When I run a query
with the first consistency level set to LOCAL_ONE (or local anything), my
retry policy is never called and I always get this exception:
"com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
tried for query failed (no host was tried)"

getErrors() on the exception is empty.

This is contrary to my expectation that the first attempt would fail and
would allow my RetryPolicy to attempt a different (non-LOCAL) consistency
level. I have no choice but to avoid using any kind of LOCAL consistency
level throughout my applications. Is this expected? Or is there anything I
can do about it? Thanks! It certainly seems like a bug to me or at least
something that should be improved.

-Shannon

-- 


*Ben Slater*

*Chief Product Officer *

   




Re: Consistency Level vs. Retry Policy when no local nodes are available

2017-03-20 Thread Shannon Carey
Specifically, this puts us in an awkward position because LOCAL_QUORUM is 
desirable so that we don't have unnecessary cross-DC traffic from the client by 
default, but we can't use it because it will cause complete failure if the 
local DC goes down. And we can't use QUORUM because it would fail if there's 
not a quorum in either DC (as would happen if one DC goes down). So it seems 
like we are forced to use a lesser consistency such as ONE or TWO.

-Shannon

From: Shannon Carey
Date: Monday, March 20, 2017 at 5:25 PM
To: "user@cassandra.apache.org"
Subject: Consistency Level vs. Retry Policy when no local nodes are available

I am running DSE 5.0, and I have a Java client using the Datastax 3.0.0 client 
library.

The client is configured to use a DCAwareRoundRobinPolicy wrapped in a 
TokenAwarePolicy. Nothing special.

When I run my query, I set a custom retry policy.

I am testing cross-DC failover. I have disabled connectivity to the "local" DC 
(relative to my client) in order to perform the test. When I run a query with 
the first consistency level set to LOCAL_ONE (or local anything), my retry 
policy is never called and I always get this exception:
"com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) 
tried for query failed (no host was tried)"

getErrors() on the exception is empty.

This is contrary to my expectation that the first attempt would fail and would 
allow my RetryPolicy to attempt a different (non-LOCAL) consistency level. I 
have no choice but to avoid using any kind of LOCAL consistency level 
throughout my applications. Is this expected? Or is there anything I can do 
about it? Thanks! It certainly seems like a bug to me or at least something 
that should be improved.

-Shannon


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
Apologies for the stream-of-consciousness replies, but are the dropped
message stats output by tpstats an accumulation since the node came up, or
are there processes which clear and/or time-out the info?

On Mon, Mar 20, 2017 at 3:18 PM, Voytek Jarnot 
wrote:

> No dropped messages in tpstats on any of the nodes.
>
> On Mon, Mar 20, 2017 at 3:11 PM, Voytek Jarnot 
> wrote:
>
>> Appreciate the reply, Kurt.
>>
>> I sanitized it out of the traces, but all trace outputs listed the same
>> node for all three queries (1 working, 2 not working). Read repair chance
>> set to 0.0 as recommended when using TWCS.
>>
>> I'll check tpstats - in this environment, load is not an issue, but
>> network issues may be.
>>
>> On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves 
>> wrote:
>>
>>> As secondary indexes are stored individually on each node, what you're
>>> suggesting sounds exactly like a consistency issue. The fact that you read
>>> 0 cells on one query implies the node that got the query did not have any
>>> data for the row. The reason you would sometimes see different behaviours
>>> is likely because of read repairs. The fact that the repair fixes the
>>> issue pretty much guarantees it's a consistency issue.
>>>
>>> You should check for dropped mutations in tpstats/logs and, if they are
>>> occurring, try to stop that from happening (probably load related). You
>>> could also try performing reads and writes at LOCAL_QUORUM for stronger
>>> consistency; however, note this has a performance/latency impact.
>>>
>>>
>>>
>>
>


Re: question on maximum disk seeks

2017-03-20 Thread Jeff Jirsa


On 2017-03-20 13:17 (-0700), preetika tyagi  wrote: 
> I'm trying to understand the maximum number of disk seeks required in a
> read operation in Cassandra. I looked at several online articles including
> this one:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
> 
> As per my understanding, two disk seeks are required in the worst case. One
> is for reading the partition index and another is to read the actual data
> from the compressed partition. The index of the data in compressed
> partitions is obtained from the compression offset tables (which is stored
> in memory). Am I on the right track here? Will there ever be a case when
> more than 1 disk seek is required to read the data?
> 

That sounds right, but do note that it's PER SSTABLE in which the data is 
stored (or in which there's a bloom filter false positive). 



Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
No dropped messages in tpstats on any of the nodes.

On Mon, Mar 20, 2017 at 3:11 PM, Voytek Jarnot 
wrote:

> Appreciate the reply, Kurt.
>
> I sanitized it out of the traces, but all trace outputs listed the same
> node for all three queries (1 working, 2 not working). Read repair chance
> set to 0.0 as recommended when using TWCS.
>
> I'll check tpstats - in this environment, load is not an issue, but
> network issues may be.
>
> On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves 
> wrote:
>
>> As secondary indexes are stored individually on each node, what you're
>> suggesting sounds exactly like a consistency issue. The fact that you read
>> 0 cells on one query implies the node that got the query did not have any
>> data for the row. The reason you would sometimes see different behaviours
>> is likely because of read repairs. The fact that the repair fixes the
>> issue pretty much guarantees it's a consistency issue.
>>
>> You should check for dropped mutations in tpstats/logs and, if they are
>> occurring, try to stop that from happening (probably load related). You
>> could also try performing reads and writes at LOCAL_QUORUM for stronger
>> consistency; however, note this has a performance/latency impact.
>>
>>
>>
>


question on maximum disk seeks

2017-03-20 Thread preetika tyagi
I'm trying to understand the maximum number of disk seeks required in a
read operation in Cassandra. I looked at several online articles including
this one:
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html

As per my understanding, two disk seeks are required in the worst case. One
is for reading the partition index and another is to read the actual data
from the compressed partition. The index of the data in compressed
partitions is obtained from the compression offset tables (which is stored
in memory). Am I on the right track here? Will there ever be a case when
more than 1 disk seek is required to read the data?

Thanks,

Preetika


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread Voytek Jarnot
Appreciate the reply, Kurt.

I sanitized it out of the traces, but all trace outputs listed the same
node for all three queries (1 working, 2 not working). Read repair chance
set to 0.0 as recommended when using TWCS.

I'll check tpstats - in this environment, load is not an issue, but network
issues may be.

On Mon, Mar 20, 2017 at 2:42 PM, kurt greaves  wrote:

> As secondary indexes are stored individually on each node, what you're
> suggesting sounds exactly like a consistency issue. The fact that you read
> 0 cells on one query implies the node that got the query did not have any
> data for the row. The reason you would sometimes see different behaviours
> is likely because of read repairs. The fact that the repair fixes the
> issue pretty much guarantees it's a consistency issue.
>
> You should check for dropped mutations in tpstats/logs and, if they are
> occurring, try to stop that from happening (probably load related). You
> could also try performing reads and writes at LOCAL_QUORUM for stronger
> consistency; however, note this has a performance/latency impact.
>
>
>


Re: Purge data from repair_history table?

2017-03-20 Thread Jonathan Haddad
default_time_to_live is a convenience parameter that automatically applies
a TTL to incoming data.  Every field that's inserted can have a separate
TTL.

The TL;DR of all this is that changing default_time_to_live doesn't change
any existing data retroactively.  You'd have to truncate the table if you
want to drop the old data.
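
To make the per-write part concrete, a minimal sketch with the Datastax Java
driver (keyspace, table and column names are made up for illustration); the
USING TTL clause applies only to the cells written by that statement,
independently of the table's default_time_to_live:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

import java.util.UUID;

public class PerWriteTtl {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Hypothetical keyspace/table. USING TTL applies only to the cells written
            // here and takes precedence over the table's default_time_to_live.
            session.execute(new SimpleStatement(
                    "INSERT INTO demo_ks.events (id, payload) VALUES (?, ?) USING TTL 86400",
                    UUID.randomUUID(), "example payload"));
        }
    }
}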

On Mon, Mar 20, 2017 at 12:06 PM Gábor Auth  wrote:

> Hi,
>
> On Fri, Mar 17, 2017 at 2:22 PM Paulo Motta 
> wrote:
>
> It's safe to truncate this table since it's just used to inspect repairs
> for troubleshooting. You may also set a default TTL to avoid it from
> growing unbounded (this is going to be done by default on CASSANDRA-12701).
>
>
> I've made an alter on the repair_history and the parent_repair_history
> tables:
> ALTER TABLE system_distributed.repair_history WITH compaction =
> {'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_unit':'DAYS', 'compaction_window_size':'1'
> } AND default_time_to_live = 2592000;
>
> Does it affect the previous contents of the table, or do I need to truncate
> them manually? Is 'TRUNCATE' safe? :)
>
> Bye,
> Gábor Auth
>


Re: Very odd & inconsistent results from SASI query

2017-03-20 Thread kurt greaves
As secondary indexes are stored individually on each node, what you're
suggesting sounds exactly like a consistency issue. The fact that you read
0 cells on one query implies the node that got the query did not have any
data for the row. The reason you would sometimes see different behaviours
is likely because of read repairs. The fact that the repair fixes the
issue pretty much guarantees it's a consistency issue.

You should check for dropped mutations in tpstats/logs and, if they are
occurring, try to stop that from happening (probably load related). You
could also try performing reads and writes at LOCAL_QUORUM for stronger
consistency; however, note this has a performance/latency impact.
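
If it helps, setting the consistency level per statement is a one-liner with
the Datastax Java driver; a rough sketch (keyspace, table and bind value are
placeholders):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class LocalQuorumRead {
    // Placeholder query against a hypothetical table; the relevant part is the
    // setConsistencyLevel call, which applies to this statement only.
    static ResultSet readAtLocalQuorum(Session session, String id) {
        Statement stmt = new SimpleStatement(
                "SELECT * FROM demo_ks.demo_table WHERE id = ?", id)
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        return session.execute(stmt);
    }
}

A cluster-wide default can also be set by passing
new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)
to Cluster.builder().withQueryOptions(...).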


Re: Purge data from repair_history table?

2017-03-20 Thread Gábor Auth
Hi,

On Fri, Mar 17, 2017 at 2:22 PM Paulo Motta 
wrote:

> It's safe to truncate this table since it's just used to inspect repairs
> for troubleshooting. You may also set a default TTL to avoid it from
> growing unbounded (this is going to be done by default on CASSANDRA-12701).
>

I've made an alter on the repair_history and the parent_repair_history
tables:
ALTER TABLE system_distributed.repair_history WITH compaction =
{'class':'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
'compaction_window_unit':'DAYS', 'compaction_window_size':'1'
} AND default_time_to_live = 2592000;

Does it affect the previous contents of the table, or do I need to truncate
them manually? Is 'TRUNCATE' safe? :)

Bye,
Gábor Auth


Re: repair performance

2017-03-20 Thread daemeon reiydelle
I would zero in on network throughput, especially inter-rack trunks.


sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On Mar 17, 2017 2:07 PM, "Roland Otta"  wrote:

> hello,
>
> We are quite inexperienced with cassandra at the moment and are playing
> around with a new cluster we built up for getting familiar with
> cassandra and its possibilities.
>
> While getting familiar with that topic we recognized that repairs in
> our cluster take a long time. To get an idea of our current setup, here
> are some numbers:
>
> Our cluster currently consists of 4 nodes (replication factor 3).
> These nodes are all on dedicated physical hardware in our own
> datacenter. All of the nodes have
>
> 32 cores @ 2.9 GHz
> 64 GB RAM
> 2 SSDs (RAID 0), 900 GB each, for data
> 1 separate HDD for OS + commitlogs
>
> Current dataset:
> approx. 530 GB per node
> 21 tables (the biggest one has more than 200 GB / node)
>
> I already tried setting compaction throughput + streaming throughput to
> unlimited for testing purposes ... but that did not change anything.
>
> When checking system resources I cannot see any bottleneck (CPUs are
> pretty idle and we have no iowaits).
>
> When issuing a repair via
>
> nodetool repair -local on a node, the repair takes longer than a day.
> Is this normal or could we normally expect a faster repair?
>
> I also recognized that initializing new nodes in the datacenter was
> really slow (approx. 50 Mbit/s). Also here I expected much better
> performance - could those two problems be somehow related?
>
> br//
> roland


Re: spikes in blocked native transport requests

2017-03-20 Thread benjamin roth
Did you check STW GCs?
You can do that with 'nodetool gcstats', by looking at the gc.log or
observing GC related JMX metrics.
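
For the JMX route, a small sketch that reads the JVM's GC MBeans from a node
(the hostname is a placeholder, 7199 is Cassandra's default JMX port, and JMX
authentication, if your cluster requires it, is omitted):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPauseCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Cumulative GC counts and times (ms) since the node's JVM started.
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getPlatformMXBeans(mbsc, GarbageCollectorMXBean.class)) {
                System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        } finally {
            connector.close();
        }
    }
}

A large jump in collection time during the batch jobs would be a hint that
stop-the-world pauses are contributing to the blocked requests.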

2017-03-20 8:52 GMT+01:00 Roland Otta :

> We have a datacenter which is currently used exclusively for spark batch
> jobs.
>
> In case batch jobs are running against that environment we can see very
> high peaks in blocked native transport requests (up to 10k / minute).
>
> I am concerned because I guess that will slow down other queries (in case
> other applications are going to use that dc as well).
>
> I already tried increasing native_transport_max_threads +
> concurrent_reads without success.
>
> During the jobs I can't find any resource limitations on my hardware
> (iops, disk usage, cpu, ... all fine).
>
> Am I missing something? Any suggestions on how to cope with that?
>
> br//
> roland
>
>
>


spikes in blocked native transport requests

2017-03-20 Thread Roland Otta
We have a datacenter which is currently used exclusively for spark batch
jobs.

In case batch jobs are running against that environment we can see very
high peaks in blocked native transport requests (up to 10k / minute).

I am concerned because I guess that will slow down other queries (in case
other applications are going to use that dc as well).

I already tried increasing native_transport_max_threads +
concurrent_reads without success.

During the jobs I can't find any resource limitations on my hardware
(iops, disk usage, cpu, ... all fine).

Am I missing something? Any suggestions on how to cope with that?

br//
roland


 

Re: How can I scale my read rate?

2017-03-20 Thread Alain Rastoul

On 20/03/2017 02:35, S G wrote:

2)
https://docs.datastax.com/en/developer/java-driver/3.1/manual/statements/prepared/
tells me to avoid preparing select queries if I expect a change of
columns in my table down the road.
The problem is also related to select * which is considered bad practice 
with most databases...
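
As a rough illustration of that point (assuming the Datastax Java driver and a
made-up keyspace/table for the name/phone/age schema mentioned later in this
thread), preparing the statement with an explicit column list keeps its result
metadata stable even if columns are added later:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ExplicitColumns {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Naming the columns (instead of SELECT *) means adding a column to the
            // table later does not invalidate what this prepared statement returns.
            PreparedStatement ps = session.prepare(
                    "SELECT name, phone, age FROM demo_ks.users WHERE name = ? AND phone = ?");
            BoundStatement bound = ps.bind("alice", 5551234);
            Row row = session.execute(bound).one();
            if (row != null) {
                System.out.println(row.getString("name") + " -> age " + row.getInt("age"));
            }
        }
    }
}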



I did some more testing to see if my client machines were the bottleneck.
For a 6-node Cassandra cluster (each VM having 8-cores), I got 26,000
reads/sec for all of the following:
1) Client nodes:1, Threads: 60
2) Client nodes:3, Threads: 180
3) Client nodes:5, Threads: 300
4) Client nodes:10, Threads: 600
5) Client nodes:20, Threads: 1200

So adding more client nodes or threads to those client nodes is not
having any effect.
I am suspecting Cassandra is simply not allowing me to go any further.

> Primary keys for my schema are:
>  PRIMARY KEY((name, phone), age)
> name: text
> phone: int
> age: int

Yes, with such a PK the data must be spread over the whole cluster (also taking 
into account the partitioner); it is strange that the throughput doesn't scale.

I guess you have also verified that you select data randomly?

Maybe you could have a look at the system traces to see the query plan for 
some requests: if you are on a test cluster you can truncate the tables first 
(truncate system_traces.sessions; and truncate system_traces.events;), run a 
test, then select * from system_traces.events where session_id = xxx, with xxx 
being one of the sessions you pick in system_traces.sessions.

Try to see if you are not always hitting the same nodes.
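
If it is easier than digging through system_traces by hand, the driver can
also surface the trace per request (assuming the Datastax Java driver 3.x; the
query and keyspace/table are placeholders). A rough sketch that prints which
host and coordinator handled each execution:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ExecutionInfo;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TraceCoordinators {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            for (int i = 0; i < 10; i++) {
                // Vary the partition key so that different replicas should be hit;
                // enableTracing() asks the coordinator to record a trace for this request.
                SimpleStatement stmt = new SimpleStatement(
                        "SELECT name, phone, age FROM demo_ks.users WHERE name = ? AND phone = ?",
                        "user-" + i, 5551234);
                stmt.enableTracing();
                ResultSet rs = session.execute(stmt);
                ExecutionInfo info = rs.getExecutionInfo();
                QueryTrace trace = info.getQueryTrace(); // fetched from system_traces
                System.out.printf("queried host: %s, coordinator: %s, duration: %d us%n",
                        info.getQueriedHost(), trace.getCoordinator(), trace.getDurationMicros());
            }
        }
    }
}

Enabling tracing on every request adds overhead, so this is only meant for a
test run, not for production traffic.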


--
best,
Alain



Re: repair performance

2017-03-20 Thread Roland Otta
Good point! I did not (so far). I will do that - especially because I often see 
all compaction threads being used during repair (according to compactionstats).

Thank you also for your link recommendations. I will go through them.



On Sat, 2017-03-18 at 16:54 +, Thakrar, Jayesh wrote:
You changed compaction_throughput_mb_per_sec, but did you also increase 
concurrent_compactors?

In reference to the reaper and some other info I received on the user forum in 
response to my question on "nodetool repair", here are some useful links/slides -



https://www.datastax.com/dev/blog/repair-in-cassandra



https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/



http://www.slideshare.net/DataStax/real-world-tales-of-repair-alexander-dejanovski-the-last-pickle-cassandra-summit-2016



http://www.slideshare.net/DataStax/real-world-repairs-vinay-chella-netflix-cassandra-summit-2016




From: Roland Otta 
Date: Friday, March 17, 2017 at 5:47 PM
To: "user@cassandra.apache.org" 
Subject: Re: repair performance

Did not recognize that so far.

Thank you for the hint. I will definitely give it a try.

On Fri, 2017-03-17 at 22:32 +0100, benjamin roth wrote:
The fork from thelastpickle is. I'd recommend giving it a try over pure 
nodetool.

2017-03-17 22:30 GMT+01:00 Roland Otta:

Forgot to mention the version we are using:

We are using 3.0.7 - so I guess we should have incremental repairs by default.
It also prints out incremental: true when starting a repair:
INFO  [Thread-7281] 2017-03-17 09:40:32,059 RepairRunnable.java:125 - Starting 
repair command #7, repairing keyspace xxx with repair options (parallelism: 
parallel, primary range: false, incremental: true, job threads: 1, 
ColumnFamilies: [], dataCenters: [ProdDC2], hosts: [], # of ranges: 1758)

3.0.7 is also the reason why we are not using reaper ... as far as I could 
figure out, it's not compatible with 3.0+.



On Fri, 2017-03-17 at 22:13 +0100, benjamin roth wrote:
It depends a lot ...

- Repairs can be very slow, yes! (And unreliable, due to timeouts, outages, 
whatever)
- You can use incremental repairs to speed things up for regular repairs
- You can use "reaper" to schedule repairs and run them sliced, automated, 
failsafe

The time repairs actually take may vary a lot depending on how much data has 
to be streamed or how inconsistent your cluster is.

50 Mbit/s is really a bit low! The actual performance depends on many factors 
like your CPU, RAM, HDD/SSD, concurrency settings, and the load on the "old 
nodes" of the cluster. This is quite an individual problem that you have to 
track down yourself.

2017-03-17 22:07 GMT+01:00 Roland Otta:

hello,

We are quite inexperienced with cassandra at the moment and are playing
around with a new cluster we built up for getting familiar with
cassandra and its possibilities.

While getting familiar with that topic we recognized that repairs in
our cluster take a long time. To get an idea of our current setup, here
are some numbers:

Our cluster currently consists of 4 nodes (replication factor 3).
These nodes are all on dedicated physical hardware in our own
datacenter. All of the nodes have

32 cores @ 2.9 GHz
64 GB RAM
2 SSDs (RAID 0), 900 GB each, for data
1 separate HDD for OS + commitlogs

Current dataset:
approx. 530 GB per node
21 tables (the biggest one has more than 200 GB / node)

I already tried setting compaction throughput + streaming throughput to
unlimited for testing purposes ... but that did not change anything.

When checking system resources I cannot see any bottleneck (CPUs are
pretty idle and we have no iowaits).

When issuing a repair via

nodetool repair -local on a node, the repair takes longer than a day.
Is this normal or could we normally expect a faster repair?

I also recognized that initializing new nodes in the datacenter was
really slow (approx. 50 Mbit/s). Also here I expected much better
performance - could those two problems be somehow related?

br//
roland