Re: Data corruption, invalid UTF-8 bytes

2018-01-03 Thread Stefano Ortolani
Little update.

I've managed to compute the token, and I can indeed SELECT the row from
CQLSH.
Interestingly enough, if I use CQLSH I do not get the exception (even though
the garbled string is printed out).
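For reference, once the token is known the row can be selected without typing
the unrepresentable key by restricting on token(). A minimal sketch using the
Python driver follows; the keyspace, table, and column names are placeholders,
not the actual schema from this report.

# Hedged sketch: fetch a partition by its token when the key itself cannot
# be typed as valid text. Keyspace/table/column names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Token previously computed from the raw partition-key bytes.
token_value = -1234567890123456789

rows = session.execute(
    "SELECT filename, token(filename) FROM my_keyspace.my_table "
    "WHERE token(filename) = %s",
    (token_value,),
)
for row in rows:
    print(row)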

I am now wondering whether, instead of data corruption, the error is
related to the read path used by the Java driver, but I fail to see how
that path could differ from the one used by CQLSH (python).
Could anybody more familiar with the read path shed some light on the
stack trace?

Thanks,
Stefano

On Tue, Jan 2, 2018 at 6:44 PM, Stefano Ortolani <ostef...@gmail.com> wrote:

> Hi all,
>
> apparently the year started with a node (version 3.0.15) exhibiting some
> data corruption (discovered by a spark job enumerating all keys).
>
> The exception is attached below.
>
> The invalid string is a partition key, and it is supposed to be a file
> name. If I manually decode the bytes I get something that resembles a path
> but with lots of garbage inside.
>
> Now part of the garbage might be intentional, which means I am still
> wondering whether this is actual data corruption or whether the input string
> was already corrupted. In the latter case, though, the string should not have
> been inserted: the driver or Cassandra should have rejected the WRITE, am I
> correct?
>
> I admit I am a bit stuck, because obviously nodetool getsstables doesn't work
> (the key cannot be passed as valid text).
> Does anybody have any suggestion on how to deal with this situation?
>
> Ideally I would like to identify the sstable, but also check that the
> corruption did not replicate to other nodes, and possibly delete the
> offending partition.
>
> A starting point would be to generate the token associated with the
> offending key. Since I am using the Murmur3 partitioner I was planning to
> just compute the token value of that byte sequence. Would this be a sound
> approach?
>
> Any suggestion is more than welcome :S
>
> Cheers,
> Stefano
>
> WARN  [SharedPool-Worker-18] 2018-01-02 16:49:43,861
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-18,5,main]: {}
> org.apache.cassandra.serializers.MarshalException: Invalid UTF-8 bytes
> 433a5c484a5c5858505c17443535575c5d425a5c515f5c20203133203133
> 5c2e4d5c2c364256563b5c203230dbb74c5c2754345c532031452121445c
> 7f584d7f7f555c4e485757455c56203330585c465c203144334f5f5f345c
> 29605c38495c2033415d595c4c335c203134365c364e4a5c2c46e9bdbd48
> 5c5a39397e5c203231edb9b9235cc6a7592d432d435d5c354e595c45495c
> 1738525c442032324747485c55203230265c43355c4b353b565c4429dbb7
> 495c23525c2031442031443b5c45552e141457
> at org.apache.cassandra.serializers.AbstractTextSerializer.deserialize(
> AbstractTextSerializer.java:45) ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.serializers.AbstractTextSerializer.deserialize(
> AbstractTextSerializer.java:28) ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.db.marshal.AbstractType.
> getString(AbstractType.java:130) ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.dht.AbstractBounds.format(AbstractBounds.java:130)
> ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.dht.AbstractBounds.getString(AbstractBounds.java:123)
> ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.db.PartitionRangeReadCommand.queryStorage(
> PartitionRangeReadCommand.java:245) ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:405)
> ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(
> ReadCommandVerbHandler.java:45) ~[apache-cassandra-3.0.15.jar:3.0.15]
> at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> ~[apache-cassandra-3.0.15.jar:3.0.15]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_131]
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorServ
> ice$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
> ~[apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorServ
> ice$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
> [apache-cassandra-3.0.15.jar:3.0.15]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> [apache-cassandra-3.0.15.jar:3.0.15]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
>
>
>


Data corruption, invalid UTF-8 bytes

2018-01-02 Thread Stefano Ortolani
Hi all,

apparently the year started with a node (version 3.0.15) exhibiting some
data corruption (discovered by a spark job enumerating all keys).

The exception is attached below.

The invalid string is a partition key, and it is supposed to be a file
name. If I manually decode the bytes I get something that resembles a path
but with lots of garbage inside.
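For reference, the manual decoding can be reproduced from the hex dump in the
stack trace with a few lines of standard-library Python (a sketch; using
errors="replace" keeps the readable parts of the path visible):

# Raw partition-key bytes, copied verbatim from the exception message below.
raw_hex = (
    "433a5c484a5c5858505c17443535575c5d425a5c515f5c20203133203133"
    "5c2e4d5c2c364256563b5c203230dbb74c5c2754345c532031452121445c"
    "7f584d7f7f555c4e485757455c56203330585c465c203144334f5f5f345c"
    "29605c38495c2033415d595c4c335c203134365c364e4a5c2c46e9bdbd48"
    "5c5a39397e5c203231edb9b9235cc6a7592d432d435d5c354e595c45495c"
    "1738525c442032324747485c55203230265c43355c4b353b565c4429dbb7"
    "495c23525c2031442031443b5c45552e141457"
)

key_bytes = bytes.fromhex(raw_hex)
print(len(key_bytes), "bytes")

# Replace invalid UTF-8 sequences so the readable parts of the path survive.
print(key_bytes.decode("utf-8", errors="replace"))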

Now part of the garbage might be intentional, which means I am still
wondering whether this is actual data corruption or whether the input string
was already corrupted. In the latter case, though, the string should not have
been inserted: the driver or Cassandra should have rejected the WRITE, am I
correct?

I admit I am a bit stuck, because obviously nodetool getsstables doesn't work
(the key cannot be passed as valid text).
Does anybody have any suggestion on how to deal with this situation?

Ideally I would like to identify the sstable, but also check that the
corruption did not replicate to other nodes, and possibly delete the
offending partition.

A starting point would be to generate the token associated with the
offending key. Since I am using the Murmur3 partitioner I was planning to
just compute the token value of that byte sequence. Would this be a sound
approach?
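For what it's worth, the token computation can be done offline. The sketch
below assumes the DataStax Python driver (cassandra-driver), whose
Murmur3Token helper reproduces Cassandra's own Murmur3 variant (it treats
bytes as signed, so generic murmur3 libraries can disagree on keys containing
bytes >= 0x80); the helper name and usage are as found in driver versions of
that era, so double-check against the installed driver.

# Hedged sketch: compute the Murmur3 token of a raw partition key, assuming
# the DataStax Python driver's Murmur3Token helper is available.
import sys

from cassandra.metadata import Murmur3Token

# Pass the hex dump from the exception message as the first argument, e.g.
#   python compute_token.py 433a5c484a5c58...141457
key_bytes = bytes.fromhex(sys.argv[1])

token_value = Murmur3Token.hash_fn(key_bytes)
print(token_value)

# The value can then be compared against each node's token ranges, or used in
# a query of the form: SELECT ... WHERE token(<pk column>) = <token_value>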

Any suggestion is more than welcome :S

Cheers,
Stefano

WARN  [SharedPool-Worker-18] 2018-01-02 16:49:43,861
AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
Thread[SharedPool-Worker-18,5,main]: {}
org.apache.cassandra.serializers.MarshalException: Invalid UTF-8 bytes
433a5c484a5c5858505c17443535575c5d425a5c515f5c202031332031335c2e4d5c2c364256563b5c203230dbb74c5c2754345c532031452121445c7f584d7f7f555c4e485757455c56203330585c465c203144334f5f5f345c29605c38495c2033415d595c4c335c203134365c364e4a5c2c46e9bdbd485c5a39397e5c203231edb9b9235cc6a7592d432d435d5c354e595c45495c1738525c442032324747485c55203230265c43355c4b353b565c4429dbb7495c23525c2031442031443b5c45552e141457
at
org.apache.cassandra.serializers.AbstractTextSerializer.deserialize(AbstractTextSerializer.java:45)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.serializers.AbstractTextSerializer.deserialize(AbstractTextSerializer.java:28)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.db.marshal.AbstractType.getString(AbstractType.java:130)
~[apache-cassandra-3.0.15.jar:3.0.15]
at org.apache.cassandra.dht.AbstractBounds.format(AbstractBounds.java:130)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.dht.AbstractBounds.getString(AbstractBounds.java:123)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.db.PartitionRangeReadCommand.queryStorage(PartitionRangeReadCommand.java:245)
~[apache-cassandra-3.0.15.jar:3.0.15]
at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:405)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:45)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
~[apache-cassandra-3.0.15.jar:3.0.15]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
~[na:1.8.0_131]
at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
~[apache-cassandra-3.0.15.jar:3.0.15]
at
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
[apache-cassandra-3.0.15.jar:3.0.15]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
[apache-cassandra-3.0.15.jar:3.0.15]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]


Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-15 Thread Stefano Ortolani
Nice catch!
I’ve totally overlooked it.

Thanks a lot!
Stefano

On Sun, 15 Oct 2017 at 22:14, Jeff Jirsa <jji...@gmail.com> wrote:

> (Should still be able to complete, unless you’re running out of disk or
> memory or similar, but that’s why it’s streaming more than you expect)
>
>
> --
> Jeff Jirsa
>
>
> On Oct 15, 2017, at 1:51 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> You’re adding the new node as rac3.
>
> The rack aware policy is going to make sure you get the rack diversity you
> asked for by making sure one replica of each partition is in rac3, which is
> going to blow up that instance
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 15, 2017, at 1:42 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi Jeff,
>
> this is my third attempt at bootstrapping the node, so I have tried several
> tricks that might partially explain the output I am posting.
>
> * To make the bootstrap incremental, I have been throttling the streams on
> all nodes to 1 Mbit/s. I have been selectively unthrottling one node at a
> time, hoping that would unlock some routines compacting away redundant data
> (you'll see that nodetool netstats reports back fewer nodes than nodetool
> status).
> * Since compactions have had the tendency of getting stuck (hundreds
> pending but none executing) in previous bootstraps, I've tried issuing a
> manual "nodetool compact" on the bootstrapping node.
>
> Having said that, here is the output of the commands:
>
> Thanks a lot,
> Stefano
>
> *nodetool status*
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load   Tokens   Owns    Host ID
> Rack
> UN  X.Y.33.8   342.4 GB   256  ?
> afaae414-30cc-439d-9785-1b7d35f74529  RAC1
> UN  X.Y.81.4   325.98 GB  256  ?
> 00a96a5d-3bfd-497f-91f3-973b75146162  RAC2
> UN  X.Y.33.4   348.81 GB  256  ?
> 1d8e6588-e25b-456a-8f29-0dedc35bda8e  RAC1
> UN  X.Y.33.5   384.99 GB  256  ?
> 13d03fd2-7528-466b-b4b5-1b46508e2465  RAC1
> UN  X.Y.81.5   336.27 GB  256  ?
> aa161400-6c0e-4bde-bcb3-b2e7e7840196  RAC2
> UN  X.Y.33.6   377.22 GB  256  ?
> 43a393ba-6805-4e33-866f-124360174b28  RAC1
> UN  X.Y.81.6   329.61 GB  256  ?
> 4c3c64ae-ef4f-4986-9341-573830416997  RAC2
> UN  X.Y.33.7   344.25 GB  256  ?
> 03d81879-dc0d-4118-92e3-b3013dfde480  RAC1
> UN  X.Y.81.7   324.93 GB  256  ?
> 24bbf4b6-9427-4ed1-a751-a55cc24cc756  RAC2
> UN  X.Y.81.1   323.8 GB   256  ?
> 26244100-0565-4567-ae9c-0fc5346f5558  RAC2
> UJ  X.Y.177.2  724.5 GB   256  ?
> e269a06b-c0c0-43a6-922c-f04c98898e0d  RAC3
> UN  X.Y.81.2   337.83 GB  256  ?
> 09e29429-15ff-44d6-9742-ac95c83c4d9e  RAC2
> UN  X.Y.81.3   326.4 GB   256  ?
> feaa7b27-7ab8-4fc2-b64a-c9df3dd86d97  RAC2
> UN  X.Y.33.3   350.4 GB   256  ?
> cc115991-b7e7-4d06-87b5-8ad5efd45da5  RAC1
>
>
> *nodetool netstats -H | grep "Already received" -B 1*
> /X.Y.81.4
> Receiving 1992 files, 103.68 GB total. Already received 515 files,
> 23.32 GB total
> --
> /X.Y.81.7
> Receiving 1936 files, 89.35 GB total. Already received 554 files,
> 23.32 GB total
> --
> /X.Y.81.5
> Receiving 1926 files, 95.69 GB total. Already received 545 files,
> 23.31 GB total
> --
> /X.Y.81.2
> Receiving 1992 files, 100.81 GB total. Already received 537 files,
> 23.32 GB total
> --
> /X.Y.81.3
> Receiving 1958 files, 104.72 GB total. Already received 503 files,
> 23.31 GB total
> --
> /X.Y.81.1
> Receiving 2034 files, 104.51 GB total. Already received 520 files,
> 23.33 GB total
> --
> /X.Y.81.6
> Receiving 1962 files, 96.19 GB total. Already received 547 files,
> 23.32 GB total
> --
> /X.Y.33.5
> Receiving 2121 files, 97.44 GB total. Already received 601 files,
> 23.32 GB total
>
> *nodetool tpstats*
> Pool Name               Active   Pending   Completed   Blocked   All time blocked
> MutationStage                0         0   828367015         0                  0
> ViewMutationStage            0         0           0         0                  0
> ReadStage                    0         0           0         0                  0
> RequestResponseStage         0         0          13         0                  0
> ReadRepairStage              0         0           0         0                  0
> CounterMutationStage         0         0           0         0                  0
> MiscStage                    0         0           0

Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-15 Thread Stefano Ortolani
  0
HINT 0
MUTATION 1
COUNTER_MUTATION 0
BATCH_STORE  0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR  0

*nodetool compactionstats -H*
pending tasks: 776
 id                                     compaction type   keyspace     table     completed   total     unit    progress
 24d039f2-b1e6-11e7-ac57-3d25e38b2f5c   Compaction        keyspace_1   table_1   4.85 GB     7.67 GB   bytes   63.25%
Active compaction remaining time :   n/a


On Sun, Oct 15, 2017 at 9:27 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> Can you post (anonymize as needed) nodetool status, nodetool netstats,
> nodetool tpstats, and nodetool compctionstats ?
>
> --
> Jeff Jirsa
>
>
> On Oct 15, 2017, at 1:14 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi Jeff,
>
> that would be 3.0.15, single disk, vnodes enabled (num_tokens 256).
>
> Stefano
>
> On Sun, Oct 15, 2017 at 9:11 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> What version?
>>
>> Single disk or JBOD?
>>
>> Vnodes?
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 15, 2017, at 12:49 PM, Stefano Ortolani <ostef...@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> I have been trying "-Dcassandra.disable_stcs_in_l0=true", but no luck so
>> far.
>> Based on the source code it seems that this option doesn't affect
>> compactions while bootstrapping.
>>
>> I am getting quite confused, as it seems I am not able to bootstrap a node
>> unless I have at least 6-7 times the disk space used by the other nodes.
>> This is weird. The host I am bootstrapping is using an SSD. Also,
>> compaction throughput is unthrottled (set to 0) and the compacting threads
>> are set to 8.
>> Nevertheless, primary ranges from other nodes are being streamed, but
>> data is never compacted away.
>>
>> Does anybody know anything else I could try?
>>
>> Cheers,
>> Stefano
>>
>> On Fri, Oct 13, 2017 at 3:58 PM, Stefano Ortolani <ostef...@gmail.com>
>> wrote:
>>
>>> Other little update: at the same time I see the number of pending tasks
>>> stuck (in this case at 1847); restarting the node doesn't help, so I can't
>>> really force the node to "digest" all those compactions. In the meanwhile
>>> the disk occupied is already twice the average load I have on other nodes.
>>>
>>> Feeling more and more puzzled here :S
>>>
>>> On Fri, Oct 13, 2017 at 1:28 PM, Stefano Ortolani <ostef...@gmail.com>
>>> wrote:
>>>
>>>> I have been trying to add another node to the cluster (after upgrading
>>>> to 3.0.15) and I just noticed through "nodetool netstats" that all nodes
>>>> have been streaming to the joining node approx 1/3 of their SSTables,
>>>> basically their whole primary range (using RF=3)?
>>>>
>>>> Is this expected/normal?
>>>> I was under the impression only the necessary SSTables were going to be
>>>> streamed...
>>>>
>>>> Thanks for the help,
>>>> Stefano
>>>>
>>>>
>>>> On Wed, Aug 23, 2017 at 1:37 PM, kurt greaves <k...@instaclustr.com>
>>>> wrote:
>>>>
>>>>> But if it also streams, it means I'd still be under-pressure if I am
>>>>>> not mistaken. I am under the assumption that the compactions are the
>>>>>> by-product of streaming too many SStables at the same time, and not 
>>>>>> because
>>>>>> of my current write load.
>>>>>>
>>>>> Ah yeah I wasn't thinking about the capacity problem, more of the
>>>>> performance impact from the node being backed up with compactions. If you
>>>>> haven't already, you should try disable stcs in l0 on the joining node. 
>>>>> You
>>>>> will likely still need to do a lot of compactions, but generally they
>>>>> should be smaller. The  option is -Dcassandra.disable_stcs_in_l0=true
>>>>>
>>>>>>  I just noticed you were mentioning L1 tables too. Why would that
>>>>>> affect the disk footprint?
>>>>>
>>>>> If you've been doing a lot of STCS in L0, you generally end up with
>>>>> some large SSTables. These will eventually have to be compacted with L1.
>>>>> Could also be suffering the problem of streamed SSTables causing large
>>>>> cross-level compactions in the higher levels as well.
>>>>> ​
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-15 Thread Stefano Ortolani
Hi Jeff,

that would be 3.0.15, single disk, vnodes enabled (num_tokens 256).

Stefano

On Sun, Oct 15, 2017 at 9:11 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> What version?
>
> Single disk or JBOD?
>
> Vnodes?
>
> --
> Jeff Jirsa
>
>
> On Oct 15, 2017, at 12:49 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi all,
>
> I have been trying "-Dcassandra.disable_stcs_in_l0=true", but no luck so
> far.
> Based on the source code it seems that this option doesn't affect
> compactions while bootstrapping.
>
> I am getting quite confused, as it seems I am not able to bootstrap a node
> unless I have at least 6-7 times the disk space used by the other nodes.
> This is weird. The host I am bootstrapping is using an SSD. Also, compaction
> throughput is unthrottled (set to 0) and the compacting threads are set to
> 8.
> Nevertheless, primary ranges from other nodes are being streamed, but data
> is never compacted away.
>
> Does anybody know anything else I could try?
>
> Cheers,
> Stefano
>
> On Fri, Oct 13, 2017 at 3:58 PM, Stefano Ortolani <ostef...@gmail.com>
> wrote:
>
>> Other little update: at the same time I see the number of pending tasks
>> stuck (in this case at 1847); restarting the node doesn't help, so I can't
>> really force the node to "digest" all those compactions. In the meanwhile
>> the disk occupied is already twice the average load I have on other nodes.
>>
>> Feeling more and more puzzled here :S
>>
>> On Fri, Oct 13, 2017 at 1:28 PM, Stefano Ortolani <ostef...@gmail.com>
>> wrote:
>>
>>> I have been trying to add another node to the cluster (after upgrading
>>> to 3.0.15) and I just noticed through "nodetool netstats" that all nodes
>>> have been streaming to the joining node approx 1/3 of their SSTables,
>>> basically their whole primary range (using RF=3)?
>>>
>>> Is this expected/normal?
>>> I was under the impression only the necessary SSTables were going to be
>>> streamed...
>>>
>>> Thanks for the help,
>>> Stefano
>>>
>>>
>>> On Wed, Aug 23, 2017 at 1:37 PM, kurt greaves <k...@instaclustr.com>
>>> wrote:
>>>
>>>> But if it also streams, it means I'd still be under-pressure if I am
>>>>> not mistaken. I am under the assumption that the compactions are the
>>>>> by-product of streaming too many SStables at the same time, and not 
>>>>> because
>>>>> of my current write load.
>>>>>
>>>> Ah yeah I wasn't thinking about the capacity problem, more of the
>>>> performance impact from the node being backed up with compactions. If you
>>>> haven't already, you should try disable stcs in l0 on the joining node. You
>>>> will likely still need to do a lot of compactions, but generally they
>>>> should be smaller. The  option is -Dcassandra.disable_stcs_in_l0=true
>>>>
>>>>>  I just noticed you were mentioning L1 tables too. Why would that
>>>>> affect the disk footprint?
>>>>
>>>> If you've been doing a lot of STCS in L0, you generally end up with
>>>> some large SSTables. These will eventually have to be compacted with L1.
>>>> Could also be suffering the problem of streamed SSTables causing large
>>>> cross-level compactions in the higher levels as well.
>>>> ​
>>>>
>>>
>>>
>>
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-15 Thread Stefano Ortolani
Hi all,

I have been trying "-Dcassandra.disable_stcs_in_l0=true", but no luck so
far.
Based on the source code it seems that this option doesn't affect
compactions while bootstrapping.

I am getting quite confused, as it seems I am not able to bootstrap a node
unless I have at least 6-7 times the disk space used by the other nodes.
This is weird. The host I am bootstrapping is using an SSD. Also, compaction
throughput is unthrottled (set to 0) and the compacting threads are set to
8.
Nevertheless, primary ranges from other nodes are being streamed, but data
is never compacted away.

Does anybody know anything else I could try?

Cheers,
Stefano

On Fri, Oct 13, 2017 at 3:58 PM, Stefano Ortolani <ostef...@gmail.com>
wrote:

> Other little update: at the same time I see the number of pending tasks
> stuck (in this case at 1847); restarting the node doesn't help, so I can't
> really force the node to "digest" all those compactions. In the meanwhile
> the disk occupied is already twice the average load I have on other nodes.
>
> Feeling more and more puzzled here :S
>
> On Fri, Oct 13, 2017 at 1:28 PM, Stefano Ortolani <ostef...@gmail.com>
> wrote:
>
>> I have been trying to add another node to the cluster (after upgrading to
>> 3.0.15) and I just noticed through "nodetool netstats" that all nodes have
>> been streaming to the joining node approx 1/3 of their SSTables, basically
>> their whole primary range (using RF=3)?
>>
>> Is this expected/normal?
>> I was under the impression only the necessary SSTables were going to be
>> streamed...
>>
>> Thanks for the help,
>> Stefano
>>
>>
>> On Wed, Aug 23, 2017 at 1:37 PM, kurt greaves <k...@instaclustr.com>
>> wrote:
>>
>>> But if it also streams, it means I'd still be under-pressure if I am not
>>>> mistaken. I am under the assumption that the compactions are the by-product
>>>> of streaming too many SStables at the same time, and not because of my
>>>> current write load.
>>>>
>>> Ah yeah I wasn't thinking about the capacity problem, more of the
>>> performance impact from the node being backed up with compactions. If you
>>> haven't already, you should try disable stcs in l0 on the joining node. You
>>> will likely still need to do a lot of compactions, but generally they
>>> should be smaller. The  option is -Dcassandra.disable_stcs_in_l0=true
>>>
>>>>  I just noticed you were mentioning L1 tables too. Why would that
>>>> affect the disk footprint?
>>>
>>> If you've been doing a lot of STCS in L0, you generally end up with some
>>> large SSTables. These will eventually have to be compacted with L1. Could
>>> also be suffering the problem of streamed SSTables causing large
>>> cross-level compactions in the higher levels as well.
>>> ​
>>>
>>
>>
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-13 Thread Stefano Ortolani
Another little update: at the same time I see the number of pending tasks
stuck (in this case at 1847); restarting the node doesn't help, so I can't
really force the node to "digest" all those compactions. In the meantime,
the disk space occupied is already twice the average load I have on other nodes.

Feeling more and more puzzled here :S

On Fri, Oct 13, 2017 at 1:28 PM, Stefano Ortolani <ostef...@gmail.com>
wrote:

> I have been trying to add another node to the cluster (after upgrading to
> 3.0.15) and I just noticed through "nodetool netstats" that all nodes have
> been streaming to the joining node approx 1/3 of their SSTables, basically
> their whole primary range (using RF=3)?
>
> Is this expected/normal?
> I was under the impression only the necessary SSTables were going to be
> streamed...
>
> Thanks for the help,
> Stefano
>
>
> On Wed, Aug 23, 2017 at 1:37 PM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> But if it also streams, it means I'd still be under-pressure if I am not
>>> mistaken. I am under the assumption that the compactions are the by-product
>>> of streaming too many SStables at the same time, and not because of my
>>> current write load.
>>>
>> Ah yeah I wasn't thinking about the capacity problem, more of the
>> performance impact from the node being backed up with compactions. If you
>> haven't already, you should try disable stcs in l0 on the joining node. You
>> will likely still need to do a lot of compactions, but generally they
>> should be smaller. The  option is -Dcassandra.disable_stcs_in_l0=true
>>
>>>  I just noticed you were mentioning L1 tables too. Why would that affect
>>> the disk footprint?
>>
>> If you've been doing a lot of STCS in L0, you generally end up with some
>> large SSTables. These will eventually have to be compacted with L1. Could
>> also be suffering the problem of streamed SSTables causing large
>> cross-level compactions in the higher levels as well.
>> ​
>>
>
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-10-13 Thread Stefano Ortolani
I have been trying to add another node to the cluster (after upgrading to
3.0.15) and I just noticed through "nodetool netstats" that all nodes have
been streaming to the joining node approx 1/3 of their SSTables, basically
their whole primary range (using RF=3)?

Is this expected/normal?
I was under the impression only the necessary SSTables were going to be
streamed...

Thanks for the help,
Stefano


On Wed, Aug 23, 2017 at 1:37 PM, kurt greaves  wrote:

> But if it also streams, it means I'd still be under-pressure if I am not
>> mistaken. I am under the assumption that the compactions are the by-product
>> of streaming too many SStables at the same time, and not because of my
>> current write load.
>>
> Ah yeah, I wasn't thinking about the capacity problem, more of the
> performance impact from the node being backed up with compactions. If you
> haven't already, you should try disabling STCS in L0 on the joining node. You
> will likely still need to do a lot of compactions, but generally they
> should be smaller. The option is -Dcassandra.disable_stcs_in_l0=true
>
>>  I just noticed you were mentioning L1 tables too. Why would that affect
>> the disk footprint?
>
> If you've been doing a lot of STCS in L0, you generally end up with some
> large SSTables. These will eventually have to be compacted with L1. Could
> also be suffering the problem of streamed SSTables causing large
> cross-level compactions in the higher levels as well.
> ​
>


Re: Wide rows splitting

2017-09-18 Thread Stefano Ortolani
You might find this interesting:
https://medium.com/@foundev/synthetic-sharding-in-cassandra-to-deal-with-large-partitions-2124b2fd788b

Cheers,
Stefano

On Mon, Sep 18, 2017 at 5:07 AM, Adam Smith  wrote:

> Dear community,
>
> I have a table with inlinks to URLs, i.e. many URLs point to
> http://google.com, while fewer URLs point to http://somesmallweb.page.
>
> It has very wide and very skinny rows - the distribution follows a
> power law. I do not know a priori how many columns a row has. Also, I can't
> identify a schema that would introduce a good partitioning.
>
> Currently, I am thinking about introducing splits: the pk would be (URL,
> splitnumber), where splitnumber is initially 1 and hash(URL) mod
> splitnumber would determine the split on insert. I would need a
> separate table to maintain the splitnumber, and a spark-cassandra-connector
> job would count the columns and increase/double the number of splits on
> demand. This means that I would then have to move e.g. (URL1,0) -> (URL1,1)
> when splitnumber becomes 2.
>
> Would you do the same? Is there a better way?
>
> Thanks!
> Adam
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-08-23 Thread Stefano Ortolani
Hi Kurt,

On Wed, Aug 23, 2017 at 11:32 AM, kurt greaves  wrote:

>
> ​1) You mean restarting the node in the middle of the bootstrap with
>> join_ring=false? Would this option require me to issue a nodetool boostrap
>> resume, correct? I didn't know you could instruct the join via JMX. Would
>> it be the same of the nodetool boostrap command?
>
> write_survey is slightly different to join_ring. TBH I haven't used
> write_survey myself for this purpose, but I believe it should work. The
> code certainly indicates that it will bootstrap and stream, but just not
> join the ring until you trigger a joinRing with JMX. (You'll have to look
> up the specific JMX call to make, there is one somewhere...)
>

But if it also streams, it means I'd still be under pressure, if I am not
mistaken. I am under the assumption that the compactions are the by-product
of streaming too many SSTables at the same time, and not of my
current write load.


>
> 2) Yes, they are streamed into L0 as far as I can see (although they are
>> mainly L2/L3 on the existing nodes). I don't think I can do much about it
>> as long as I am using LCS, am I right?
>
>  Code seems to imply that it should always keep SSTable levels. Sounds
> like something is wrong if it is not :/.
>

I don't have the logs anymore (the machine had other issues and I am replacing
its disks), so I can't be sure of this. I definitely remember a high volume of
"Bootstrapping - doing STCS in L0" messages, which alone does not imply,
though, that all sstables were streamed at L0. I
just noticed you were mentioning L1 tables too. Why would that affect the
disk footprint?

Thanks!
Stefano


Re: Bootstrapping a node fails because of compactions not keeping up

2017-08-23 Thread Stefano Ortolani
Hi Kurt,

1) You mean restarting the node in the middle of the bootstrap with
join_ring=false? Would this option require me to issue a nodetool bootstrap
resume? I didn't know you could instruct the join via JMX. Would
it be the same as the nodetool bootstrap command?
2) Yes, they are streamed into L0 as far as I can see (although they are
mainly L2/L3 on the existing nodes). I don't think I can do much about it
as long as I am using LCS, am I right?

Thanks for the help, much appreciated!

Cheers,
Stefano

On Wed, Aug 23, 2017 at 10:52 AM, kurt greaves  wrote:

> Well, that sucks. I'd be interested if you could find out whether any of the
> streamed SSTables are retaining their levels.
> To answer your questions:
> 1) No.  However, you could set your nodes to join in write_survey mode,
> which will stop them from joining the ring and you can initiate the join
> over JMX when you are ready (once compactions catch up).
> 2) A 1TB compaction might be completely plausible if you are using
> compression and lots of SSTables are being streamed into L0 and triggering
> STCS compactions, or being compacted with L1.
>
> ​
>


Re: Bootstrapping a node fails because of compactions not keeping up

2017-08-23 Thread Stefano Ortolani
Hi Kurt,
sorry, I forgot to specify. I am on 3.0.14.

Cheers,
Stefano

On Wed, Aug 23, 2017 at 12:11 AM, kurt greaves  wrote:

> What version are you running? 2.2 has an improvement that will retain
> levels when streaming and this shouldn't really happen. If you're on 2.1
> best bet is to upgrade
>


Bootstrapping a node fails because of compactions not keeping up

2017-08-22 Thread Stefano Ortolani
Hi all,

I am trying to bootstrap a node, without success, because it keeps running out
of space.
The average node size is 260 GB with lots of LCS tables (overall data ~3.5 TB).
Each node is also configured with a 1 TB disk, including the bootstrapping
node.

After 12 hours the bootstrapping node fails with more than 2000 compactions
pending.
I have tried to increase the compaction threads and un-throttle the compaction
throughput, but no luck.
I've also tried to reduce the streaming throughput (on all nodes) as much
as possible (1 Mbit/s).

The thing is that even if I manage to reduce the streaming throughput,
compactions still pile up, leaving me with no choice other than
pausing the whole streaming process altogether, which AFAIK is not possible
:/

I have searched the mailing list and this
https://groups.google.com/forum/#!topic/nosql-databases/GbcFMUUJ7XU reminds
me of my current situation. Unfortunately I am still left with the
following two questions:

1) Is it possible to pause all streams so as to give the bootstrapping node
enough time to catch up?
2) I don't understand why the disk fills up so fast. Considering the number
of LCS tables, I was even ready to blame LCS and the fact that the first
compaction at L0 is done with STCS, but 1 TB is way more than twice the
amount of data the node should own in theory, so something else might be
responsible for the over-streaming.

Thanks in advance!
Stefano Ortolani


Re: Is it safe to upgrade 2.2.6 to 3.0.13?

2017-05-20 Thread Stefano Ortolani
Hi Varun,

can you elaborate a bit more? I have seen a schema change being pushed, but
that was just the first restart. After that, everything has been smooth so
far (including several schema changes, all of them verified via "describe").

Thanks!
Stefano

On Sat, May 20, 2017 at 2:10 AM, Varun Gupta <var...@uber.com> wrote:

> We upgraded from 2.2.5 to 3.0.11 and it works fine. I would suggest not
> going with 3.0.13; we are seeing some issues with schema mismatch due to which
> we had to roll back to 3.0.11.
>
> Thanks,
> Varun
>
> On May 19, 2017, at 7:43 AM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Here (https://github.com/apache/cassandra/blob/cassandra-3.0/NEWS.txt) is
> stated that the minimum supported version for the 2.2.X branch is 2.2.2.
>
> On Fri, May 19, 2017 at 2:16 PM, Nicolas Guyomar <
> nicolas.guyo...@gmail.com> wrote:
>
>> Hi Xihui,
>>
>> I was looking for this documentation also, but I believe datastax removed
>> it, and it is not available yet on the apache website
>>
>> As far as I remember, intermediate version was needed if  C* Version <
>> 2.1.7.
>>
>> You should be safe starting from 2.2.6, but testing the upgrade on a
>> dedicated platform is always a good idea.
>>
>> Nicolas
>>
>> On 19 May 2017 at 09:02, Xihui He <xihu...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We are planning to upgrade our production cluster to 3.x, but I can't
>>> find the upgrade guide anymore.
>>> Can I upgrade to 3.0.13 from 2.2.6 directly? Is an interim version
>>> necessary?
>>>
>>> Thanks,
>>> Xihui
>>>
>>
>>
>


Re: Is it safe to upgrade 2.2.6 to 3.0.13?

2017-05-19 Thread Stefano Ortolani
Here (https://github.com/apache/cassandra/blob/cassandra-3.0/NEWS.txt) it is
stated that the minimum supported version for the 2.2.X branch is 2.2.2.

On Fri, May 19, 2017 at 2:16 PM, Nicolas Guyomar 
wrote:

> Hi Xihui,
>
> I was looking for this documentation also, but I believe datastax removed
> it, and it is not available yet on the apache website
>
> As far as I remember, intermediate version was needed if  C* Version <
> 2.1.7.
>
> You should be safe starting from 2.2.6, but testing the upgrade on a
> dedicated platform is always a good idea.
>
> Nicolas
>
> On 19 May 2017 at 09:02, Xihui He  wrote:
>
>> Hi All,
>>
>> We are planning to upgrade our production cluster to 3.x, but I can't
>> find the upgrade guide anymore.
>> Can I upgrade to 3.0.13 from 2.2.6 directly? Is an interim version
>> necessary?
>>
>> Thanks,
>> Xihui
>>
>
>


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
But it should skip those records since they are sorted. My understanding
would be something like:

1) read sstable 2
2) read the range tombstone
3) skip records from sstable2 and sstable1 within the range boundaries
4) read remaining records from sstable1
5) no records, return
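To make the expectation above concrete, here is a toy model (plain Python,
emphatically not Cassandra's merge code, and it ignores per-cell write
timestamps, which is exactly the complication Hannu raises below): with rows
sorted by clustering key, a range tombstone covering the oldest part of the
partition lets a reader seek past the deleted prefix instead of scanning it.

import bisect

# Two "sstables" holding (timeid, value) rows sorted by clustering key;
# plain ints stand in for timeuuids.
sstable1 = [(t, "row-%d" % t) for t in range(0, 1000)]
sstable2 = [(t, "row-%d" % t) for t in range(1000, 2000)]

# Range tombstone (from the most recently flushed sstable): everything
# with timeid < 1500 is deleted.
tombstone_upper_bound = 1500

def live_rows(sstables, tombstone_upper_bound):
    merged = sorted(row for table in sstables for row in table)
    # Seek (binary search) to the end of the deleted range rather than
    # scanning and discarding every shadowed row.
    start = bisect.bisect_left(merged, (tombstone_upper_bound,))
    return merged[start:]

rows = live_rows([sstable1, sstable2], tombstone_upper_bound)
print(rows[0], len(rows))   # (1500, 'row-1500') 500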

On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger <hkro...@gmail.com> wrote:

> This is a bit of guessing but it probably reads sstables in some sort of
> sequence, so even if sstable 2 contains the tombstone, it still scans
> through the sstable 1 for possible data to be read.
>
> BR,
> Hannu
>
> On 16 May 2017, at 19:40, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Little update: also the following query timeouts, which is weird since the
> range tombstone should have been read by then...
>
> SELECT *
> FROM test_cql.test_cf
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
>
>
>
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com>
> wrote:
>
>> Yes, that was my intention but I wanted to cross-check with the ML and
>> the devs keeping an eye on it first.
>>
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>
>>> Well,
>>>
>>> sstables contain some statistics about the cell timestamps and using
>>> that information and the tombstone timestamp it might be possible to skip
>>> some data but I’m not sure that Cassandra currently does that. Maybe it
>>> would be worth a JIRA ticket and see what the devs think about it. If
>>> optimizing this case would make sense.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> Hi Hannu,
>>>
>>> the piece of data in question is older. In my example the tombstone is
>>> the newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges,
>>> and the data is clustering key sorted, I would expect a linear scan not to
>>> be necessary.
>>>
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>
>>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>>> of data in a partition is newer than the tombstone, then it cannot be
>>>> skipped. Therefore some partition level statistics of cell ages would need
>>>> to be kept in the column index for the skipping and that is probably not
>>>> there.
>>>>
>>>> Hannu
>>>>
>>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>>
>>>> That is another way to see the question: are reverse iterators range
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by this afore-mentioned behavior.
>>>> I would expect them to handle this case more gracefully.
>>>>
>>>> Cheers,
>>>> Stefano
>>>>
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com>
>>>> wrote:
>>>>
>>>>> Hannu,
>>>>>
>>>>> How can you read a partition in reverse?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>>> >
>>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>>> tombstone is useful for this or not.
>>>>> >
>>>>> > In many cases it might be that the partition contains data that is
>>>>> within the range of the tombstone but is newer than the tombstone and
>>>>> therefore it might be still be returned. Scanning through deleted data can
>>>>> be avoided by reading the partition in reverse (if all the deleted data is
>>>>> in the beginning of the partition). Eventually you will still end up
>>>>> reading a lot of tombstones but you will get a lot of live data first and
>>>>> the implicit query limit of 1 probably is reached before you get to 
>>>>> the
>>>>> tombstones. Therefore you will get an immediate answer.
>>>>> >
>>>>> > Does it make sense?
>>>>> >
>>>>> > Hannu
>>>>> >
>>>>> >> On 16 May 2017, at 16:33, Stefa

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Little update: the following query also times out, which is weird since the
range tombstone should have been read by then...

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
AND timeid < the_oldest_deleted_timeid
ORDER BY timeid DESC;



On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com>
wrote:

> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth a JIRA ticket and see what the devs think about it. If optimizing
>> this case would make sense.
>>
>> Hannu
>>
>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is
>> the newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>> of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition level statistics of cell ages would need
>>> to be kept in the column index for the skipping and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>>
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>> tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is
>>>> within the range of the tombstone but is newer than the tombstone and
>>>> therefore it might be still be returned. Scanning through deleted data can
>>>> be avoided by reading the partition in reverse (if all the deleted data is
>>>> in the beginning of the partition). Eventually you will still end up
>>>> reading a lot of tombstones but you will get a lot of live data first and
>>>> the implicit query limit of 1 probably is reached before you get to the
>>>> tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected hence
>>>> the message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as
>>>> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>  hash blob,
>>>> >>  timeid timeuuid,
>>>> >>  PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>&g

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Yes, that was my intention but I wanted to cross-check with the ML and the
devs keeping an eye on it first.

On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:

> Well,
>
> sstables contain some statistics about the cell timestamps and using that
> information and the tombstone timestamp it might be possible to skip some
> data but I’m not sure that Cassandra currently does that. Maybe it would be
> worth a JIRA ticket and see what the devs think about it. If optimizing
> this case would make sense.
>
> Hannu
>
> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>> skip bigger regions of deleted data based on range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>>
>> Hannu
>>
>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> Sent from my iPhone
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might be still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 1 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes
>>> more than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>>

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
No, because C* has reverse iterators.

On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth <ni...@bamlabs.com> wrote:

> If the data is stored in ASC order and query asks for DESC, then wouldn’t
> it read whole partition in first and then pick data from reverse order?
>
>
> On May 16, 2017, at 10:03 AM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>> skip bigger regions of deleted data based on range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>>
>> Hannu
>>
>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> Sent from my iPhone
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might be still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 1 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes
>>> more than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just
>>> because the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is
>>> flushed at node restart), and it wastes time reading all rows that are
>>> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>>> >
>>>
>>
>>
>>
>
>


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Hi Hannu,

the piece of data in question is older. In my example the tombstone is the
newest piece of data.
Since a range tombstone has information re the clustering key ranges, and
the data is clustering key sorted, I would expect a linear scan not to be
necessary.

On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:

> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip
> bigger regions of deleted data based on range tombstone. If some piece of
> data in a partition is newer than the tombstone, then it cannot be skipped.
> Therefore some partition level statistics of cell ages would need to be
> kept in the column index for the skipping and that is probably not there.
>
> Hannu
>
> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> That is another way to see the question: are reverse iterators range
> tombstone aware? Yes.
> That is why I am puzzled by this afore-mentioned behavior.
> I would expect them to handle this case more gracefully.
>
> Cheers,
> Stefano
>
> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>
>> Hannu,
>>
>> How can you read a partition in reverse?
>>
>> Sent from my iPhone
>>
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range
>> tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is
>> within the range of the tombstone but is newer than the tombstone and
>> therefore it might be still be returned. Scanning through deleted data can
>> be avoided by reading the partition in reverse (if all the deleted data is
>> in the beginning of the partition). Eventually you will still end up
>> reading a lot of tombstones but you will get a lot of live data first and
>> the implicit query limit of 1 probably is reached before you get to the
>> tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide
>> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence
>> the message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as
>> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
>> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>> _half_ of that partition by executing the query below, and restart the node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes
>> more than 10 seconds to
>> >> succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted
>> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just
>> because the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range
>> tombstone until C* reads the
>> >> last sstables (which actually contains the range tombstone and is
>> flushed at node restart), and it wastes time reading all rows that are
>> actually not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these
>> cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>> >
>>
>
>
>


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
That is another way to see the question: are reverse iterators range
tombstone aware? Yes.
That is why I am puzzled by this afore-mentioned behavior.
I would expect them to handle this case more gracefully.

Cheers,
Stefano

On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:

> Hannu,
>
> How can you read a partition in reverse?
>
> Sent from my iPhone
>
> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
> >
> > Well, I’m guessing that Cassandra doesn't really know if the range
> tombstone is useful for this or not.
> >
> > In many cases it might be that the partition contains data that is
> within the range of the tombstone but is newer than the tombstone and
> therefore it might be still be returned. Scanning through deleted data can
> be avoided by reading the partition in reverse (if all the deleted data is
> in the beginning of the partition). Eventually you will still end up
> reading a lot of tombstones but you will get a lot of live data first and
> the implicit query limit of 1 probably is reached before you get to the
> tombstones. Therefore you will get an immediate answer.
> >
> > Does it make sense?
> >
> > Hannu
> >
> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I am seeing inconsistencies when mixing range tombstones, wide
> partitions, and reverse iterators.
> >> I still have to understand if the behaviour is to be expected hence the
> message on the mailing list.
> >>
> >> The situation is conceptually simple. I am using a table defined as
> follows:
> >>
> >> CREATE TABLE test_cql.test_cf (
> >>  hash blob,
> >>  timeid timeuuid,
> >>  PRIMARY KEY (hash, timeid)
> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
> >>
> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
> _half_ of that partition by executing the query below, and restart the node:
> >>
> >> DELETE
> >> FROM test_cql.test_cf
> >> WHERE hash = x AND timeid < y;
> >>
> >> If I keep compactions disabled the following query times out (takes more
> than 10 seconds to
> >> succeed):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid ASC;
> >>
> >> While the following returns immediately (obviously because no deleted
> data is ever read):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid DESC;
> >>
> >> If I force a compaction the problem is gone, but I presume just because
> the data is rearranged.
> >>
> >> It seems to me that reading by ASC does not make use of the range
> tombstone until C* reads the
> >> last sstable (which actually contains the range tombstone and is
> flushed at node restart), and it wastes time reading all rows that are
> actually not live anymore.
> >>
> >> Is this expected? Should the range tombstone actually help in these
> cases?
> >>
> >> Thanks a lot!
> >> Stefano
> >
> >
> >
>
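
To make the "read the partition in reverse" point above concrete, here is a
minimal sketch (it reuses the test_cql.test_cf table quoted below; the _desc
table is a hypothetical variant added only for illustration):

# Reverse the iteration order per query:
cqlsh -e "SELECT * FROM test_cql.test_cf
          WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
          ORDER BY timeid DESC LIMIT 100;"

# Or bake the reverse order into the schema, so that the default (and cheapest)
# iteration order is newest-first:
cqlsh -e "CREATE TABLE test_cql.test_cf_desc (
            hash blob,
            timeid timeuuid,
            PRIMARY KEY (hash, timeid)
          ) WITH CLUSTERING ORDER BY (timeid DESC);"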


Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Hi all,

I am seeing inconsistencies when mixing range tombstones, wide partitions,
and reverse iterators.
I still have to understand if the behaviour is to be expected hence the
message on the mailing list.

The situation is conceptually simple. I am using a table defined as follows:

CREATE TABLE test_cql.test_cf (
  hash blob,
  timeid timeuuid,
  PRIMARY KEY (hash, timeid)
) WITH CLUSTERING ORDER BY (timeid ASC)
  AND compaction = {'class' : 'LeveledCompactionStrategy'};

I then proceed by loading 2/3GB from 3 sstables which I know contain a
really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
_half_ of that partition by executing the query below, and restart the node:

DELETE
FROM test_cql.test_cf
WHERE hash = x AND timeid < y;

If I keep compactions disabled the following query times out (takes more
than 10 seconds to
succeed):

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid ASC;

While the following returns immediately (obviously because no deleted data
is ever read):

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid DESC;

If I force a compaction the problem is gone, but I presume just because the
data is rearranged.

It seems to me that reading by ASC does not make use of the range tombstone
until C* reads the
last sstable (which actually contains the range tombstone and is flushed
at node restart), and it wastes time reading all rows that are actually not
live anymore.

Is this expected? Should the range tombstone actually help in these cases?

Thanks a lot!
Stefano
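
A hedged diagnostic sketch for the scenario above: it checks in which sstables
the wide partition (and the range tombstone) actually end up. Paths are
illustrative, sstabledump only ships with recent 3.x builds, and getsstables
may need the partition key formatted differently for blob keys:

nodetool flush test_cql test_cf        # make sure the range tombstone is on disk
nodetool getsstables test_cql test_cf 963204d451de3e611daf5e340c3594acead0eaaf
# dump one of the returned files and look for the range tombstone markers
sstabledump /path/to/one/of/the/returned-Data.db | grep -n "range_tombstone_bound"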


Re: LCS, range tombstones, and eviction

2017-05-16 Thread Stefano Ortolani
That makes sense.
I see however some unexpected performance data on my test, but I will start
another thread for that.

Thanks again!

On Fri, May 12, 2017 at 6:56 PM, Blake Eggleston <beggles...@apple.com>
wrote:

> The start and end points of a range tombstone are basically stored as
> special purpose rows alongside the normal data in an sstable. As part of a
> read, they're reconciled with the data from the other sstables into a
> single partition, just like the other rows. The only difference is that
> they don't contain any 'real' data, and, of course, they prevent 'deleted'
> data from being returned in the read. It's a bit more complicated than
> that, but that's the general idea.
>
>
> On May 12, 2017 at 6:23:01 AM, Stefano Ortolani (ostef...@gmail.com)
> wrote:
>
> Thanks a lot Blake, that definitely helps!
>
> I actually found a ticket re range tombstones and how they are accounted
> for: https://issues.apache.org/jira/browse/CASSANDRA-8527
>
> I am wondering now what happens when a node receives a read request. Are
> the range tombstones read before scanning the SStables? More interestingly,
> given that a single partition might be split across different levels, and
> that some range tombstones might be in L0 while all the rest of the data in
> L1, are all the tombstones prefetched from _all_ the involved SStables
> before doing any table scan?
>
> Regards,
> Stefano
>
> On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <beggles...@apple.com>
> wrote:
>
>> Hi Stefano,
>>
>> Based on what I understood reading the docs, if the ratio of garbage
>> collectable tombstones exceeds the "tombstone_threshold", C* should start
>> compacting and evicting.
>>
>>
>> If there are no other normal compaction tasks to be run, LCS will attempt
>> to compact the sstables it estimates it will be able to drop the most
>> tombstones from. It does this by estimating the number of tombstones an
>> sstable has that have passed the gc grace period. Whether or not a
>> tombstone will actually be evicted is more complicated. Even if a tombstone
>> has passed gc grace, it can't be dropped if the data it's deleting still
>> exists in another sstable, otherwise the data would appear to return. So, a
>> tombstone won't be dropped if there is data for the same partition in other
>> sstables that is older than the tombstone being evaluated for eviction.
>>
>> I am quite puzzled however by what might happen when dealing with range
>> tombstones. In that case a single tombstone might actually stand for an
>> arbitrary number of normal tombstones. In other words, do range
>> tombstones
>> contribute to the "tombstone_threshold"? If so, how?
>>
>>
>> From what I can tell, each end of the range tombstone is counted as a
>> single tombstone. So a range tombstone effectively contributes
>> '2' to the count of tombstones for an sstable. I'm not 100% sure, but I
>> haven't seen any sstable writing logic that tracks open tombstones and
>> counts covered cells as tombstones. So, it's likely that the effect of
>> range tombstones covering many rows is underrepresented in the droppable
>> tombstone estimate.
>>
>> I am also a bit confused by the "tombstone_compaction_interval". If I am
>> dealing with a big partition in LCS which is receiving new records every
>> day,
>> and a weekly incremental repair job continuously anticompacting the data
>> and
>> thus creating SStables, what is the likelihood of the default interval
>> (10 days) to be actually hit?
>>
>>
>> It will be hit, but probably only in the repaired data. Once the data is
>> marked repaired, it shouldn't be anticompacted again, and should get old
>> enough to pass the compaction interval. That shouldn't be an issue though,
>> because you should be running repair often enough that data is repaired
>> before it can ever get past the gc grace period. Otherwise you'll have
>> other problems. Also, keep in mind that tombstone eviction is a part of all
>> compactions, it's just that occasionally a compaction is run specifically
>> for that purpose. Finally, you probably shouldn't run incremental repair on
>> data that is deleted. There is a design flaw in the incremental repair used
>> in pre-4.0 of cassandra that can cause consistency issues. It can also
>> cause a *lot* of over streaming, so you might want to take a look at how
>> much streaming your cluster is doing with full repairs, and incremental
>> repairs. It might actually be more efficient to run full repairs.
>>
>> Hope that helps,
>>
>> 

Re: LCS, range tombstones, and eviction

2017-05-12 Thread Stefano Ortolani
Thanks a lot Blake, that definitely helps!

I actually found a ticket re range tombstones and how they are accounted
for: https://issues.apache.org/jira/browse/CASSANDRA-8527

I am wondering now what happens when a node receives a read request. Are
the range tombstones read before scanning the SStables? More interestingly,
given that a single partition might be split across different levels, and
that some range tombstones might be in L0 while all the rest of the data in
L1, are all the tombstones prefetched from _all_ the involved SStables
before doing any table scan?

Regards,
Stefano

On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston <beggles...@apple.com>
wrote:

> Hi Stefano,
>
> Based on what I understood reading the docs, if the ratio of garbage
> collectable tombstones exceeds the "tombstone_threshold", C* should start
> compacting and evicting.
>
>
> If there are no other normal compaction tasks to be run, LCS will attempt
> to compact the sstables it estimates it will be able to drop the most
> tombstones from. It does this by estimating the number of tombstones an
> sstable has that have passed the gc grace period. Whether or not a
> tombstone will actually be evicted is more complicated. Even if a tombstone
> has passed gc grace, it can't be dropped if the data it's deleting still
> exists in another sstable, otherwise the data would appear to return. So, a
> tombstone won't be dropped if there is data for the same partition in other
> sstables that is older than the tombstone being evaluated for eviction.
>
> I am quite puzzled however by what might happen when dealing with range
> tombstones. In that case a single tombstone might actually stand for an
> arbitrary number of normal tombstones. In other words, do range tombstones
> contribute to the "tombstone_threshold"? If so, how?
>
>
> From what I can tell, each end of the range tombstone is counted as a
> single tombstone. So a range tombstone effectively contributes
> '2' to the count of tombstones for an sstable. I'm not 100% sure, but I
> haven't seen any sstable writing logic that tracks open tombstones and
> counts covered cells as tombstones. So, it's likely that the effect of
> range tombstones covering many rows is underrepresented in the droppable
> tombstone estimate.
>
> I am also a bit confused by the "tombstone_compaction_interval". If I am
> dealing with a big partition in LCS which is receiving new records every
> day,
> and a weekly incremental repair job continuously anticompacting the data
> and
> thus creating SStables, what is the likelihood of the default interval
> (10 days) to be actually hit?
>
>
> It will be hit, but probably only in the repaired data. Once the data is
> marked repaired, it shouldn't be anticompacted again, and should get old
> enough to pass the compaction interval. That shouldn't be an issue though,
> because you should be running repair often enough that data is repaired
> before it can ever get past the gc grace period. Otherwise you'll have
> other problems. Also, keep in mind that tombstone eviction is a part of all
> compactions, it's just that occasionally a compaction is run specifically
> for that purpose. Finally, you probably shouldn't run incremental repair on
> data that is deleted. There is a design flaw in the incremental repair used
> in pre-4.0 of cassandra that can cause consistency issues. It can also
> cause a *lot* of over streaming, so you might want to take a look at how
> much streaming your cluster is doing with full repairs, and incremental
> repairs. It might actually be more efficient to run full repairs.
>
> Hope that helps,
>
> Blake
>
> On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostef...@gmail.com)
> wrote:
>
> Hi all,
>
> I am trying to wrap my head around how C* evicts tombstones when using LCS.
> Based on what I understood reading the docs, if the ratio of garbage
> collectable tombstones exceeds the "tombstone_threshold", C* should start
> compacting and evicting.
>
> I am quite puzzled however by what might happen when dealing with range
> tombstones. In that case a single tombstone might actually stand for an
> arbitrary number of normal tombstones. In other words, do range tombstones
> contribute to the "tombstone_threshold"? If so, how?
>
> I am also a bit confused by the "tombstone_compaction_interval". If I am
> dealing with a big partition in LCS which is receiving new records every
> day,
> and a weekly incremental repair job continuously anticompacting the data
> and
> thus creating SStables, what is the likelihood of the default interval
> (10 days) to be actually hit?
>
> Hopefully somebody will be able to shed some lights here!
>
> Thanks in advance!
> Stefano
>
>


LCS, range tombstones, and eviction

2017-05-11 Thread Stefano Ortolani
Hi all,

I am trying to wrap my head around how C* evicts tombstones when using LCS.
Based on what I understood reading the docs, if the ratio of garbage
collectable tombstones exceeds the "tombstone_threshold", C* should start
compacting and evicting.

I am quite puzzled however by what might happen when dealing with range
tombstones. In that case a single tombstone might actually stand for an
arbitrary number of normal tombstones. In other words, do range tombstones
contribute to the "tombstone_threshold"? If so, how?

I am also a bit confused by the "tombstone_compaction_interval". If I am
dealing with a big partition in LCS which is receiving new records every
day,
and a weekly incremental repair job continuously anticompacting the data and
thus creating SStables, what is the likelihood of the default interval
(10 days) to be actually hit?

Hopefully somebody will be able to shed some lights here!

Thanks in advance!
Stefano
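
For reference, the knobs discussed above are per-table compaction sub-options.
A minimal sketch of where they are set (keyspace/table names and the values
are purely illustrative, not recommendations):

cqlsh -e "ALTER TABLE my_keyspace.my_cf
          WITH compaction = {
            'class': 'LeveledCompactionStrategy',
            'tombstone_threshold': '0.2',
            'tombstone_compaction_interval': '86400',
            'unchecked_tombstone_compaction': 'false'
          };"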


Re: Incremental repairs leading to unrepaired data

2016-11-01 Thread Stefano Ortolani
That is not happening anymore since I am repairing a keyspace with
much less data (the other one is still there in write-only mode).
The command I am using is the most boring one (I even shed the -pr option
so as to keep anticompactions to a minimum): nodetool -h localhost repair

It's executed sequentially on each node (no overlap; the next node
waits for the previous one to complete).

Regards,
Stefano Ortolani
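
A minimal sketch of the schedule described above (host names are hypothetical;
each nodetool invocation blocks until that node's repair completes):

for host in node1 node2 node3; do
    nodetool -h "$host" repair my_keyspace
done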

On Mon, Oct 31, 2016 at 11:18 PM, kurt Greaves <k...@instaclustr.com> wrote:
> Blowing out to 1k SSTables seems a bit full on. What args are you passing to
> repair?
>
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>
> On 31 October 2016 at 09:49, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> I've collected some more data-points, and I still see dropped
>> mutations with compaction_throughput_mb_per_sec set to 8.
>> The only notable thing regarding the current setup is that I have
>> another keyspace (not being repaired though) with really wide rows
>> (100MB per partition), but that shouldn't have any impact in theory.
>> Nodes do not seem that overloaded either and don't see any GC spikes
>> while those mutations are dropped :/
>>
>> Hitting a dead end here; any idea where to look next?
>>
>> Regards,
>> Stefano
>>
>> On Wed, Aug 10, 2016 at 12:41 PM, Stefano Ortolani <ostef...@gmail.com>
>> wrote:
>> > That's what I was thinking. Maybe GC pressure?
>> > Some more details: during anticompaction I have some CFs exploding to 1K
>> > SStables (to be back to ~200 upon completion).
>> > HW specs should be quite good (12 cores/32 GB ram) but, I admit, still
>> > relying on spinning disks, with ~150GB per node.
>> > Current version is 3.0.8.
>> >
>> >
>> > On Wed, Aug 10, 2016 at 12:36 PM, Paulo Motta <pauloricard...@gmail.com>
>> > wrote:
>> >>
>> >> That's pretty low already, but perhaps you should lower it to see if it
>> >> will
>> >> improve the dropped mutations during anti-compaction (even if it
>> >> increases
>> >> repair time), otherwise the problem might be somewhere else. Generally
>> >> dropped mutations are a signal of cluster overload, so if there's
>> >> nothing
>> >> else wrong perhaps you need to increase your capacity. What version are
>> >> you
>> >> in?
>> >>
>> >> 2016-08-10 8:21 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>> >>>
>> >>> Not yet. Right now I have it set at 16.
>> >>> Would halving it more or less double the repair time?
>> >>>
>> >>> On Tue, Aug 9, 2016 at 7:58 PM, Paulo Motta <pauloricard...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Anticompaction throttling can be done by setting the usual
>> >>>> compaction_throughput_mb_per_sec knob on cassandra.yaml or via
>> >>>> nodetool
>> >>>> setcompactionthroughput. Did you try lowering that  and checking if
>> >>>> that
>> >>>> improves the dropped mutations?
>> >>>>
>> >>>> 2016-08-09 13:32 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>> >>>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> I am running incremental repairs on a weekly basis (can't do it
>> >>>>> every
>> >>>>> day as one single run takes 36 hours), and every time, I have at
>> >>>>> least one
>> >>>>> node dropping mutations as part of the process (this almost always
>> >>>>> during
>> >>>>> the anticompaction phase). Ironically this leads to a system where
>> >>>>> repairing
>> >>>>> makes data consistent at the cost of making some other data not
>> >>>>> consistent.
>> >>>>>
>> >>>>> Does anybody know why this is happening?
>> >>>>>
>> >>>>> My feeling is that this might be caused by anticompacting column
>> >>>>> families with really wide rows and with many SStables. If that is
>> >>>>> the case,
>> >>>>> any way I can throttle that?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Stefano
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>
>


Re: Incremental repairs leading to unrepaired data

2016-10-31 Thread Stefano Ortolani
I've collected some more data-points, and I still see dropped
mutations with compaction_throughput_mb_per_sec set to 8.
The only notable thing regarding the current setup is that I have
another keyspace (not being repaired though) with really wide rows
(100MB per partition), but that shouldn't have any impact in theory.
Nodes do not seem that overloaded either and don't see any GC spikes
while those mutations are dropped :/

Hitting a dead end here; any idea where to look next?

Regards,
Stefano

On Wed, Aug 10, 2016 at 12:41 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
> That's what I was thinking. Maybe GC pressure?
> Some more details: during anticompaction I have some CFs exploding to 1K
> SStables (which go back to ~200 upon completion).
> HW specs should be quite good (12 cores/32 GB ram) but, I admit, still
> relying on spinning disks, with ~150GB per node.
> Current version is 3.0.8.
>
>
> On Wed, Aug 10, 2016 at 12:36 PM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>>
>> That's pretty low already, but perhaps you should lower it to see if it will
>> improve the dropped mutations during anti-compaction (even if it increases
>> repair time), otherwise the problem might be somewhere else. Generally
>> dropped mutations are a signal of cluster overload, so if there's nothing
>> else wrong perhaps you need to increase your capacity. What version are you
>> in?
>>
>> 2016-08-10 8:21 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>>>
>>> Not yet. Right now I have it set at 16.
>>> Would halving it more or less double the repair time?
>>>
>>> On Tue, Aug 9, 2016 at 7:58 PM, Paulo Motta <pauloricard...@gmail.com>
>>> wrote:
>>>>
>>>> Anticompaction throttling can be done by setting the usual
>>>> compaction_throughput_mb_per_sec knob on cassandra.yaml or via nodetool
>>>> setcompactionthroughput. Did you try lowering that  and checking if that
>>>> improves the dropped mutations?
>>>>
>>>> 2016-08-09 13:32 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I am running incremental repairs on a weekly basis (can't do it every
>>>>> day as one single run takes 36 hours), and every time, I have at least one
>>>>> node dropping mutations as part of the process (this almost always during
>>>>> the anticompaction phase). Ironically this leads to a system where 
>>>>> repairing
>>>>> makes data consistent at the cost of making some other data not 
>>>>> consistent.
>>>>>
>>>>> Does anybody know why this is happening?
>>>>>
>>>>> My feeling is that this might be caused by anticompacting column
>>>>> families with really wide rows and with many SStables. If that is the 
>>>>> case,
>>>>> any way I can throttle that?
>>>>>
>>>>> Thanks!
>>>>> Stefano
>>>>
>>>>
>>>
>>
>


Re: Import failure for use python cassandra-driver

2016-10-26 Thread Stefano Ortolani
Did you try the workaround they posted (aka, downgrading Cython)?

Cheers,
Stefano
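
For reference, a hedged sketch of that workaround (the exact Cython pin is an
assumption; the Stack Overflow thread linked below lists the versions reported
to work):

pip uninstall -y cassandra-driver
pip install "cython<0.25"                     # older Cython before rebuilding
pip install --no-cache-dir cassandra-driver==3.7.0
# or skip the compiled extensions entirely:
CASS_DRIVER_NO_CYTHON=1 pip install --no-cache-dir cassandra-driver==3.7.0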

On Wed, Oct 26, 2016 at 10:01 AM, Zao Liu  wrote:
> Same happen to my ubuntu boxes.
>
>   File
> "/home/jasonl/.pex/install/cassandra_driver-3.7.0-cp27-none-linux_x86_64.whl.ebfb31ab99650d53ad134e0b312c7494296cdd2b/cassandra_driver-3.7.0-cp27-none-linux_x86_64.whl/cassandra/cqlengine/connection.py",
> line 20, in 
>
> from cassandra.cluster import Cluster, _NOT_SET, NoHostAvailable,
> UserTypeDoesNotExist
>
> ImportError:
> /home/jasonl/.pex/install/cassandra_driver-3.7.0-cp27-none-linux_x86_64.whl.ebfb31ab99650d53ad134e0b312c7494296cdd2b/cassandra_driver-3.7.0-cp27-none-linux_x86_64.whl/cassandra/cluster.so:
> undefined symbol: PyException_Check
>
>
> And there is someone asked the same question in stack overflow:
>
> http://stackoverflow.com/questions/40251893/datastax-python-cassandra-driver-build-fails-on-ubuntu#
>
>
>
> On Wed, Oct 26, 2016 at 1:49 AM, Zao Liu  wrote:
>>
>> Hi,
>>
>> Suddenly I start to get this following errors when use python cassandra
>> driver 3.7.0 in my macbook pro running OS X EI Capitan. Tries to reinstall
>> the package and all the dependencies, unfortunately no luck. I was able to
>> run it a few days earlier. Really can't recall what I changed could cause
>> this.
>>
>>   File
>> "/Library/Python/2.7/site-packages/cassandra/cqlengine/connection.py", line
>> 20, in 
>> from cassandra.cluster import Cluster, _NOT_SET, NoHostAvailable,
>> UserTypeDoesNotExist
>> ImportError:
>> dlopen(/Library/Python/2.7/site-packages/cassandra/cluster.so, 2): Symbol
>> not found: _PyException_Check
>>   Referenced from: /Library/Python/2.7/site-packages/cassandra/cluster.so
>>   Expected in: flat namespace
>>  in /Library/Python/2.7/site-packages/cassandra/cluster.so
>>
>> Thanks,
>> Jason
>>
>>
>


Re: Repairing without -pr shows unexpected out-of-sync ranges

2016-10-03 Thread Stefano Ortolani
I was wondering: is (2) a direct consequence of a repair on the full
token range (and thus of anti-compaction running only on a subset of the RF
nodes)? If I understand correctly, a repair with -pr should fix this,
at the cost of all nodes performing the anticompaction phase?

Cheers,
Stefano

On Tue, Sep 27, 2016 at 4:09 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
> Didn't know about (2), and I actually have a time drift between the nodes.
> Thanks a lot Paulo!
>
> Regards,
> Stefano
>
> On Thu, Sep 22, 2016 at 6:36 PM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>>
>> There are a couple of things that could be happening here:
>> - There will be time differences between when nodes participating repair
>> flush, so in write-heavy tables there will always be minor differences
>> during validation, and those could be accentuated by low resolution merkle
>> trees, which will affect mostly larger tables.
>> - SSTables compacted during incremental repair will not be marked as
>> repaired, so nodes with different compaction cadences will have different
>> data in their unrepaired set, which will cause mismatches in the subsequent
>> incremental repairs. CASSANDRA-9143 will hopefully fix that limitation.
>>
>> 2016-09-22 7:10 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>>>
>>> Hi,
>>>
>>> I am seeing something weird while running repairs.
>>> I am testing 3.0.9 so I am running the repairs manually, node after node,
>>> on a cluster with RF=3. I am using a standard repair command (incremental,
>>> parallel, full range), and I just noticed that the third node detected some
>>> ranges out of sync with one of the nodes that just finished repairing.
>>>
>>> Since there was no dropped mutation, that sounds weird to me considering
>>> that the repairs are supposed to operate on the whole range.
>>>
>>> Any idea why?
>>> Maybe I am missing something?
>>>
>>> Cheers,
>>> Stefano
>>>
>>
>


Re: Repairing without -pr shows unexpected out-of-sync ranges

2016-09-27 Thread Stefano Ortolani
Didn't know about (2), and I actually have a time drift between the nodes.
Thanks a lot Paulo!

Regards,
Stefano

On Thu, Sep 22, 2016 at 6:36 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> There are a couple of things that could be happening here:
> - There will be time differences between when nodes participating repair
> flush, so in write-heavy tables there will always be minor differences
> during validation, and those could be accentuated by low resolution merkle
> trees, which will affect mostly larger tables.
> - SSTables compacted during incremental repair will not be marked as
> repaired, so nodes with different compaction cadences will have different
> data in their unrepaired set, which will cause mismatches in the subsequent
> incremental repairs. CASSANDRA-9143 will hopefully fix that limitation.
>
> 2016-09-22 7:10 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>
>> Hi,
>>
>> I am seeing something weird while running repairs.
>> I am testing 3.0.9 so I am running the repairs manually, node after node,
>> on a cluster with RF=3. I am using a standard repair command (incremental,
>> parallel, full range), and I just noticed that the third node detected some
>> ranges out of sync with one of the nodes that just finished repairing.
>>
>> Since there was no dropped mutation, that sounds weird to me considering
>> that the repairs are supposed to operate on the whole range.
>>
>> Any idea why?
>> Maybe I am missing something?
>>
>> Cheers,
>> Stefano
>>
>>
>


Repairing without -pr shows unexpected out-of-sync ranges

2016-09-22 Thread Stefano Ortolani
Hi,

I am seeing something weird while running repairs.
I am testing 3.0.9 so I am running the repairs manually, node after node,
on a cluster with RF=3. I am using a standard repair command (incremental,
parallel, full range), and I just noticed that the third node detected some
ranges out of sync with one of the nodes that just finished repairing.

Since there was no dropped mutation, that sounds weird to me considering
that the repairs are supposed to operate on the whole range.

Any idea why?
Maybe I am missing something?

Cheers,
Stefano


Re: How to start using incremental repairs?

2016-08-26 Thread Stefano Ortolani
An extract of this conversation should definitely be posted somewhere.
I've read a lot but never learnt all these bits...

On Fri, Aug 26, 2016 at 2:53 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> > I must admit that I fail to understand currently how running repair with
> -pr could leave unrepaired data though, even when ran on all nodes in all
> DCs, and how that could be specific to incremental repair (and would
> appreciate if someone shared the explanation).
>
> Anti-compaction, which marks tables as repaired, is disabled for partial
> range repairs (which includes partitioner-range repair) to avoid the extra
> I/O cost of needing to run anti-compaction multiple times in a node to
> repair it completely. For example, there is an optimization which skips
> anti-compaction for sstables fully contained in the repaired range (only
> the repairedAt field is mutated), which is leveraged by full range repair,
> which would not work in many cases for partial range repairs, yielding
> higher I/O.
>
> 2016-08-26 10:17 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>
>> I see. Didn't think about it that way. Thanks for clarifying!
>>
>>
>> On Fri, Aug 26, 2016 at 2:14 PM, Paulo Motta <pauloricard...@gmail.com>
>> wrote:
>>
>>> > What is the underlying reason?
>>>
>>> Basically to minimize the amount of anti-compaction needed, since with
>>> RF=3 you'd need to perform anti-compaction 3 times in a particular node to
>>> get it fully repaired, while without it you can just repair the full node's
>>> range in one run. Assuming you run repair frequent enough this will not be
>>> a big deal, since you will skip already repaired data in the next round so
>>> you will not have the problem of re-doing work as in non-inc non-pr repair.
>>>
>>> 2016-08-26 7:57 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>>>
>>>> Hi Paulo, could you elaborate on 2?
>>>> I didn't know incremental repairs were not compatible with -pr
>>>> What is the underlying reason?
>>>>
>>>> Regards,
>>>> Stefano
>>>>
>>>>
>>>> On Fri, Aug 26, 2016 at 1:25 AM, Paulo Motta <pauloricard...@gmail.com>
>>>> wrote:
>>>>
>>>>> 1. Migration procedure is no longer necessary after CASSANDRA-8004,
>>>>> and since you never ran repair before this would not make any difference
>>>>> anyway, so just run repair and by default (CASSANDRA-7250) this will
>>>>> already be incremental.
>>>>> 2. Incremental repair is not supported with -pr, -local or -st/-et
>>>>> options, so you should run incremental repair in all nodes in all DCs
>>>>> sequentially (you should be aware that this will probably generate 
>>>>> inter-DC
>>>>> traffic), no need to disable autocompaction or stopping nodes.
>>>>>
>>>>> 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov <ale...@gmail.com>:
>>>>>
>> >>>>>> I’m new to Cassandra and trying to figure out how to _start_ using
>>>>>> incremental repairs. I have seen article about “Migrating to incremental
>>>>>> repairs” but since I didn’t use repairs before at all and I use Cassandra
>>>>>> version v3.0.8, then maybe not all steps are needed which are mentioned 
>>>>>> in
>>>>>> Datastax article.
>>>>>> Should I start with full repair or I can start with executing
>>>>>> “nodetool repair -pr  my_keyspace” on all nodes without autocompaction
>>>>>> disabling and node stopping?
>>>>>>
>>>>>> I have 6 datacenters with 6 nodes in each DC. Is it enough to run
>>>>>>  “nodetool repair -pr  my_keyspace” in one DC only or it should be 
>>>>>> executed
>>>>>> on all nodes in _all_ DCs?
>>>>>>
>>>>>> I have tried to perform “nodetool repair -pr  my_keyspace” on all
>>>>>> nodes in all datacenters sequentially but I still can see non repaired
>>>>>> SSTables for my_keyspace   (Repaired at: 0). Is it expected behavior if
>>>>>> during repair data in my_keyspace wasn’t modified (no writes, no reads)?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to start using incremental repairs?

2016-08-26 Thread Stefano Ortolani
I see. Didn't think about it that way. Thanks for clarifying!

On Fri, Aug 26, 2016 at 2:14 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> > What is the underlying reason?
>
> Basically to minimize the amount of anti-compaction needed, since with
> RF=3 you'd need to perform anti-compaction 3 times in a particular node to
> get it fully repaired, while without it you can just repair the full node's
> range in one run. Assuming you run repair frequent enough this will not be
> a big deal, since you will skip already repaired data in the next round so
> you will not have the problem of re-doing work as in non-inc non-pr repair.
>
> 2016-08-26 7:57 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>
>> Hi Paulo, could you elaborate on 2?
>> I didn't know incremental repairs were not compatible with -pr
>> What is the underlying reason?
>>
>> Regards,
>> Stefano
>>
>>
>> On Fri, Aug 26, 2016 at 1:25 AM, Paulo Motta <pauloricard...@gmail.com>
>> wrote:
>>
>>> 1. Migration procedure is no longer necessary after CASSANDRA-8004, and
>>> since you never ran repair before this would not make any difference
>>> anyway, so just run repair and by default (CASSANDRA-7250) this will
>>> already be incremental.
>>> 2. Incremental repair is not supported with -pr, -local or -st/-et
>>> options, so you should run incremental repair in all nodes in all DCs
>>> sequentially (you should be aware that this will probably generate inter-DC
>>> traffic), no need to disable autocompaction or stopping nodes.
>>>
>>> 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov <ale...@gmail.com>:
>>>
>>>> I’m new to Cassandra and trying to figure out how to _start_ using
>>>> incremental repairs. I have seen article about “Migrating to incremental
>>>> repairs” but since I didn’t use repairs before at all and I use Cassandra
>>>> version v3.0.8, then maybe not all steps are needed which are mentioned in
>>>> Datastax article.
>>>> Should I start with full repair or I can start with executing “nodetool
>>>> repair -pr  my_keyspace” on all nodes without autocompaction disabling and
>>>> node stopping?
>>>>
>>>> I have 6 datacenters with 6 nodes in each DC. Is it enough to run
>>>>  “nodetool repair -pr  my_keyspace” in one DC only or it should be executed
>>>> on all nodes in _all_ DCs?
>>>>
>>>> I have tried to perform “nodetool repair -pr  my_keyspace” on all nodes
>>>> in all datacenters sequentially but I still can see non repaired SSTables
>>>> for my_keyspace   (Repaired at: 0). Is it expected behavior if during
>>>> repair data in my_keyspace wasn’t modified (no writes, no reads)?
>>>>
>>>
>>>
>>
>


Re: How to start using incremental repairs?

2016-08-26 Thread Stefano Ortolani
Hi Paulo, could you elaborate on 2?
I didn't know incremental repairs were not compatible with -pr
What is the underlying reason?

Regards,
Stefano

On Fri, Aug 26, 2016 at 1:25 AM, Paulo Motta 
wrote:

> 1. Migration procedure is no longer necessary after CASSANDRA-8004, and
> since you never ran repair before this would not make any difference
> anyway, so just run repair and by default (CASSANDRA-7250) this will
> already be incremental.
> 2. Incremental repair is not supported with -pr, -local or -st/-et
> options, so you should run incremental repair in all nodes in all DCs
> sequentially (you should be aware that this will probably generate inter-DC
> traffic), no need to disable autocompaction or stopping nodes.
>
> 2016-08-25 18:27 GMT-03:00 Aleksandr Ivanov :
>
>> I’m new to Cassandra and trying to figure out how to _start_ using
>> incremental repairs. I have seen article about “Migrating to incremental
>> repairs” but since I didn’t use repairs before at all and I use Cassandra
>> version v3.0.8, then maybe not all steps are needed which are mentioned in
>> Datastax article.
>> Should I start with full repair or I can start with executing “nodetool
>> repair -pr  my_keyspace” on all nodes without autocompaction disabling and
>> node stopping?
>>
>> I have 6 datacenters with 6 nodes in each DC. Is it enough to run
>>  “nodetool repair -pr  my_keyspace” in one DC only or it should be executed
>> on all nodes in _all_ DCs?
>>
>> I have tried to perform “nodetool repair -pr  my_keyspace” on all nodes
>> in all datacenters sequentially but I still can see non repaired SSTables
>> for my_keyspace   (Repaired at: 0). Is it expected behavior if during
>> repair data in my_keyspace wasn’t modified (no writes, no reads)?
>>
>
>
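
A hedged sketch of how to check the "Repaired at" status mentioned above (the
data directory path is illustrative; sstablemetadata only reads the files, so
it is safe to run against a live node):

for f in /var/lib/cassandra/data/my_keyspace/*/*-Data.db; do
    echo "$f"
    sstablemetadata "$f" | grep "Repaired at"   # 0 means still unrepaired
done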


Re: JVM Crash on 3.0.6

2016-08-11 Thread Stefano Ortolani
Not really related, but know that on 12.04 I had to disable jemalloc,
otherwise nodes would randomly die at startup (
https://issues.apache.org/jira/browse/CASSANDRA-11723)

Regards,
Stefano
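
A hedged diagnostic sketch related to the above: check whether the running JVM
actually has jemalloc mapped, and look for native crash traces (paths depend on
your packaging and on -XX:ErrorFile, so treat them as assumptions):

pid=$(pgrep -f CassandraDaemon | head -1)
grep -c jemalloc "/proc/$pid/maps"                 # non-zero: libjemalloc is loaded
dmesg | grep -i segfault                           # kernel trace of native crashes
ls /var/lib/cassandra/hs_err_pid*.log 2>/dev/null  # JVM fatal error logs, if any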

On Thu, Aug 11, 2016 at 10:28 AM, Riccardo Ferrari 
wrote:

> Hi C* users,
>
> In recent times I had a couple of my nodes crashing (on different dates). I
> don't have core dumps; however, my JVM crash logs go like this:
> ===
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f8f608c8e40, pid=6916, tid=140253195458304
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> ...
> ---  T H R E A D  ---
>
>
> Current thread (0x7f8f5c7b2d50):  JavaThread
> "CompactionExecutor:11952" daemon [_thread_in_native, id=16219,
> stack(0x7f8f3de0d000,0x7f8f3de4e000)]
> ...
> Stack: [0x7f8f3de0d000,0x7f8f3de4e000],  sp=0x7f8f3de4c0e0,
>  free space=252k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0
>
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 4150  net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast([BLjava/nio/
> ByteBuffer;I[BLjava/nio/ByteBuffer;II)I (0 bytes) @ 0x7f8f791e4723
> [0x7f8f791e4680+0xa3]
> J 19836 C2 
> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap()V
> (354 bytes) @ 0x7f8f7b714930 [0x7f8f7b714320+0x610]
> J 6662 C2 org.apache.cassandra.db.columniterator.
> AbstractSSTableIterator.(Lorg/apache/cassandra/io/
> sstable/format/SSTableReader;Lorg/apache/cassandra/io/util/
> FileDataInput;Lorg/apache/cassandra/db/DecoratedKey;
> Lorg/apache/cassandra/db/RowIndexEntry;Lorg/apache
> /cassandra/db/filter/ColumnFilter;Z)V (389 bytes) @ 0x7f8f79c1cdb8
> [0x7f8f79c1c500+0x8b8]
> J 22393 C2 org.apache.cassandra.db.SinglePartitionReadCommand.
> queryMemtableAndDiskInternal(Lorg/apache/cassandra/db/
> ColumnFamilyStore;Z)Lorg/apache/cassandra/db/rows/UnfilteredRowIterator;
> (818 bytes) @ 0x7f8f7c1d4364 [0x7f8f7c1d2f40+0x1424]
> J 22166 C1 org.apache.cassandra.db.Keyspace.indexPartition(Lorg/
> apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/
> ColumnFamilyStore;Ljava/util/Set;)V (274 bytes) @ 0x7f8f7beb6304
> [0x7f8f7beb5420+0xee4]
> j  org.apache.cassandra.index.SecondaryIndexBuilder.build()V+46
> j  org.apache.cassandra.db.compaction.CompactionManager$11.run()V+18
> J 22293 C2 java.util.concurrent.ThreadPoolExecutor.runWorker(
> Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @
> 0x7f8f7b17727c [0x7f8f7b176da0+0x4dc]
> J 21302 C2 java.lang.Thread.run()V (17 bytes) @ 0x7f8f79fe59f8
> [0x7f8f79fe59a0+0x58]
> v  ~StubRoutines::call_stub
> ...
> VM state:not at safepoint (normal execution)
>
> VM Mutex/Monitor currently owned by a thread: None
>
> Heap:
>  par new generation   total 368640K, used 123009K [0x0006d5e0,
> 0x0006eee0, 0x0006eee0)
>   eden space 327680K,  34% used [0x0006d5e0, 0x0006dcaf35c8,
> 0x0006e9e0)
>   from space 40960K,  27% used [0x0006e9e0, 0x0006ea92cf00,
> 0x0006ec60)
>   to   space 40960K,   0% used [0x0006ec60, 0x0006ec60,
> 0x0006eee0)
>  concurrent mark-sweep generation total 3426304K, used 1288977K
> [0x0006eee0, 0x0007c000, 0x0007c000)
>  Metaspace   used 41685K, capacity 42832K, committed 43156K, reserved
> 1087488K
>   class spaceused 4455K, capacity 4702K, committed 4756K, reserved
> 1048576K
> ...
> OS:DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=12.04
> DISTRIB_CODENAME=precise
> DISTRIB_DESCRIPTION="Ubuntu 12.04.1 LTS"
>
> uname:Linux 3.2.0-35-virtual #55-Ubuntu SMP Wed Dec 5 18:02:05 UTC 2012
> x86_64
> libc:glibc 2.15 NPTL 2.15
> rlimit: STACK 8192k, CORE 0k, NPROC 119708, NOFILE 10, AS infinity
> load average:2.96 1.08 0.60
>
> What am I missing?
> Both crashes seem to happen during compaction and when running native
> code (LZ4).
> Both crashes happen when the nodes are doing scheduled repair (so under
> increased load).
> Machines are 4vCPUs and 15GB ram (m1.xlarge)
> Any hint?
>
> Best,
>


Re: Incremental repairs leading to unrepaired data

2016-08-10 Thread Stefano Ortolani
That's what I was thinking. Maybe GC pressure?
Some more details: during anticompaction I have some CFs exploding to 1K
SStables (which go back to ~200 upon completion).
HW specs should be quite good (12 cores/32 GB ram) but, I admit, still
relying on spinning disks, with ~150GB per node.
Current version is 3.0.8.


On Wed, Aug 10, 2016 at 12:36 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> That's pretty low already, but perhaps you should lower it to see if it will
> improve the dropped mutations during anti-compaction (even if it increases
> repair time), otherwise the problem might be somewhere else. Generally
> dropped mutations are a signal of cluster overload, so if there's nothing
> else wrong perhaps you need to increase your capacity. What version are you
> in?
>
> 2016-08-10 8:21 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>
>> Not yet. Right now I have it set at 16.
>> Would halving it more or less double the repair time?
>>
>> On Tue, Aug 9, 2016 at 7:58 PM, Paulo Motta <pauloricard...@gmail.com>
>> wrote:
>>
>>> Anticompaction throttling can be done by setting the usual
>>> compaction_throughput_mb_per_sec knob on cassandra.yaml or via nodetool
>>> setcompactionthroughput. Did you try lowering that  and checking if that
>>> improves the dropped mutations?
>>>
>>> 2016-08-09 13:32 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>>>
>>>> Hi all,
>>>>
>>>> I am running incremental repairs on a weekly basis (can't do it every
>>>> day as one single run takes 36 hours), and every time, I have at least one
>>>> node dropping mutations as part of the process (this almost always during
>>>> the anticompaction phase). Ironically this leads to a system where
>>>> repairing makes data consistent at the cost of making some other data not
>>>> consistent.
>>>>
>>>> Does anybody know why this is happening?
>>>>
>>>> My feeling is that this might be caused by anticompacting column
>>>> families with really wide rows and with many SStables. If that is the case,
>>>> any way I can throttle that?
>>>>
>>>> Thanks!
>>>> Stefano
>>>>
>>>
>>>
>>
>


Re: Incremental repairs leading to unrepaired data

2016-08-10 Thread Stefano Ortolani
Not yet. Right now I have it set at 16.
Would halving it more or less double the repair time?
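
A minimal sketch of the throttling being discussed (values are illustrative;
as noted below, the same setting also throttles anticompaction):

nodetool getcompactionthroughput        # note the current value (MB/s)
nodetool setcompactionthroughput 8      # throttle while the repair runs
nodetool repair my_keyspace
nodetool setcompactionthroughput 16     # restore the previous value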

On Tue, Aug 9, 2016 at 7:58 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> Anticompaction throttling can be done by setting the usual
> compaction_throughput_mb_per_sec knob on cassandra.yaml or via nodetool
> setcompactionthroughput. Did you try lowering that  and checking if that
> improves the dropped mutations?
>
> 2016-08-09 13:32 GMT-03:00 Stefano Ortolani <ostef...@gmail.com>:
>
>> Hi all,
>>
>> I am running incremental repairs on a weekly basis (can't do it every day
>> as one single run takes 36 hours), and every time, I have at least one node
>> dropping mutations as part of the process (this almost always during the
>> anticompaction phase). Ironically this leads to a system where repairing
>> makes data consistent at the cost of making some other data not consistent.
>>
>> Does anybody know why this is happening?
>>
>> My feeling is that this might be caused by anticompacting column families
>> with really wide rows and with many SStables. If that is the case, any way
>> I can throttle that?
>>
>> Thanks!
>> Stefano
>>
>
>


Incremental repairs leading to unrepaired data

2016-08-09 Thread Stefano Ortolani
Hi all,

I am running incremental repairs on a weekly basis (can't do it every day
as one single run takes 36 hours), and every time, I have at least one node
dropping mutations as part of the process (this almost always during the
anticompaction phase). Ironically this leads to a system where repairing
makes data consistent at the cost of making some other data not consistent.

Does anybody know why this is happening?

My feeling is that this might be caused by anticompacting column families
with really wide rows and with many SStables. If that is the case, any way
I can throttle that?

Thanks!
Stefano


Re: (C)* stable version after 3.5

2016-07-14 Thread Stefano Ortolani
FWIW, I've recently upgraded from 2.1 to 3.0 without issues of any sort,
but admittedly I haven't been using anything too fancy.

Cheers,
Stefano

On Wed, Jul 13, 2016 at 10:28 PM, Alain RODRIGUEZ 
wrote:

> Hi Anuj
>
> From
> https://docs.datastax.com/en/latest-upgrade/upgrade/cassandra/upgrdBestPractCassandra.html
> :
>
>
>>- Employ a continual upgrade strategy for each year. Upgrades are
>>impacted by the version you are upgrading from and the version you are
>>upgrading to. The greater the gap between the current version and the
>>target version, the more complex the upgrade.
>>
>>
> I could not find it, but historically I am quite sure it was explicitly
> recommended not to skip a major version (for a rolling upgrade).
> Anyway it is clear that the bigger the gap is, the more
> careful we need to be.
>
> On the other hand, I see 2.2 as 2.1 + some features but no real breaking
> changes (as 3.0 was already in the pipe), and doing a 2.2 was decided
> because 3.0 was taking a long time to be released and some features had
> been ready for a while.
>
> I might be wrong on some stuff above, but one can only speak with his
> knowledge and from his point of view. So I ended up saying:
>
> Also I am not sure if the 2.2 major version is something you can skip
>> while upgrading through a rolling restart. I believe you can, but it is not
>> what is recommended.
>>
>
> Note that "I am not sure", "I believe you can"... So it was more a
> thought, something to explore for Varun :-).
>
> And I actually encouraged him to move forward. Now that Tyler Hobbs
> confirmed it works, you can put a lot more trust in the fact that this
> upgrade will work :-). I would still encourage people to test it (for
> client compatibility, corner cases due to models, ...).
>
> I hope I am more clear now,
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-07-13 18:39 GMT+02:00 Tyler Hobbs :
>
>>
>> On Wed, Jul 13, 2016 at 11:32 AM, Anuj Wadehra 
>> wrote:
>>
>>> Why do you think that skipping 2.2 is not recommended when NEWS.txt
>>> suggests otherwise? Can you elaborate?
>>
>>
>> We test upgrading from 2.1 -> 3.x and upgrading from 2.2 -> 3.x
>> equivalently.  There should not be a difference in terms of how well the
>> upgrade is supported.
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


Re: Open source equivalents of OpsCenter

2016-07-14 Thread Stefano Ortolani
Replaced OpsCenter with a mix of:

* metrics-graphite-3.1.0.jar installed in the same classpath of C*
* Custom script to push system metrics (cpu/mem/io)
* Grafana to create the dashboard
* Custom repairs script

Still not optimal but getting there...

Stefano
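
A hedged sketch of how the graphite reporter above gets wired up (host, prefix
and paths are assumptions, the YAML follows the metrics-reporter-config format,
and the reporter-config jars must also be on Cassandra's classpath):

cp metrics-graphite-3.1.0.jar /usr/share/cassandra/lib/
cat > /etc/cassandra/metrics-graphite.yaml <<'EOF'
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra.cluster1'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.+'
EOF
# then point the JVM at it, e.g. in cassandra-env.sh:
# JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=metrics-graphite.yaml"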

On Thu, Jul 14, 2016 at 10:18 AM, Romain Hardouin 
wrote:

> Hi Juho,
>
> Out of curiosity, which stack did you use to make your dashboard?
>
> Romain
>
> On Thursday, 14 July 2016 at 10:43, Juho Mäkinen wrote:
>
>
> I'm doing some work on replacing OpsCenter in our setup. I ended up creating
> a Docker container which contains the following features:
>
>  - Cassandra 2.2.7
>  - MX4J (a JMX to REST bridge) as a java-agent
>  - metrics-graphite-3.1.0.jar (export some but not all JMX to graphite)
>  - a custom ruby script which uses MX4J to export some JMX metrics to graphite
> which we don't otherwise get.
>
> With this I will get all our cassandra instances and their JMX exposed
> data to graphite, which allows us to use Grafana and Graphite to draw
> pretty dashboards.
>
> In addition I started writing some code which currently provides the
> following features:
>  - A dashboard which provides a similar ring view what OpsCenter does,
> with onMouseOver features to display more info on each node.
>  - Simple HTTP GET/POST based api to do
> - Setup a new non-vnode based cluster
> - Get a JSON blob on cluster information, all its tokens, machines and
> so on
> - Api for new cluster instances so that they can get a token slot from
> the ring when they boot.
> - Option to kill a dead node and mark its slot for replace, so the new
> booting node can use cassandra.replace_address option.
>
> The code is not yet packaged in any way for distribution and some parts
> depend on our Chef installation, but if there's interest I can publish at
> least some parts from it.
>
>  - Garo
>
> On Thu, Jul 14, 2016 at 10:54 AM, Romain Hardouin 
> wrote:
>
> Do you run C* on physical machines or in the cloud? If the topology doesn't
> change too often you can have a look a Zabbix. The downside is that you
> have to set up all the JMX metrics yourself... but that's also a good point
> because you can have custom metrics. If you want nice graphs/dashboards you
> can use Grafana to plot Zabbix data. (We're also using SaaS but that's not
> open source).
> For the rolling restart and other admin stuff we're using Rundeck. It's a
> great tool when working in a team.
>
> (I think it's time to implement an open source alternative to OpsCenter.
> If some guys are interested I'm in.)
>
> Best,
>
> Romain
>
>
>
>
> On Thursday, 14 July 2016 at 0:01, Ranjib Dey wrote:
>
>
> we use datadog (metrics emitted as raw statsd) for the dashboard. All
> repair & compaction is done via blender & serf[1].
> [1]https://github.com/pagerduty/blender
>
>
> On Wed, Jul 13, 2016 at 2:42 PM, Kevin O'Connor  wrote:
>
> Now that OpsCenter doesn't work with open source installs, are there any
> attempts at an open source equivalent? I'd be more interested in looking at
> metrics of a running cluster and doing other tasks like managing
> repairs/rolling restarts more so than historical data.
>
>
>
>
>
>
>
>


Re: C* 3.0.7 - Compactions pending after TRUNCATE

2016-06-28 Thread Stefano Ortolani
I am updating the following ticket
https://issues.apache.org/jira/browse/CASSANDRA-12100 as I discover new
bits.

Regards,
Stefano

On Tue, Jun 28, 2016 at 9:37 AM, Stefano Ortolani <ostef...@gmail.com>
wrote:

> Hi all,
>
> I've just updated to C* 3.0.7, and I am now seeing some compactions
> getting stuck: nodetool compactionstats has been showing the same output for
> the last 6 hours, and all compactions are stuck at 0.00% progress.
>
> What is interesting is that the stuck CFs have been truncated exactly 6
> hours ago.
> I am fairly confident this issue was not there in C* 3.0.5.
>
> Any idea?
>
> Regards,
> Stefano Ortolani
>
>
>


C* 3.0.7 - Compactions pending after TRUNCATE

2016-06-28 Thread Stefano Ortolani
Hi all,

I've just updated to C* 3.0.7, and I am now seeing some compactions getting
stuck: nodetool compactionstats has been showing the same output for the
last 6 hours, and all compactions are stuck at 0.00% progress.

What is interesting is that the stuck CFs have been truncated exactly 6
hours ago.
I am fairly confident this issue was not there in C* 3.0.5.

Any idea?

Regards,
Stefano Ortolani


Re: Question about sequential repair vs parallel

2016-06-23 Thread Stefano Ortolani
Yes, because you keep a snapshot in the meantime, if I remember correctly.

Regards,
Stefano
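
For context, a minimal sketch of the two modes on 2.1 (flags as accepted by
nodetool in 2.1; the keyspace name is illustrative):

nodetool repair -pr my_keyspace        # sequential (default): snapshot-based, one replica at a time
nodetool repair -par -pr my_keyspace   # parallel: all replicas validate at once, no snapshots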

On Thu, Jun 23, 2016 at 4:22 PM, Jean Carlo 
wrote:

> Cassandra 2.1.12
>
> During a sequential repair -pr, we are experiencing an
> exponential increase in the number of sstables for an LCS table.
>
> Does anyone of you guys know whether, theoretically speaking, a sequential
> repair does more sstable streaming among replicas than a parallel repair?
>
>
>
> Best regards
>
> Jean Carlo
>
> "The best way to predict the future is to invent it" Alan Kay
>


Re: Incorrect progress percentage while repairing

2016-06-02 Thread Stefano Ortolani
Forgot to add the C* version. That would be 3.0.6.

Regards,
Stefano Ortolani

On Thu, Jun 2, 2016 at 3:55 PM, Stefano Ortolani <ostef...@gmail.com> wrote:

> Hi,
>
> While running incremental (parallel) repairs on the primary partition range
> (-pr), I rarely see the progress percentage going over 20%/25%.
>
> [2016-06-02 14:12:23,207] Repair session
> cceae4c0-28b0-11e6-86d1-0550db2f124e for range
> [(8861148493126800521,8883879502599079650]] finished (progress: 22%)
>
> Nodetool does return normally and no error is found in its output or in
> the cassandra logs.
> Any idea why? Is this behavior expected?
>
> Regards,
> Stefano Ortolani
>
>
>


Incorrect progress percentage while repairing

2016-06-02 Thread Stefano Ortolani
Hi,

While running incremental (parallel) repairs on the primary partition range
(-pr), I rarely see the progress percentage going over 20%/25%.

[2016-06-02 14:12:23,207] Repair session
cceae4c0-28b0-11e6-86d1-0550db2f124e for range
[(8861148493126800521,8883879502599079650]] finished (progress: 22%)

Nodetool does return normally and no error is found in its output or in the
cassandra logs.
Any idea why? Is this behavior expected?

Regards,
Stefano Ortolani


Re: Upgrade from 2.1.11 to 3.0.5 leads to unstable nodes

2016-05-06 Thread Stefano Ortolani
Hi all,

Just updated the ticket. It turned out it was libjemalloc segfaulting the
JVM.
Regardless of the Java version (tried to update but no improvement), new C*
versions (maybe because they preload libjemalloc by default) seem to be
affected.

Cheers,
Stefano

On Thu, May 5, 2016 at 5:01 PM, Stefano Ortolani <ostef...@gmail.com> wrote:

> Hi,
>
> I am experiencing some weird behaviors after upgrading 2 nodes (out of 13)
> to C* 3.0.5 (from 2.1.11). Basically, after restarting a second time, there
> is a small chance that the node will die without outputting anything to the
> logs (not even dmesg).
>
> This happened on both nodes I upgraded. The only "anomalies" I see in the
> logs (although not related to the moment a node dies) are:
>
> * Lots of the following messages against all IPs of the cluster (every
> second)
>
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2540341017 for /x.y.b.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000551507 for /x.y.a.7
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000479104 for /x.y.a.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
> Ignoring interval time of 2000471247 for /x.y.b.3
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,259 FailureDetector.java:456 -
> Ignoring interval time of 2000605748 for /x.y.a.5
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
> Ignoring interval time of 2000731307 for /x.y.b.6
> DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
> Ignoring interval time of 3000404107 for /x.y.b.1
>
> * Some metrics are not being pushed to graphite (but some do get to the
> server). Also, every time the node tries to push them I can see the
> following error in the logs:
>
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 23:53:37,770
> ScheduledReporter.java:119 - RuntimeException thrown from
> GraphiteReporter#report. Exception was suppressed.
> java.lang.IllegalStateException: Unable to compute ceiling for max when
> histogram overflowed
> at
> org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231)
> ~[apache-cassandra-3.0.5.jar:3.0.5]
> at
> org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103)
> ~[apache-cassandra-3.0.5.jar:3.0.5]
> at
> com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252)
> ~[metrics-graphite-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166)
> ~[metrics-graphite-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
> ~[metrics-core-3.1.0.jar:3.1.0]
> at
> com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
> ~[metrics-core-3.1.0.jar:3.1.0]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_60]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [na:1.8.0_60]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [na:1.8.0_60]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [na:1.8.0_60]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_60]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_60]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>
> Besides these, logs are clean. I've opened a ticket here (
> https://issues.apache.org/jira/browse/CASSANDRA-11723) but any help
> debugging this is more than welcome.
>
> Regards,
> Stefano
>
>


Upgrade from 2.1.11 to 3.0.5 leads to unstable nodes

2016-05-05 Thread Stefano Ortolani
Hi,

I am experiencing some weird behaviors after upgrading 2 nodes (out of 13)
to C* 3.0.5 (from 2.1.11). Basically, after restarting a second time, there
is a small chance that the node will die without outputting anything to the
logs (not even dmesg).

This happened on both nodes I upgraded. The only "anomalies" I see in the
logs (although not related to the moment a node dies) are:

* Lots of the following messages against all IPs of the cluster (every
second)

DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
Ignoring interval time of 2540341017 for /x.y.b.5
DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
Ignoring interval time of 2000551507 for /x.y.a.7
DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
Ignoring interval time of 2000479104 for /x.y.a.3
DEBUG [GossipStage:1] 2016-05-05 23:52:02,260 FailureDetector.java:456 -
Ignoring interval time of 2000471247 for /x.y.b.3
DEBUG [GossipStage:1] 2016-05-05 23:52:03,259 FailureDetector.java:456 -
Ignoring interval time of 2000605748 for /x.y.a.5
DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
Ignoring interval time of 2000731307 for /x.y.b.6
DEBUG [GossipStage:1] 2016-05-05 23:52:03,260 FailureDetector.java:456 -
Ignoring interval time of 3000404107 for /x.y.b.1

* Some metrics are not being pushed to graphite (but some do get to the
server). Also, every time the node tries to push them I can see the
following error in the logs:

ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 23:53:37,770
ScheduledReporter.java:119 - RuntimeException thrown from
GraphiteReporter#report. Exception was suppressed.
java.lang.IllegalStateException: Unable to compute ceiling for max when
histogram overflowed
at
org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231)
~[apache-cassandra-3.0.5.jar:3.0.5]
at
org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103)
~[apache-cassandra-3.0.5.jar:3.0.5]
at
com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252)
~[metrics-graphite-3.1.0.jar:3.1.0]
at
com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166)
~[metrics-graphite-3.1.0.jar:3.1.0]
at
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
~[metrics-core-3.1.0.jar:3.1.0]
at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
~[metrics-core-3.1.0.jar:3.1.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[na:1.8.0_60]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[na:1.8.0_60]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[na:1.8.0_60]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[na:1.8.0_60]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

Besides these, logs are clean. I've opened a ticket here (
https://issues.apache.org/jira/browse/CASSANDRA-11723) but any help
debugging this is more than welcome.

Regards,
Stefano


Re: [Marketing Mail] Migrating to incremental repairs

2015-11-19 Thread Stefano Ortolani
As far as I know, the docs are quite inconsistent on the matter.
Based on some research here and on IRC, recent versions of Cassandra do not
require anything specific when migrating to incremental repairs other than
the -inc switch, even on LCS.
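
(For reference, the per-node procedure documented for older versions boils down
to roughly the following; just a sketch assuming C* 2.1, default data paths and
a packaged install, so double-check every step against the docs for your
version before running it:)

nodetool disableautocompaction
nodetool repair                       # default full, sequential repair
sudo service cassandra stop
# mark the SSTables that existed before compaction was disabled as repaired
find /var/lib/cassandra/data/<keyspace> -name '*-Data.db' > list_of_sstable_names.txt
sstablerepairedset --is-repaired -f list_of_sstable_names.txt
sudo service cassandra start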

Any confirmation on the matter is more than welcome.

Regards,
Stefano

On Wed, Nov 18, 2015 at 3:59 PM, Reynald Bourtembourg <
reynald.bourtembo...@esrf.fr> wrote:

> Well,
>
> By re-reading my e-mail, I understood the rationale behind doing a full
> sequential repair for each node.
> I was confused by the fact that in our case, we have 3 nodes with RF = 3,
> so all the nodes are storing all replicas.
> So we are in a special case.
> As soon as you have more than 3 nodes, this is no longer the case.
>
> In any case, in our special case (3 nodes and RF=3), could we apply the
> following migration procedure?:
> - disableautocompaction on all nodes at the same time
>  - run the full sequential repair
>  - For each node:
> - stop the node
> - Use the tool sstablerepairedset to mark all the SSTables that
> were created before you disabled compaction.
> - Restart Cassandra
>
> I'd be glad if someone could answer my other questions in any case ;-).
>
> Thanks in advance for your help
>
> Reynald
>
>
>
> On 18/11/2015 16:45, Reynald Bourtembourg wrote:
>
> Hi,
>
> We currently have a 3 nodes Cassandra cluster with RF = 3.
> We are using Cassandra 2.1.7.
>
> We would like to start using incremental repairs.
> We have some tables using LCS compaction strategy and some others using
> STCS.
>
> Here is the procedure written in the documentation:
>
> To migrate to incremental repair, one node at a time:
>
>1. Disable compaction on the node using nodetool disableautocompaction.
>2. Run the default full, sequential repair.
>3. Stop the node.
>4. Use the tool sstablerepairedset to mark all the SSTables that were
>created before you disabled compaction.
>5. Restart cassandra
>
>
> In our case, a full sequential repair takes about 5 days.
> If I follow the procedure described above and if my understanding is
> correct, it's gonna take at least 15 days (3 repairs of 5 days) before we
> are able to use the incremental repairs, right(?), since we need to do it
> one node at a time (one full sequential repair per node?).
>
> If my understanding is correct, what is the rationale behind the fact that
> we need to run a full sequential repair once for each node?
> I understood a full sequential repair would repair all the sstables on all
> the nodes. So doing it only once should be enough, right?
>
> Is it possible to do the following instead of what is written in the
> documentation?:
>  - disableautocompaction on all nodes at the same time
>  - run the full sequential repair
>  - For each node:
> - stop one node
> - Use the tool sstablerepairedset to mark all the SSTables that
> were created before you disabled compaction.
> - Restart Cassandra
> Without having to run the full sequential repair 3 times?
>
> The documentation states that if we don't execute this migration
> procedure, the first time we will run incremental repair, Cassandra will
> perform size-tiering on all SSTables because the repaired/unrepaired status
> is unknown and this operation can take a long time.
> Do you think this operation could take more than 15 days in our case?
>
> I understood that we only need to use sstablerepairedset on the SSTables
> related to the tables using LCS compaction strategy and which were created
> before the auto compaction was disabled.
> Is my understanding correct?
>
> The documentation is not very explicit but I suppose the following
> sentence:
> "4. Use the tool sstablerepairedset to mark all the SSTables that were
> created before you disabled compaction."
> means we need to invoke "sstablerepairedset --is-repaired -f
> list_of_sstable_names.txt" on the LCS SSTables that were created before the
> compaction was disabled.
>
> Is this correct?
>
> Do we need to enableautocompaction again after the Cassandra restart or is
> it done automatically?
>
> Would you recommend us to upgrade our Cassandra version before starting
> the incremental repair migration?
>
> Thank you for your help and sorry for the long e-mail.
>
> Reynald
>
>
>
>
>
>
>


Re: Running Cassandra on Java 8 u60..

2015-09-25 Thread Stefano Ortolani
I think those were referring to Java 7 and G1GC (early versions were buggy).

Cheers,
Stefano


On Fri, Sep 25, 2015 at 5:08 PM, Kevin Burton  wrote:

> Any issues with running Cassandra 2.0.16 on Java 8? I remember there is
> long term advice on not changing the GC but not the underlying version of
> Java.
>
> Thoughts?
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Leveled Compaction Strategy with a really intensive delete workload

2015-05-26 Thread Stefano Ortolani
I see, thanks Jason!

Can a dev confirm it is safe to apply those changes on live data? Also, if
I understood correctly, those parameters still obey gc_grace_seconds,
that is, no compaction to evict tombstones will take place before
gc_grace_seconds has elapsed, correct?

Cheers,
Stefano

On Tue, May 26, 2015 at 5:17 AM, Jason Wee peich...@gmail.com wrote:

 Hi Stefano,

 I did a quick test, it looks almost instant if you do alter but remember,
 in my test machine, there are no loaded data yet and switching from stcs to
 lcs.

 cqlsh:jw_schema1> CREATE TABLE DogTypes ( block_id uuid, species text,
 alias text, population varint, PRIMARY KEY (block_id) ) WITH caching =
 'keys_only' and COMPACTION = {'class': 'SizeTieredCompactionStrategy'};
 cqlsh:jw_schema1> desc table dogtypes;

 CREATE TABLE jw_schema1.dogtypes (
 block_id uuid PRIMARY KEY,
 alias text,
 population varint,
 species text
 ) WITH bloom_filter_fp_chance = 0.01
 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
 AND comment = ''
 AND compaction = {'min_threshold': '4', 'class':
 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
 'max_threshold': '32'}
 AND compression = {'sstable_compression':
 'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 864000
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

 cqlsh:jw_schema1> ALTER TABLE dogtypes WITH COMPACTION = {'class':
 'LeveledCompactionStrategy', 'tombstone_threshold': '0.3',
 'unchecked_tombstone_compaction': 'true'} ;
 cqlsh:jw_schema1>




 -


 INFO  [MigrationStage:1] 2015-05-26 12:12:25,867
 ColumnFamilyStore.java:882 - Enqueuing flush of schema_keyspaces: 436 (0%)
 on-heap, 0 (0%) off-heap
 INFO  [MemtableFlushWriter:146] 2015-05-26 12:12:25,869 Memtable.java:339
 - Writing Memtable-schema_keyspaces@6829883(138 serialized bytes, 3 ops,
 0%/0% of on/off-heap limit)
 INFO  [MemtableFlushWriter:146] 2015-05-26 12:12:26,173 Memtable.java:378
 - Completed flushing
 /var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-8-Data.db
 (163 bytes) for commitlog position ReplayPosition(segmentId=1432265013436,
 position=423408)
 INFO  [CompactionExecutor:191] 2015-05-26 12:12:26,174
 CompactionTask.java:140 - Compacting
 [SSTableReader(path='/var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-6-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-5-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-8-Data.db'),
 SSTableReader(path='/var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-7-Data.db')]
 INFO  [MigrationStage:1] 2015-05-26 12:12:26,176
 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columnfamilies: 5307
 (0%) on-heap, 0 (0%) off-heap
 INFO  [MemtableFlushWriter:147] 2015-05-26 12:12:26,178 Memtable.java:339
 - Writing Memtable-schema_columnfamilies@32790230(1380 serialized bytes,
 27 ops, 0%/0% of on/off-heap limit)
 INFO  [MemtableFlushWriter:147] 2015-05-26 12:12:26,550 Memtable.java:378
 - Completed flushing
 /var/lib/cassandra/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-7-Data.db
 (922 bytes) for commitlog position ReplayPosition(segmentId=1432265013436,
 position=423408)
 INFO  [MigrationStage:1] 2015-05-26 12:12:26,598
 ColumnFamilyStore.java:882 - Enqueuing flush of dogtypes: 0 (0%) on-heap, 0
 (0%) off-heap
 INFO  [CompactionExecutor:191] 2015-05-26 12:12:26,668
 CompactionTask.java:270 - Compacted 4 sstables to
 [/var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-9,].
  751 bytes to 262 (~34% of original) in 492ms = 0.000508MB/s.  6 total
 partitions merged to 3.  Partition merge counts were {1:2, 4:1, }


 hth

 jason

 On Tue, May 26, 2015 at 6:24 AM, Stefano Ortolani ostef...@gmail.com
 wrote:

 Ok, I am reading a bit more about compaction subproperties here (
 http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html)
 and it seems that tombstone_threshold and unchecked_tombstone_compaction
 might come in handy.

 Does anybody know if changing any of these values (via ALTER) is possible
 without downtime, and how fast those values are picked up?

 Cheers,
 Stefano


 On Mon, May 25, 2015 at 1:32 PM, Stefano Ortolani ostef...@gmail.com
 wrote:

 Hi all,

 Thanks for your answers! Yes, I agree that a delete intensive workload
 is not something Cassandra is designed for.

 Unfortunately

Re: LeveledCompactionStrategy

2015-05-26 Thread Stefano Ortolani
Hi Jean,

I am trying to solve a similar problem here. I would say that the only
deterministic way is to rebuild the SSTables of that column family via
nodetool scrub.
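
(Roughly, and purely as a sketch assuming a keyspace named 'ks' and a table
named 'cf':)

nodetool scrub ks cf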

Otherwise you'd need to:
* decrease tombstone_threshold
* wait for gc_grace_seconds

Cheers,
Stefano



On Tue, May 26, 2015 at 12:51 PM, Jean Tremblay 
jean.tremb...@zen-innovations.com wrote:

  I played around with these settings, namely the tombstone_threshold, and
 it **eventually** triggered a Tombstone Compaction.
  Now I see that getting rid of these tombstones is a process which takes
  some time.

  I would like to be able to schedule a Tombstone Compaction.

  Is there a way to trigger immediately a Tombstone Compaction on a table
 which is using LeveledCompactionStrategy?


  Thanks a lot for your help

  Jean

  On 14 May 2015, at 22:45 , Nate McCall n...@thelastpickle.com wrote:

  You can make LCS more aggressive with tombstone-only compactions via
 setting unchecked_tombstone_compaction=true and turning down
 tombstone_threshold to 0.05 (maybe going up or down as needed). Details on
 both can be found here:
 http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html

  As for monitoring tombstones, there is a tombstoneScannedHistogram on
 ColumnFamilyMetrics which measures how many tombstones were discarded
 during reads.

  Also, you should take a couple of SSTables from production and use the
 sstablemetadata utility, specifically looking at the "Estimated droppable
 tombstones" and "Estimated tombstone drop times" output from such.
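
 (As a quick sanity check on a node, something along these lines should do it;
 this assumes default data locations and 2.1-style directory layout, so adjust
 the path for your keyspace and table:)

 # print droppable-tombstone estimates for every sstable of a given table
 for f in /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db; do
   echo "$f"; sstablemetadata "$f" | grep -i tombstone
 done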

  Spend some time experimenting with those settings incrementally. The sweet
 spot is different for each workload, and finding it will make a huge
 difference in overall performance.



 On Thu, May 14, 2015 at 8:06 AM, Jean Tremblay 
 jean.tremb...@zen-innovations.com wrote:
 
  Hi,
 
  I’m using Cassandra 2.1.4 with a table using LeveledCompactionStrategy.
  Often I need to delete many rows and I want to make sure I don’t have
 too many tombstones.
 
  How does one get rid of tombstones in a table using LCS?
  How can we monitor how many tombstones are around?
 
  Thanks for your help.
 
  Jean




 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder  Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com





Re: Leveled Compaction Strategy with a really intensive delete workload

2015-05-25 Thread Stefano Ortolani
Hi all,

Thanks for your answers! Yes, I agree that a delete intensive workload is not 
something Cassandra is designed for.

Unfortunately this is to cope with some unexpected data transformations that I 
hope are a temporary thing.

We chose LCS strategy because of really wide rows which were spanning several 
SStables with other compaction strategies (and hence leading to high latency 
read queries).

I was honestly thinking of scrapping and rebuilding the SSTable from scratch if 
this workload is confirmed to be temporary. Knowing the answer to my question 
above would help me second-guess my decision a bit less :)

Cheers,
Stefano

 On Mon, May 25, 2015 at 9:52 AM, Jason Wee peich...@gmail.com wrote:
 , due to a really intensive delete workloads, the SSTable is promoted to 
 t..
 
 Is Cassandra designed for *delete* workloads? I doubt so. Perhaps look at some 
 other alternative like TTL?
 
 jason
 
 On Mon, May 25, 2015 at 10:12 AM, Manoj Khangaonkar khangaon...@gmail.com 
 wrote:
 Hi,
 
 For a delete intensive workload ( translate to write intensive), is there 
 any reason to use leveled compaction ? The recommendation seems to be that 
 leveled compaction is suited for read intensive workloads.
 
 Depending on your use case, you might be better off with the date-tiered or 
 size-tiered strategy.
 
 regards
 
 regards
 
 On Sun, May 24, 2015 at 10:50 AM, Stefano Ortolani ostef...@gmail.com 
 wrote:
 Hi all,
 
 I have a question re leveled compaction strategy that has been bugging me 
 quite a lot lately. Based on what I understood, a compaction takes place 
 when the SSTable gets to a specific size (10 times the size of its previous 
 generation). My question is about an edge case where, due to a really 
 intensive delete workload, the SSTable is promoted to the next level (say 
 L1) and its size, because of the many evicted tombstones, falls back to 1/10 
 of its size (hence to a size compatible with the previous generation, L0). 
 
 What happens in this case? If the next major compaction is set to happen 
 when the SSTable is promoted to L2, well, that might take too long and too 
 many tombstones could then appear in the meantime (and queries might 
 subsequently fail). Wouldn't it be more correct to keep the SSTable's 
 generation at its previous value (namely, not changing it even if a major 
 compaction took place)?
 
 Regards,
 Stefano Ortolani
 
 
 
 -- 
 http://khangaonkar.blogspot.com/



Re: Leveled Compaction Strategy with a really intensive delete workload

2015-05-25 Thread Stefano Ortolani
Ok, I am reading a bit more about compaction subproperties here (
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html)
and it seems that tombstone_threshold and unchecked_tombstone_compaction
might come in handy.

Does anybody know if changing any of these values (via ALTER) is possible
without downtime, and how fast those values are picked up?
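
(To be concrete, the kind of change I have in mind is just a live schema ALTER
along these lines; the table name and values are made up, so treat it only as a
sketch:)

cqlsh -e "ALTER TABLE ks.cf WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'tombstone_threshold': '0.1',
  'unchecked_tombstone_compaction': 'true'};"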

Cheers,
Stefano


On Mon, May 25, 2015 at 1:32 PM, Stefano Ortolani ostef...@gmail.com
wrote:

 Hi all,

 Thanks for your answers! Yes, I agree that a delete intensive workload is
 not something Cassandra is designed for.

 Unfortunately this is to cope with some unexpected data transformations
 that I hope are a temporary thing.

 We chose LCS strategy because of really wide rows which were spanning
 several SStables with other compaction strategies (and hence leading to
 high latency read queries).

 I was honestly thinking of scrapping and rebuilding the SSTable from
 scratch if this workload is confirmed to be temporary. Knowing the answer
 to my question above would help me second-guess my decision a bit less :)

 Cheers,
 Stefano

 On Mon, May 25, 2015 at 9:52 AM, Jason Wee peich...@gmail.com wrote:

 , due to a really intensive delete workloads, the SSTable is promoted
 to t..

 Is Cassandra designed for *delete* workloads? I doubt so. Perhaps look at
 some other alternative like TTL?

 jason

 On Mon, May 25, 2015 at 10:12 AM, Manoj Khangaonkar 
 khangaon...@gmail.com wrote:

 Hi,

 For a delete intensive workload ( translate to write intensive), is
 there any reason to use leveled compaction ? The recommendation seems to be
 that leveled compaction is suited for read intensive workloads.

 Depending on your use case, you might be better off with the date-tiered or
 size-tiered strategy.

 regards

 regards

 On Sun, May 24, 2015 at 10:50 AM, Stefano Ortolani ostef...@gmail.com
 wrote:

 Hi all,

 I have a question re leveled compaction strategy that has been bugging
 me quite a lot lately. Based on what I understood, a compaction takes place
 when the SSTable gets to a specific size (10 times the size of its previous
 generation). My question is about an edge case where, due to a really
 intensive delete workload, the SSTable is promoted to the next level (say
 L1) and its size, because of the many evicted tombstones, falls back to 1/10
 of its size (hence to a size compatible with the previous generation, L0).

 What happens in this case? If the next major compaction is set to
 happen when the SSTable is promoted to L2, well, that might take too long
 and too many tombstones could then appear in the meantime (and queries
 might subsequently fail). Wouldn't it be more correct to keep the SSTable's
 generation at its previous value (namely, not changing it even if a major
 compaction took place)?

 Regards,
 Stefano Ortolani




 --
 http://khangaonkar.blogspot.com/






Leveled Compaction Strategy with a really intensive delete workload

2015-05-24 Thread Stefano Ortolani
Hi all,

I have a question re leveled compaction strategy that has been bugging me
quite a lot lately. Based on what I understood, a compaction takes place
when the SSTable gets to a specific size (10 times the size of its previous
generation). My question is about an edge case where, due to a really
intensive delete workload, the SSTable is promoted to the next level (say
L1) and its size, because of the many evicted tombstones, falls back to 1/10
of its size (hence to a size compatible with the previous generation, L0).

What happens in this case? If the next major compaction is set to happen
when the SSTable is promoted to L2, well, that might take too long and too
many tombstones could then appear in the meantime (and queries might
subsequently fail). Wouldn't it be more correct to keep the SSTable's
generation at its previous value (namely, not changing it even if a major
compaction took place)?

Regards,
Stefano Ortolani


Re: Recommissioned a node

2015-02-12 Thread Stefano Ortolani
Definitely, I feel the very same about this issue.

On Thu, Feb 12, 2015 at 7:04 AM, Eric Stevens migh...@gmail.com wrote:

 I definitely find it surprising that a node which was decommissioned is
 willing to rejoin a cluster.  I can't think of any legitimate scenario
 where you'd want that, and I'm surprised the node doesn't track that it was
 decommissioned and refuse to rejoin without at least a -D flag to force it.

 Way too easy for a node to get restarted with, for example, a naive
 service status checker, or a scheduled reboot after a kernel upgrade, or so
 forth. You may also have been decommissioning the node because of
 hardware issues, in which case you may also be threatening the stability or
 performance characteristics of the cluster, and at absolute best you have
 short term consistency issues, and near-100% overstreaming to get the node
 decommissioned again.

 IMO, especially with the threat to unrecoverable consistency violations,
 this should be a critical bug.

 On Wed, Feb 11, 2015 at 12:39 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 And after decreasing your RF (rare but happens)

 On Wed Feb 11 2015 at 11:31:38 AM Robert Coli rc...@eventbrite.com
 wrote:

 On Wed, Feb 11, 2015 at 11:20 AM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 It could, because the tombstones that mark data deleted may have been
 removed.  There would be nothing that says this data is gone.

 If you're worried about it, turn up your gc grace seconds.  Also, don't
 revive nodes back into a cluster with old data sitting on them.


 Also, run cleanup after range movements :

 https://issues.apache.org/jira/browse/CASSANDRA-7764

 =Rob






Re: Recommissioned a node

2015-02-11 Thread Stefano Ortolani
Hi Eric,

thanks for your answer. The reason why it got recommissioned was simply
because the machine got restarted (with auto_bootstrap set to true). A
cleaner, and correct, recommission would have just required wiping the data
folder, am I correct? Or would I have needed to change something else in
the node configuration?
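
(Concretely, I assume the clean way would have been something like the
following, with default package paths; please correct me if I'm missing a
step:)

sudo service cassandra stop
# wipe data, commitlog and saved caches so the node bootstraps as a new member
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# make sure auto_bootstrap is not set to false in cassandra.yaml, then
sudo service cassandra start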

Cheers,
Stefano

On Wed, Feb 11, 2015 at 6:47 AM, Eric Stevens migh...@gmail.com wrote:

 AFAIK it should be ok after the repair completed (it was missing all
 writes while it was decommissioning and while it was offline, and nobody
 would have been keeping hinted handoffs for it, so repair was the right
 thing to do).  Unless RF=N you're now due for a cleanup on the other nodes.
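
 (i.e., on each of the other nodes, roughly:)

 nodetool cleanup        # optionally scoped to a specific keyspace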

 Generally speaking though this was probably not a good idea.  When the
 node came back online, it rejoined the cluster immediately and would have
 been serving client requests without having a consistent view of the data.
 A safer approach would be to wipe the data directory and bootstrap it as a
 clean new member.

 I'm curious what prompted that cycle of decommission then recommission.

 On Tue, Feb 10, 2015 at 10:13 PM, Stefano Ortolani ostef...@gmail.com
 wrote:

 Hi,

I recommissioned a node after decommissioning it.
That happened (1) after a successful decommission (checked), (2) without
wiping the data directory on the node, (3) simply by restarting the
cassandra service. The node now reports itself as healthy and up and running.

Knowing that I issued the repair command and patiently waited for its
completion, can I assume the cluster, and its internals (replicas, balance
between those), to be healthy and as good as new?

 Regards,
 Stefano