Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi,

We can see intermittent GC pauses and mutation drops.

System log reports:

INFO  [Service Thread]  GCInspector.java:252 - ParNew GC in 3816ms.  CMS
Old Gen: 4663180720 -> 5520012520; Par Eden Space: 1718091776 -> 0; Par
Survivor Space: 0 -> 214695936
INFO  [ScheduledTasks:1] MessagingService.java:888 - 228 MUTATION messages
dropped in last 5000ms

PS: As of now, there is no significant load on our cluster; the only load
is from these hints being replayed.
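
For reference, this is roughly how I am checking the backlog from cqlsh (just a
quick sanity check; the column names are assumed from the 2.1 system.hints
schema, and count(*) itself can be slow on a table this size):

    -- total number of hints currently stored on this node
    SELECT count(*) FROM system.hints;

    -- hints queued for one particular node; target_id is that node's host ID
    -- (the UUID below is only a placeholder)
    SELECT count(*) FROM system.hints
    WHERE target_id = 11111111-2222-3333-4444-555555555555;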

Can you please help?

Regards,
Varun Saluja

On 16 May 2017 at 18:28, varun saluja  wrote:

> Hi Nitan,
>
> Thanks for response.
>
> Yes, I could see mutation drops and increase count in system.hints. Is
> there any way , i can proceed to truncate hints like using nodetool
> truncatehints.
>
>
> Regards,
> Varun Saluja
>
> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>
>> Do you see mutation drops?
>> Select count from system.hints; is it increasing?
>>
>> Sent from my iPhone
>>
>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>>
>> Hi Experts,
>>
>> We are facing an issue on our production cluster. A compaction on the
>> system.hints table has been running for the last 2 days.
>>
>>
>> pending tasks: 1
>>    compaction type   keyspace   table     completed          total    unit   progress
>>         Compaction     system   hints   20623021829   877874092407   bytes      2.35%
>> Active compaction remaining time :   0h27m15s
>>
>>
>> Active compaction remaining time shows minutes, but this job appears to be
>> running indefinitely.
>>
>> We have a 3-node cluster on version 2.1.7, and we ran a write-intensive job
>> last week on a particular table.
>> Compaction on this table finished, but the hints table size is growing
>> continuously.
>>
>> Can someone please help me?
>>
>>
>> Thanks & Regards,
>> Varun Saluja
>>
>>
>


Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi Nitan,

A rolling restart did not help; the compaction status is the same after the restart.
No other processes are running here; these are dedicated Cassandra nodes.
Sent from my iPhone

> On 16-May-2017, at 7:16 PM, Nitan Kainth  wrote:
> 
> Have you tried rolling restart?
> Any agent or other process hogging system?
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
>> 
>> Hi Nitan,
>> 
>> Thanks for response.
>> 
>> Yes, I could see mutation drops and increase count in system.hints. Is there 
>> any way , i can proceed to truncate hints like using nodetool truncatehints.
>> 
>> 
>> Regards,
>> Varun Saluja
>> 
>>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>>> Do you see mutation drops?
>>> Select count from system.hints; is it increasing?
>>> 
>>> Sent from my iPhone
>>> 
 On May 16, 2017, at 5:52 AM, varun saluja  wrote:
 
 Hi Experts,
 
 We are facing issue on production cluster. Compaction on system.hint table 
 is running from last 2 days.
 
 
 pending tasks: 1
compaction type   keyspace   table completed  total 
  unit   progress
   Compaction system   hints   20623021829   877874092407   
 bytes  2.35%
 Active compaction remaining time :   0h27m15s
 
 
 Active compaction remaining time shows in minutes.  But, this is job is 
 running like indefinitely.
 
 We have 3 node cluster V 2.1.7. And we ran  write intensive job last week 
 on particular table.
 Compaction on this table finished but hint table size is growing 
 continuously.
 
 Can someone Please help me.
 
 
 Thanks & Regards,
 Varun Saluja
 
>> 


Re: Bootstraping a Node With a Newer Version

2017-05-16 Thread daemeon reiydelle
What makes you think you cannot upgrade the kernel?

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 16, 2017 5:27 AM, "Shalom Sagges"  wrote:

> Hi All,
>
> Hypothetically speaking, let's say I want to upgrade my Cassandra cluster,
> but I also want to perform a major upgrade to the kernel of all nodes.
> In order to upgrade the kernel, I need to reinstall the server, hence lose
> all data on the node.
>
> My question is this, after reinstalling the server with the new kernel,
> can I first install the upgraded Cassandra version and then bootstrap it to
> the cluster?
>
> Since there's already no data on the node, I wish to skip the agonizing
> sstable upgrade process.
>
> Does anyone know if this is doable?
>
> Thanks!
>
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035
> We Create Meaningful Connections
>
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>


Re: Decommissioned node cluster shows as down

2017-05-16 Thread Hannu Kröger
That’s weird. I thought decommission would ultimately remove the node from the 
cluster, because the token(s) should be removed from the ring and the data should be 
streamed to new owners. “DN” is IMHO not a state the node should end up in.

Hannu

> On 16 May 2017, at 19:05, suraj pasuparthy  wrote:
> 
> Yes, you have to run a nodetool removenode to decommission completely. This 
> will also allow another node with the same IP but a different host ID to join the 
> cluster.
> 
> Thanks
> -suraj
> On Tue, May 16, 2017 at 9:01 AM Mark Furlong  wrote:
>
> I have a node I decommissioned on a large ring using 2.1.12. The node
> completed the decommission process and is no longer communicating with the
> rest of the cluster. However when I run a nodetool status on any node in the
> cluster it shows the node as ‘DN’. Why is this and should I just run a
> removenode now?
>
> Thanks,
>
> Mark Furlong
> Sr. Database Administrator
> mfurl...@ancestry.com
> M: 801-859-7427
> O: 801-705-7115
> 1300 W Traverse Pkwy
> Lehi, UT 84043
>



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
That is another way to see the question: are reverse iterators range
tombstone aware? Yes.
That is why I am puzzled by this afore-mentioned behavior.
I would expect them to handle this case more gracefully.

Cheers,
Stefano

On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:

> Hannu,
>
> How can you read a partition in reverse?
>
> Sent from my iPhone
>
> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
> >
> > Well, I’m guessing that Cassandra doesn't really know if the range
> tombstone is useful for this or not.
> >
> > In many cases it might be that the partition contains data that is
> within the range of the tombstone but is newer than the tombstone and
> therefore it might be still be returned. Scanning through deleted data can
> be avoided by reading the partition in reverse (if all the deleted data is
> in the beginning of the partition). Eventually you will still end up
> reading a lot of tombstones but you will get a lot of live data first and
> the implicit query limit of 1 probably is reached before you get to the
> tombstones. Therefore you will get an immediate answer.
> >
> > Does it make sense?
> >
> > Hannu
> >
> >> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
> >>
> >> Hi all,
> >>
> >> I am seeing inconsistencies when mixing range tombstones, wide
> partitions, and reverse iterators.
> >> I still have to understand if the behaviour is to be expected hence the
> message on the mailing list.
> >>
> >> The situation is conceptually simple. I am using a table defined as
> follows:
> >>
> >> CREATE TABLE test_cql.test_cf (
> >>  hash blob,
> >>  timeid timeuuid,
> >>  PRIMARY KEY (hash, timeid)
> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
> >>
> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
> _half_ of that partition by executing the query below, and restart the node:
> >>
> >> DELETE
> >> FROM test_cql.test_cf
> >> WHERE hash = x AND timeid < y;
> >>
> >> If I keep compactions disabled the following query timeouts (takes more
> than 10 seconds to
> >> succeed):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid ASC;
> >>
> >> While the following returns immediately (obviously because no deleted
> data is ever read):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid DESC;
> >>
> >> If I force a compaction the problem is gone, but I presume just because
> the data is rearranged.
> >>
> >> It seems to me that reading by ASC does not make use of the range
> tombstone until C* reads the
> >> last sstables (which actually contains the range tombstone and is
> flushed at node restart), and it wastes time reading all rows that are
> actually not live anymore.
> >>
> >> Is this expected? Should the range tombstone actually help in these
> cases?
> >>
> >> Thanks a lot!
> >> Stefano
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>


Re: Long running compaction on huge hint table.

2017-05-16 Thread Nitan Kainth
Yes, but it means the data has to be replicated using repair.

Hints are the outcome of unhealthy nodes; focus on finding why you have mutation 
drops (is it the node, I/O, or network, etc.?). Ideally you shouldn't see 
increasing hints all the time.

Sent from my iPhone

> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
> 
> Hi Nitan,
> 
> Thanks for response.
> 
> Yes, I could see mutation drops and increase count in system.hints. Is there 
> any way , i can proceed to truncate hints like using nodetool truncatehints.
> 
> 
> Regards,
> Varun Saluja
> 
>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>> Do you see mutation drops?
>> Select count from system.hints; is it increasing?
>> 
>> Sent from my iPhone
>> 
>>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>>> 
>>> Hi Experts,
>>> 
>>> We are facing issue on production cluster. Compaction on system.hint table 
>>> is running from last 2 days.
>>> 
>>> 
>>> pending tasks: 1
>>>compaction type   keyspace   table completed  total  
>>> unit   progress
>>>   Compaction system   hints   20623021829   877874092407   
>>> bytes  2.35%
>>> Active compaction remaining time :   0h27m15s
>>> 
>>> 
>>> Active compaction remaining time shows in minutes.  But, this is job is 
>>> running like indefinitely.
>>> 
>>> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week 
>>> on particular table.
>>> Compaction on this table finished but hint table size is growing 
>>> continuously.
>>> 
>>> Can someone Please help me.
>>> 
>>> 
>>> Thanks & Regards,
>>> Varun Saluja
>>> 
> 


Re: Long running compaction on huge hint table.

2017-05-16 Thread Nitan Kainth
Have you tried a rolling restart?
Is any agent or other process hogging the system?

Sent from my iPhone

> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
> 
> Hi Nitan,
> 
> Thanks for response.
> 
> Yes, I could see mutation drops and increase count in system.hints. Is there 
> any way , i can proceed to truncate hints like using nodetool truncatehints.
> 
> 
> Regards,
> Varun Saluja
> 
>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>> Do you see mutation drops?
>> Select count from system.hints; is it increasing?
>> 
>> Sent from my iPhone
>> 
>>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>>> 
>>> Hi Experts,
>>> 
>>> We are facing issue on production cluster. Compaction on system.hint table 
>>> is running from last 2 days.
>>> 
>>> 
>>> pending tasks: 1
>>>compaction type   keyspace   table completed  total  
>>> unit   progress
>>>   Compaction system   hints   20623021829   877874092407   
>>> bytes  2.35%
>>> Active compaction remaining time :   0h27m15s
>>> 
>>> 
>>> Active compaction remaining time shows in minutes.  But, this is job is 
>>> running like indefinitely.
>>> 
>>> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week 
>>> on particular table.
>>> Compaction on this table finished but hint table size is growing 
>>> continuously.
>>> 
>>> Can someone Please help me.
>>> 
>>> 
>>> Thanks & Regards,
>>> Varun Saluja
>>> 
> 


Re: Non-zero nodes are marked as down after restarting cassandra process

2017-05-16 Thread Andrew Jorgensen
Thanks for the info!

When you say "overall stability problems due to some bugs", can you
elaborate on whether those were bugs in Cassandra that were fixed by an
upgrade, or bugs in your own code and in how you used Cassandra? If the
latter, would it be possible to highlight the most impactful fix on the
usage side?

As far as I can tell there are no dropped messages; there are some pending
compactions and a few Native-Transport-Requests in the "All time blocked"
column.

Thanks!

Andrew Jorgensen
@ajorgensen

On Wed, Mar 1, 2017 at 12:58 PM, benjamin roth  wrote:

> You should always drain nodes before stopping the daemon whenever
> possible. This avoids commitlog replay on startup. This can take a while.
> But according to your description commit log replay seems not to be the
> cause.
>
> I once had a similar effect. Some nodes appeared down for some other nodes
> and up for others. At that time the cluster had overall stability problems
> due to some bugs. After those bugs have gone, I haven't seen this effect
> any more.
>
> If that happens again to you, you could check your logs or "nodetool
> tpstats" for dropped messages, watch out for suspicious network-related
> logs and the load of your nodes in general.
>
> 2017-03-01 17:36 GMT+01:00 Ben Dalling :
>
>> Hi Andrew,
>>
>> We were having problems with gossip TCP connections being held open and
>> changed our SOP for stopping cassandra to being:
>>
>> nodetool disablegossip
>> nodetool drain
>> service cassandra stop
>>
>> This seemed to close down the gossip cleanly (the nodetool drain is
>> advised as well) and meant that the node rejoined the cluster fine after
>> issuing "service cassandra start".
>>
>> *Ben*
>>
>> On 1 March 2017 at 16:29, Andrew Jorgensen 
>> wrote:
>>
>>> Helllo,
>>>
>>> I have a cassandra cluster running on cassandra 3.0.3 and am seeing some
>>> strange behavior that I cannot explain when restarting cassandra nodes. The
>>> cluster is currently setup in a single datacenter and consists of 55 nodes.
>>> I am currently in the process of restarting nodes in the cluster but have
>>> noticed that after restarting the cassandra process with `service cassandra
>>> stop; service cassandra start`, when the node comes back and I run `nodetool
>>> status` there is usually a non-zero number of nodes in the rest of the
>>> cluster that are marked as DN. If I go to another node in the cluster,
>>> from its perspective all nodes including the restarted one are marked as UN.
>>> It seems to take ~15 to 20 minutes before the restarted node is updated to
>>> show all nodes as UN. During the 15 minutes, writes and reads to the
>>> cluster appear to be degraded and do not recover unless I stop the
>>> cassandra process again or wait for all nodes to be marked as UN. The
>>> cluster also has 3 seed nodes which during this process are up and
>>> available the whole time.
>>>
>>> I have also tried doing `gossipinfo` on the restarted node and according
>>> to the output all nodes have a status of NORMAL. Has anyone seen this
>>> before and is there anything I can do to fix/reduce the impact of running a
>>> restart on a cassandra node?
>>>
>>> Thanks,
>>> Andrew Jorgensen
>>> @ajorgensen
>>>
>>
>>
>


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Hello,

If you mean how to construct a query like that: you use an ORDER BY clause with 
SELECT that is the reverse of the default, just like in the example below. If the 
table is created with “CLUSTERING ORDER BY (timeid ASC)” and you query 
“SELECT ... ORDER BY timeid DESC”, then the partition is read backwards. I 
don’t know how it is technically done, but it is apparently slightly slower than 
reading the partition normally.
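
For example, with the table from the original mail, the two directions would be
queried roughly like this (just a sketch; 0x01 is only a placeholder partition key):

    -- the table was created WITH CLUSTERING ORDER BY (timeid ASC)

    -- forward read, follows the stored clustering order (oldest timeid first)
    SELECT * FROM test_cql.test_cf WHERE hash = 0x01 ORDER BY timeid ASC LIMIT 100;

    -- reverse read, walks the same partition backwards (newest timeid first)
    SELECT * FROM test_cql.test_cf WHERE hash = 0x01 ORDER BY timeid DESC LIMIT 100;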

Hannu 

> On 16 May 2017, at 17:29, Nitan Kainth  wrote:
> 
> Hannu,
> 
> How can you read a partition in reverse? 
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
>> 
>> Well, I’m guessing that Cassandra doesn't really know if the range tombstone 
>> is useful for this or not. 
>> 
>> In many cases it might be that the partition contains data that is within 
>> the range of the tombstone but is newer than the tombstone and therefore it 
>> might be still be returned. Scanning through deleted data can be avoided by 
>> reading the partition in reverse (if all the deleted data is in the 
>> beginning of the partition). Eventually you will still end up reading a lot 
>> of tombstones but you will get a lot of live data first and the implicit 
>> query limit of 1 probably is reached before you get to the tombstones. 
>> Therefore you will get an immediate answer.
>> 
>> Does it make sense?
>> 
>> Hannu
>> 
>>> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
>>> 
>>> Hi all,
>>> 
>>> I am seeing inconsistencies when mixing range tombstones, wide partitions, 
>>> and reverse iterators.
>>> I still have to understand if the behaviour is to be expected hence the 
>>> message on the mailing list.
>>> 
>>> The situation is conceptually simple. I am using a table defined as follows:
>>> 
>>> CREATE TABLE test_cql.test_cf (
>>> hash blob,
>>> timeid timeuuid,
>>> PRIMARY KEY (hash, timeid)
>>> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> 
>>> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>>> _half_ of that partition by executing the query below, and restart the node:
>>> 
>>> DELETE 
>>> FROM test_cql.test_cf 
>>> WHERE hash = x AND timeid < y;
>>> 
>>> If I keep compactions disabled the following query timeouts (takes more 
>>> than 10 seconds to 
>>> succeed):
>>> 
>>> SELECT * 
>>> FROM test_cql.test_cf 
>>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>>> ORDER BY timeid ASC;
>>> 
>>> While the following returns immediately (obviously because no deleted data 
>>> is ever read):
>>> 
>>> SELECT * 
>>> FROM test_cql.test_cf 
>>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>>> ORDER BY timeid DESC;
>>> 
>>> If I force a compaction the problem is gone, but I presume just because the 
>>> data is rearranged.
>>> 
>>> It seems to me that reading by ASC does not make use of the range tombstone 
>>> until C* reads the
>>> last sstables (which actually contains the range tombstone and is flushed 
>>> at node restart), and it wastes time reading all rows that are actually not 
>>> live anymore. 
>>> 
>>> Is this expected? Should the range tombstone actually help in these cases?
>>> 
>>> Thanks a lot!
>>> Stefano
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Thanks a lot Jeff.

You have explained it very well here. We write with LOCAL_QUORUM consistency, so 
we will truncate the hints and run repair thereafter.

I hope this brings the cluster into a stable state.

Thanks again.

Regards,
Varun Saluja

Sent from my iPhone

> On 16-May-2017, at 8:42 PM, Jeff Jirsa  wrote:
> 
> 
> In Cassandra versions up to 3.0, hints are stored within a table, where the 
> partition key is the host ID of the server for which the hints are stored.
> 
> In such a data model, accumulating 800GB of hints is almost certain to cause 
> very wide rows, which will in turn cause GC pressure when you attempt to read 
> the hints for delivery. This will cause GC pauses, which will cause hints to 
> fail to be delivered, which will cause more hints to be stored. This is bad.
> 
> In 3.0, hints were rewritten to work around this design flaw. In 2.1, your 
> most likely corrective course is to use 'nodetool truncatehints' on all 
> servers, followed by 'nodetool repair' to deliver the data you lost by 
> truncating the hints.
> 
> NOTE: this is ONLY safe if you wrote with a consistency level stronger than 
> CL:ANY. If you wrote this data with CL:ANY, you may lose data if you truncate 
> hints.
> 
> - Jeff
> 
>> On 2017-05-16 06:50 (-0700), varun saluja  wrote: 
>> Thanks for update.
>> I could see lot of io waits. This causing  Gc and mutation drops .
>> But as i mentioned we do not have high load for now. Hint replays are 
>> creating such high disk I/O.
>> compactionstats show very high hint bytes like 780gb around. Is this normal?
>> 
>> Just mentioning we are using flash disks.
>> 
>> In such case, if i run truncatehints , will it remove or decrease size of 
>> hints bytes in compaction stats. I can trigger repair therafter.
>> Please let me know if any recommendation on same.
>> 
>> Also , table which we dumped from kafka which created this much hints and 
>> compaction pendings is also dropped today. Because we have to redump table 
>> again once cluster is stable.
>> 
>> Regards,
>> Varun
>> 
>> Sent from my iPhone
>> 
>>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
>>> 
>>> Yes but it means data has to be replicated using repair.
>>> 
>>> Hints are out come of unhealthy nodes, focus on finding why you have 
>>> mutation drops, is it node, io or network etc. ideally you shouldn't see 
>>> increasing hints all the time.
>>> 
>>> Sent from my iPhone
>>> 
 On May 16, 2017, at 7:58 AM, varun saluja  wrote:
 
 Hi Nitan,
 
 Thanks for response.
 
 Yes, I could see mutation drops and increase count in system.hints. Is 
 there any way , i can proceed to truncate hints like using nodetool 
 truncatehints.
 
 
 Regards,
 Varun Saluja
 
> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
> Do you see mutation drops?
> Select count from system.hints; is it increasing?
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>> 
>> Hi Experts,
>> 
>> We are facing issue on production cluster. Compaction on system.hint 
>> table is running from last 2 days.
>> 
>> 
>> pending tasks: 1
>>   compaction type   keyspace   table completed  total
>>   unit   progress
>>  Compaction system   hints   20623021829   877874092407  
>>  bytes  2.35%
>> Active compaction remaining time :   0h27m15s
>> 
>> 
>> Active compaction remaining time shows in minutes.  But, this is job is 
>> running like indefinitely.
>> 
>> We have 3 node cluster V 2.1.7. And we ran  write intensive job last 
>> week on particular table.
>> Compaction on this table finished but hint table size is growing 
>> continuously.
>> 
>> Can someone Please help me.
>> 
>> 
>> Thanks & Regards,
>> Varun Saluja
>> 
 
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Bootstraping a Node With a Newer Version

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 05:27 (-0700), Shalom Sagges  wrote: 
> Hi All,
> 
> Hypothetically speaking, let's say I want to upgrade my Cassandra cluster,
> but I also want to perform a major upgrade to the kernel of all nodes.
> In order to upgrade the kernel, I need to reinstall the server, hence lose
> all data on the node.
> 

That sounds unpleasant. Is it really the case that you can't upgrade a kernel 
without wiping data? Even AWS ephemeral instances can handle a reboot in place 
without an ephemeral drive reset.

> My question is this, after reinstalling the server with the new kernel, can
> I first install the upgraded Cassandra version and then bootstrap it to the
> cluster?
> 
> Since there's already no data on the node, I wish to skip the agonizing
> sstable upgrade process.
> 
> Does anyone know if this is doable?

Not supported, and not generally a good idea.



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Hi all,

I am seeing inconsistencies when mixing range tombstones, wide partitions,
and reverse iterators.
I still have to understand whether this behaviour is to be expected, hence the
message on the mailing list.

The situation is conceptually simple. I am using a table defined as follows:

CREATE TABLE test_cql.test_cf (
  hash blob,
  timeid timeuuid,
  PRIMARY KEY (hash, timeid)
) WITH CLUSTERING ORDER BY (timeid ASC)
  AND compaction = {'class' : 'LeveledCompactionStrategy'};

I then proceed by loading 2/3GB from 3 sstables which I know contain a
really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
_half_ of that partition by executing the query below, and restart the node:

DELETE
FROM test_cql.test_cf
WHERE hash = x AND timeid < y;

If I keep compactions disabled, the following query times out (it takes more
than 10 seconds to succeed):

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid ASC;

While the following returns immediately (obviously because no deleted data
is ever read):

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid DESC;

If I force a compaction the problem is gone, but I presume just because the
data is rearranged.

It seems to me that reading in ASC order does not make use of the range tombstone
until C* reads the last sstable (which actually contains the range tombstone and
is flushed at node restart), and it wastes time reading all the rows that are no
longer live.

Is this expected? Should the range tombstone actually help in these cases?

Thanks a lot!
Stefano


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
Hannu,

How can you read a partition in reverse? 

Sent from my iPhone

> On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
> 
> Well, I’m guessing that Cassandra doesn't really know if the range tombstone 
> is useful for this or not. 
> 
> In many cases it might be that the partition contains data that is within the 
> range of the tombstone but is newer than the tombstone and therefore it might 
> be still be returned. Scanning through deleted data can be avoided by reading 
> the partition in reverse (if all the deleted data is in the beginning of the 
> partition). Eventually you will still end up reading a lot of tombstones but 
> you will get a lot of live data first and the implicit query limit of 1 
> probably is reached before you get to the tombstones. Therefore you will get 
> an immediate answer.
> 
> Does it make sense?
> 
> Hannu
> 
>> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
>> 
>> Hi all,
>> 
>> I am seeing inconsistencies when mixing range tombstones, wide partitions, 
>> and reverse iterators.
>> I still have to understand if the behaviour is to be expected hence the 
>> message on the mailing list.
>> 
>> The situation is conceptually simple. I am using a table defined as follows:
>> 
>> CREATE TABLE test_cql.test_cf (
>>  hash blob,
>>  timeid timeuuid,
>>  PRIMARY KEY (hash, timeid)
>> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> 
>> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>> _half_ of that partition by executing the query below, and restart the node:
>> 
>> DELETE 
>> FROM test_cql.test_cf 
>> WHERE hash = x AND timeid < y;
>> 
>> If I keep compactions disabled the following query timeouts (takes more than 
>> 10 seconds to 
>> succeed):
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> ORDER BY timeid ASC;
>> 
>> While the following returns immediately (obviously because no deleted data 
>> is ever read):
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> ORDER BY timeid DESC;
>> 
>> If I force a compaction the problem is gone, but I presume just because the 
>> data is rearranged.
>> 
>> It seems to me that reading by ASC does not make use of the range tombstone 
>> until C* reads the
>> last sstables (which actually contains the range tombstone and is flushed at 
>> node restart), and it wastes time reading all rows that are actually not 
>> live anymore. 
>> 
>> Is this expected? Should the range tombstone actually help in these cases?
>> 
>> Thanks a lot!
>> Stefano
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Hi Hannu,

the piece of data in question is older; in my example the tombstone is the
newest piece of data.
Since a range tombstone carries information about the clustering key range it
covers, and the data is sorted by clustering key, I would expect a linear scan
not to be necessary.
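
(As a side note, one workaround on my side could be to bound the slice
explicitly so the deleted range is never touched, assuming the client still
knows the deletion point y used in the range delete; just a sketch:)

    SELECT *
    FROM test_cql.test_cf
    WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
      AND timeid >= y          -- y = the bound used in "timeid < y" of the DELETE
    ORDER BY timeid ASC;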

On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:

> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip
> bigger regions of deleted data based on range tombstone. If some piece of
> data in a partition is newer than the tombstone, then it cannot be skipped.
> Therefore some partition level statistics of cell ages would need to be
> kept in the column index for the skipping and that is probably not there.
>
> Hannu
>
> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>
> That is another way to see the question: are reverse iterators range
> tombstone aware? Yes.
> That is why I am puzzled by this afore-mentioned behavior.
> I would expect them to handle this case more gracefully.
>
> Cheers,
> Stefano
>
> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>
>> Hannu,
>>
>> How can you read a partition in reverse?
>>
>> Sent from my iPhone
>>
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range
>> tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is
>> within the range of the tombstone but is newer than the tombstone and
>> therefore it might be still be returned. Scanning through deleted data can
>> be avoided by reading the partition in reverse (if all the deleted data is
>> in the beginning of the partition). Eventually you will still end up
>> reading a lot of tombstones but you will get a lot of live data first and
>> the implicit query limit of 1 probably is reached before you get to the
>> tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide
>> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence
>> the message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as
>> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
>> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>> _half_ of that partition by executing the query below, and restart the node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes
>> more than 10 seconds to
>> >> succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted
>> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just
>> because the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range
>> tombstone until C* reads the
>> >> last sstables (which actually contains the range tombstone and is
>> flushed at node restart), and it wastes time reading all rows that are
>> actually not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these
>> cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>> >
>>
>
>
>


Re: Long running compaction on huge hint table.

2017-05-16 Thread Jeff Jirsa

In Cassandra versions up to 3.0, hints are stored within a table, where the 
partition key is the host ID of the server for which the hints are stored.
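
Roughly, the 2.1 table looks like this (from memory, so treat the exact column
list as approximate):

    CREATE TABLE system.hints (
        target_id uuid,        -- host ID of the node the hint is destined for (partition key)
        hint_id timeuuid,
        message_version int,
        mutation blob,         -- the serialized mutation to be replayed
        PRIMARY KEY (target_id, hint_id, message_version)
    ) WITH COMPACT STORAGE;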

In such a data model, accumulating 800GB of hints is almost certain to cause 
very wide rows, which will in turn cause GC pressure when you attempt to read 
the hints for delivery. This will cause GC pauses, which will cause hints to 
fail to be delivered, which will cause more hints to be stored. This is bad.

In 3.0, hints were rewritten to work around this design flaw. In 2.1, your most 
likely corrective course is to use 'nodetool truncatehints' on all servers, 
followed by 'nodetool repair' to deliver the data you lost by truncating the 
hints.

NOTE: this is ONLY safe if you wrote with a consistency level stronger than 
CL:ANY. If you wrote this data with CL:ANY, you may lose data if you truncate 
hints.

- Jeff

On 2017-05-16 06:50 (-0700), varun saluja  wrote: 
> Thanks for update.
> I could see lot of io waits. This causing  Gc and mutation drops .
> But as i mentioned we do not have high load for now. Hint replays are 
> creating such high disk I/O.
> compactionstats show very high hint bytes like 780gb around. Is this normal?
> 
> Just mentioning we are using flash disks.
> 
> In such case, if i run truncatehints , will it remove or decrease size of 
> hints bytes in compaction stats. I can trigger repair therafter.
> Please let me know if any recommendation on same.
> 
> Also , table which we dumped from kafka which created this much hints and 
> compaction pendings is also dropped today. Because we have to redump table 
> again once cluster is stable.
> 
> Regards,
> Varun
> 
> Sent from my iPhone
> 
> > On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
> > 
> > Yes but it means data has to be replicated using repair.
> > 
> > Hints are out come of unhealthy nodes, focus on finding why you have 
> > mutation drops, is it node, io or network etc. ideally you shouldn't see 
> > increasing hints all the time.
> > 
> > Sent from my iPhone
> > 
> >> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
> >> 
> >> Hi Nitan,
> >> 
> >> Thanks for response.
> >> 
> >> Yes, I could see mutation drops and increase count in system.hints. Is 
> >> there any way , i can proceed to truncate hints like using nodetool 
> >> truncatehints.
> >> 
> >> 
> >> Regards,
> >> Varun Saluja
> >> 
> >>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
> >>> Do you see mutation drops?
> >>> Select count from system.hints; is it increasing?
> >>> 
> >>> Sent from my iPhone
> >>> 
>  On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>  
>  Hi Experts,
>  
>  We are facing issue on production cluster. Compaction on system.hint 
>  table is running from last 2 days.
>  
>  
>  pending tasks: 1
> compaction type   keyspace   table completed  total   
> unit   progress
>    Compaction system   hints   20623021829   877874092407 
>    bytes  2.35%
>  Active compaction remaining time :   0h27m15s
>  
>  
>  Active compaction remaining time shows in minutes.  But, this is job is 
>  running like indefinitely.
>  
>  We have 3 node cluster V 2.1.7. And we ran  write intensive job last 
>  week on particular table.
>  Compaction on this table finished but hint table size is growing 
>  continuously.
>  
>  Can someone Please help me.
>  
>  
>  Thanks & Regards,
>  Varun Saluja
>  
> >> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Read timeouts

2017-05-16 Thread Nitan Kainth
Hi,

We see read timeouts intermittently, and mostly notice them after they have 
occurred. The timeouts are not consistent and do not occur in the hundreds at a 
time.

1. Is a read timeout counted as a dropped mutation?
2. What is the best way to nail down the exact cause of scattered timeouts?

Thank you.
-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Yes, that was my intention but I wanted to cross-check with the ML and the
devs keeping an eye on it first.

On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:

> Well,
>
> sstables contain some statistics about the cell timestamps and using that
> information and the tombstone timestamp it might be possible to skip some
> data but I’m not sure that Cassandra currently does that. Maybe it would be
> worth a JIRA ticket and see what the devs think about it. If optimizing
> this case would make sense.
>
> Hannu
>
> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>> skip bigger regions of deleted data based on range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>>
>> Hannu
>>
>> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>>
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> Sent from my iPhone
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might be still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 1 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani 
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes
>>> more than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just
>>> because the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is
>>> flushed at node restart), and it wastes time reading all rows that are
>>> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> > For additional commands, e-mail: 

Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Thanks Nitan.
Appreciate your help.

Can anyone suggest a parameter change or anything else that can help in this 
situation?

Regards,
Varun 

Sent from my iPhone

> On 16-May-2017, at 7:31 PM, Nitan Kainth  wrote:
> 
> If the target table is dropped then you can remove its hints, but there could be 
> more hints from other tables. If those are tables of interest to you, then I won't 
> comment on truncating hints.
> 
> The size of the hints depends on the Kafka load; it looks like you overloaded the 
> cluster during the data load and the hints are now just recovering from it. I would 
> say wait until the cluster comes back to a normal state. Maybe some other expert can 
> suggest an alternative.
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 8:50 AM, varun saluja  wrote:
>> 
>> Thanks for update.
>> I could see lot of io waits. This causing  Gc and mutation drops .
>> But as i mentioned we do not have high load for now. Hint replays are 
>> creating such high disk I/O.
>> compactionstats show very high hint bytes like 780gb around. Is this normal?
>> 
>> Just mentioning we are using flash disks.
>> 
>> In such case, if i run truncatehints , will it remove or decrease size of 
>> hints bytes in compaction stats. I can trigger repair therafter.
>> Please let me know if any recommendation on same.
>> 
>> Also , table which we dumped from kafka which created this much hints and 
>> compaction pendings is also dropped today. Because we have to redump table 
>> again once cluster is stable.
>> 
>> Regards,
>> Varun
>> 
>> Sent from my iPhone
>> 
>>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
>>> 
>>> Yes but it means data has to be replicated using repair.
>>> 
>>> Hints are out come of unhealthy nodes, focus on finding why you have 
>>> mutation drops, is it node, io or network etc. ideally you shouldn't see 
>>> increasing hints all the time.
>>> 
>>> Sent from my iPhone
>>> 
 On May 16, 2017, at 7:58 AM, varun saluja  wrote:
 
 Hi Nitan,
 
 Thanks for response.
 
 Yes, I could see mutation drops and increase count in system.hints. Is 
 there any way , i can proceed to truncate hints like using nodetool 
 truncatehints.
 
 
 Regards,
 Varun Saluja
 
> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
> Do you see mutation drops?
> Select count from system.hints; is it increasing?
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>> 
>> Hi Experts,
>> 
>> We are facing issue on production cluster. Compaction on system.hint 
>> table is running from last 2 days.
>> 
>> 
>> pending tasks: 1
>>compaction type   keyspace   table completed  total   
>>unit   progress
>>   Compaction system   hints   20623021829   877874092407 
>>   bytes  2.35%
>> Active compaction remaining time :   0h27m15s
>> 
>> 
>> Active compaction remaining time shows in minutes.  But, this is job is 
>> running like indefinitely.
>> 
>> We have 3 node cluster V 2.1.7. And we ran  write intensive job last 
>> week on particular table.
>> Compaction on this table finished but hint table size is growing 
>> continuously.
>> 
>> Can someone Please help me.
>> 
>> 
>> Thanks & Regards,
>> Varun Saluja
>> 
 


Re: Long running compaction on huge hint table.

2017-05-16 Thread Nitan Kainth
You can control compaction with nodetool setcompactionthroughput, but that will just 
slow down compaction and free up resources for the application; it's not a fix.

Sent from my iPhone

> On May 16, 2017, at 9:15 AM, varun saluja  wrote:
> 
> Thanks Nitan.
> Appreciate your help.
> 
> Can anyone suggest parameter change or something which can help in this 
> situation.
> 
> Regards,
> Varun 
> 
> Sent from my iPhone
> 
>> On 16-May-2017, at 7:31 PM, Nitan Kainth  wrote:
>> 
>> If target table is dropped then you can remove its hints but there could be 
>> more hints from other table. If it has tables of your interest , then I 
>> won't comment on truncating hints.
>> 
>> Size of hints depends on Kafka load , looks like you had overloaded the 
>> cluster during data load and not hints are just recovering from it. I would 
>> say wait until cluster comes to normal state. May be some other expert can 
>> suggest an alternate.
>> 
>> Sent from my iPhone
>> 
>>> On May 16, 2017, at 8:50 AM, varun saluja  wrote:
>>> 
>>> Thanks for update.
>>> I could see lot of io waits. This causing  Gc and mutation drops .
>>> But as i mentioned we do not have high load for now. Hint replays are 
>>> creating such high disk I/O.
>>> compactionstats show very high hint bytes like 780gb around. Is this normal?
>>> 
>>> Just mentioning we are using flash disks.
>>> 
>>> In such case, if i run truncatehints , will it remove or decrease size of 
>>> hints bytes in compaction stats. I can trigger repair therafter.
>>> Please let me know if any recommendation on same.
>>> 
>>> Also , table which we dumped from kafka which created this much hints and 
>>> compaction pendings is also dropped today. Because we have to redump table 
>>> again once cluster is stable.
>>> 
>>> Regards,
>>> Varun
>>> 
>>> Sent from my iPhone
>>> 
 On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
 
 Yes but it means data has to be replicated using repair.
 
 Hints are out come of unhealthy nodes, focus on finding why you have 
 mutation drops, is it node, io or network etc. ideally you shouldn't see 
 increasing hints all the time.
 
 Sent from my iPhone
 
> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
> 
> Hi Nitan,
> 
> Thanks for response.
> 
> Yes, I could see mutation drops and increase count in system.hints. Is 
> there any way , i can proceed to truncate hints like using nodetool 
> truncatehints.
> 
> 
> Regards,
> Varun Saluja
> 
>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>> Do you see mutation drops?
>> Select count from system.hints; is it increasing?
>> 
>> Sent from my iPhone
>> 
>>> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>>> 
>>> Hi Experts,
>>> 
>>> We are facing issue on production cluster. Compaction on system.hint 
>>> table is running from last 2 days.
>>> 
>>> 
>>> pending tasks: 1
>>>compaction type   keyspace   table completed  total  
>>> unit   progress
>>>   Compaction system   hints   20623021829   
>>> 877874092407   bytes  2.35%
>>> Active compaction remaining time :   0h27m15s
>>> 
>>> 
>>> Active compaction remaining time shows in minutes.  But, this is job is 
>>> running like indefinitely.
>>> 
>>> We have 3 node cluster V 2.1.7. And we ran  write intensive job last 
>>> week on particular table.
>>> Compaction on this table finished but hint table size is growing 
>>> continuously.
>>> 
>>> Can someone Please help me.
>>> 
>>> 
>>> Thanks & Regards,
>>> Varun Saluja
>>> 
> 


Re: Reg:- Data Modelling Concepts

2017-05-16 Thread @Nandan@
Hi Jon,

We need to keep track of all updates, so that a 'User' of our platform can
check what changes were made before.
I am thinking along these lines:
CREATE TABLE book_info (
    book_id uuid,
    book_title text,
    author_name text,
    updated_at timestamp,
    PRIMARY KEY (book_id)
);
This table will contain the current details of every book along with its latest
update time.
CREATE TABLE book_title_by_user (
    book_title text,
    book_id uuid,
    user_id uuid,
    ts timeuuid,
    PRIMARY KEY (book_title, book_id, user_id, ts)
);
This table will contain the history of older updates to a book made by multiple
users (a many-to-many relationship).
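
I was also considering a per-book history table that stores the old and new
value of each change explicitly (just a sketch, the names are only
illustrative):

    CREATE TABLE book_updates_by_book (
        book_id uuid,
        ts timeuuid,            -- time of the update, newest first
        changed_field text,     -- e.g. 'author_name', 'book_title', 'price'
        user_id uuid,           -- who made the change
        old_value text,
        new_value text,
        PRIMARY KEY (book_id, ts, changed_field)
    ) WITH CLUSTERING ORDER BY (ts DESC, changed_field ASC);

With this, the latest change for a book is the first row of its partition, and
reverting is a matter of reading old_value for that row.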

What do you think of this?

On Wed, May 17, 2017 at 9:44 AM, Jonathan Haddad  wrote:

> I don't understand why you need to store the old value a second time.  If
> you know that the value went from A -> B -> C, just store the new value,
> not the old.  You can see that it changed from A->B->C without storing it
> twice.
>
> On Tue, May 16, 2017 at 6:36 PM @Nandan@ 
> wrote:
>
>> The requirement is to create a DB in which we keep the updated values as
>> well as which user updated a particular book's details and what they
>> updated.
>>
>> We would like to create a schema which stores book info, as well as the
>> history of updates made to book_title, author, publisher, and price.
>> We want to store both the old data and the new data of each update, and
>> also to see which user made the relevant change, so that if some change
>> was not made correctly it can be checked and reverted based on the old
>> values.
>> We are trying to make a USER based Schema.
>>
>> For example:-
>> id:- 1
>> Name: - Harry Poter
>> Author : - JK Rolling
>>
>> New Update Done by user_id 2:-
>> id :- 1
>> Name:- Harry Pottor
>> Author:- J.K. Rolls
>>
>> Update history also need to store as :-
>> User_id :- 2
>> Old Author :- JK Rolling
>> New Author :- J.K. Rolls
>>
>> So I need to update the details of Book which is done by UPSERT. But also
>> I have to keep details like which user updated and what updated.
>>
>>
>> One thing that helps define the schema is knowing what queries will be
>> made to the database up front.
>> Few queries that the database needs to answer.
>> What are the current details of a book?
>> What is the most recent update to a particular book?
>> What are the updates that have been made to a particular book?
>> What are the details for a particular update?
>>
>>
>> Updates will happen to fields like title, author, price, and publisher,
>> so the update frequency will not be very high.
>>
>> Best Regards,
>> Nandan
>>
>


Re: Long running compaction on huge hint table.

2017-05-16 Thread Jeff Jirsa
You could also try stopping compaction, but that'll probably take a very long 
time as well

Manually stopping each node (one at a time) and removing the sstables from only 
system.hints may be a better option. May want to take a snapshot if you're very 
concerned with that data.




-- 
Jeff Jirsa


> On May 16, 2017, at 6:53 PM, varun saluja  wrote:
> 
> Hi,
> 
>  
> Truncatehints has been running on the nodes for more than 7 hours now. Nothing 
> is mentioned about it in the system logs either.
> 
> And compaction stats reports increase in hints total bytes.
> 
> pending tasks: 1
>    compaction type   keyspace   table     completed          total    unit   progress
>         Compaction     system   hints   12152557998   869257869352   bytes      1.40%
> Active compaction remaining time :   0h27m14s
> 
> Can anything else be checked here? Will manually deleting the system.hints 
> sstable files and restarting the node fix this?
> 
> 
> 
> Regards,
> Varun Saluja
> 
>> On 16 May 2017 at 23:29, varun saluja  wrote:
>> Hi Jeff,
>> 
>> I ran nodetool truncatehints on all nodes. It has been running for more than 30 
>> minutes now, and the compactionstats output reports the same status.
>> 
>> pending tasks: 1
>>    compaction type   keyspace   table     completed          total    unit   progress
>>         Compaction     system   hints   11189118129   851658989612   bytes      1.31%
>> Active compaction remaining time :   0h26m43s
>> 
>> Does truncatehints take time to complete? I could not see anything related to 
>> truncatehints in the system logs.
>> 
>> Please let me know if anything else can be checked here.
>> 
>> Regards,
>> Varun Saluja 
>> 
>> 
>> 
>>> On 16 May 2017 at 20:58, varun saluja  wrote:
>>> Thanks a lot Jeff.
>>> 
>>> You have explaned very well here. We have consitency as local quorum. Will 
>>> follow truncate hints and repair therafter.
>>> 
>>> I hope this brings cluster in stable state
>>> 
>>> Thanks again.
>>> 
>>> Regards,
>>> Varun Saluja
>>> 
>>> Sent from my iPhone
>>> 
>>> > On 16-May-2017, at 8:42 PM, Jeff Jirsa  wrote:
>>> >
>>> >
>>> > In Cassandra versions up to 3.0, hints are stored within a table, where 
>>> > the partition key is the host ID of the server for which the hints are 
>>> > stored.
>>> >
>>> > In such a data model, accumulating 800GB of hints is almost certain to 
>>> > cause very wide rows, which will in turn cause GC pressure when you 
>>> > attempt to read the hints for delivery. This will cause GC pauses, which 
>>> > will cause hints to fail to be delivered, which will cause more hints to 
>>> > be stored. This is bad.
>>> >
>>> > In 3.0, hints were rewritten to work around this design flaw. In 2.1, 
>>> > your most likely corrective course is to use 'nodetool truncatehints' on 
>>> > all servers, followed by 'nodetool repair' to deliver the data you lost 
>>> > by truncating the hints.
>>> >
>>> > NOTE: this is ONLY safe if you wrote with a consistency level stronger 
>>> > than CL:ANY. If you wrote this data with CL:ANY, you may lose data if you 
>>> > truncate hints.
>>> >
>>> > - Jeff
>>> >
>>> >> On 2017-05-16 06:50 (-0700), varun saluja  wrote:
>>> >> Thanks for update.
>>> >> I could see lot of io waits. This causing  Gc and mutation drops .
>>> >> But as i mentioned we do not have high load for now. Hint replays are 
>>> >> creating such high disk I/O.
>>> >> compactionstats show very high hint bytes like 780gb around. Is this 
>>> >> normal?
>>> >>
>>> >> Just mentioning we are using flash disks.
>>> >>
>>> >> In such case, if i run truncatehints , will it remove or decrease size 
>>> >> of hints bytes in compaction stats. I can trigger repair therafter.
>>> >> Please let me know if any recommendation on same.
>>> >>
>>> >> Also , table which we dumped from kafka which created this much hints 
>>> >> and compaction pendings is also dropped today. Because we have to redump 
>>> >> table again once cluster is stable.
>>> >>
>>> >> Regards,
>>> >> Varun
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
>>> >>>
>>> >>> Yes but it means data has to be replicated using repair.
>>> >>>
>>> >>> Hints are out come of unhealthy nodes, focus on finding why you have 
>>> >>> mutation drops, is it node, io or network etc. ideally you shouldn't 
>>> >>> see increasing hints all the time.
>>> >>>
>>> >>> Sent from my iPhone
>>> >>>
>>>  On May 16, 2017, at 7:58 AM, varun saluja  wrote:
>>> 
>>>  Hi Nitan,
>>> 
>>>  Thanks for response.
>>> 
>>>  Yes, I could see mutation drops and increase count in system.hints. Is 
>>>  there any way , i can proceed to truncate hints like using nodetool 
>>>  truncatehints.
>>> 
>>> 
>>>  Regards,
>>>  Varun Saluja
>>> 
>>> > On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>>> > Do you see mutation drops?
>>> > 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
This is a bit of a guess, but it probably reads the sstables in some sequence, 
so even if sstable 2 contains the tombstone, it still scans through sstable 1 
for possible data to read.

BR,
Hannu

> On 16 May 2017, at 19:40, Stefano Ortolani  wrote:
> 
> Little update: also the following query timeouts, which is weird since the 
> range tombstone should have been read by then...
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
> 
> 
> 
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani  > wrote:
> Yes, that was my intention but I wanted to cross-check with the ML and the 
> devs keeping an eye on it first.
> 
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  > wrote:
> Well,
> 
> sstables contain some statistics about the cell timestamps and using that 
> information and the tombstone timestamp it might be possible to skip some 
> data but I’m not sure that Cassandra currently does that. Maybe it would be 
> worth a JIRA ticket and see what the devs think about it. If optimizing this 
> case would make sense.
> 
> Hannu
> 
>> On 16 May 2017, at 18:03, Stefano Ortolani > > wrote:
>> 
>> Hi Hannu,
>> 
>> the piece of data in question is older. In my example the tombstone is the 
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and 
>> the data is clustering key sorted, I would expect a linear scan not to be 
>> necessary.
>> 
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger > > wrote:
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>> bigger regions of deleted data based on range tombstone. If some piece of 
>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>> Therefore some partition level statistics of cell ages would need to be kept 
>> in the column index for the skipping and that is probably not there.
>> 
>> Hannu 
>> 
>>> On 16 May 2017, at 17:33, Stefano Ortolani >> > wrote:
>>> 
>>> That is another way to see the question: are reverse iterators range 
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior. 
>>> I would expect them to handle this case more gracefully.
>>> 
>>> Cheers,
>>> Stefano
>>> 
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth >> > wrote:
>>> Hannu,
>>> 
>>> How can you read a partition in reverse?
>>> 
>>> Sent from my iPhone
>>> 
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger >> > > wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is within 
>>> > the range of the tombstone but is newer than the tombstone and therefore 
>>> > it might be still be returned. Scanning through deleted data can be 
>>> > avoided by reading the partition in reverse (if all the deleted data is 
>>> > in the beginning of the partition). Eventually you will still end up 
>>> > reading a lot of tombstones but you will get a lot of live data first and 
>>> > the implicit query limit of 1 probably is reached before you get to 
>>> > the tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani >> >> > wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence the 
>>> >> message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as 
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>> >> oldest _half_ of that partition by executing the query below, and 
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes more 
>>> >> than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
But it should skip those records since they are sorted. My understanding
would be something like:

1) read sstable 2
2) read the range tombstone
3) skip records from sstable2 and sstable1 within the range boundaries
4) read remaining records from sstable1
5) no records, return
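
If it helps to confirm where the time actually goes, tracing the query from
cqlsh should show how many sstables get touched and how many tombstone cells
get scanned before anything live is returned (a rough check; the LIMIT is only
there to keep the trace small):

TRACING ON;
SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid ASC
LIMIT 100;
TRACING OFF;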

On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger  wrote:

> This is a bit of guessing but it probably reads sstables in some sort of
> sequence, so even if sstable 2 contains the tombstone, it still scans
> through the sstable 1 for possible data to be read.
>
> BR,
> Hannu
>
> On 16 May 2017, at 19:40, Stefano Ortolani  wrote:
>
> Little update: also the following query timeouts, which is weird since the
> range tombstone should have been read by then...
>
> SELECT *
> FROM test_cql.test_cf
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
>
>
>
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani 
> wrote:
>
>> Yes, that was my intention but I wanted to cross-check with the ML and
>> the devs keeping an eye on it first.
>>
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:
>>
>>> Well,
>>>
>>> sstables contain some statistics about the cell timestamps and using
>>> that information and the tombstone timestamp it might be possible to skip
>>> some data but I’m not sure that Cassandra currently does that. Maybe it
>>> would be worth a JIRA ticket and see what the devs think about it. If
>>> optimizing this case would make sense.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>>>
>>> Hi Hannu,
>>>
>>> the piece of data in question is older. In my example the tombstone is
>>> the newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges,
>>> and the data is clustering key sorted, I would expect a linear scan not to
>>> be necessary.
>>>
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>>>
 Well, as mentioned, probably Cassandra doesn’t have logic and data to
 skip bigger regions of deleted data based on range tombstone. If some piece
 of data in a partition is newer than the tombstone, then it cannot be
 skipped. Therefore some partition level statistics of cell ages would need
 to be kept in the column index for the skipping and that is probably not
 there.

 Hannu

 On 16 May 2017, at 17:33, Stefano Ortolani  wrote:

 That is another way to see the question: are reverse iterators range
 tombstone aware? Yes.
 That is why I am puzzled by this afore-mentioned behavior.
 I would expect them to handle this case more gracefully.

 Cheers,
 Stefano

 On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth 
 wrote:

> Hannu,
>
> How can you read a partition in reverse?
>
> Sent from my iPhone
>
> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
> >
> > Well, I’m guessing that Cassandra doesn't really know if the range
> tombstone is useful for this or not.
> >
> > In many cases it might be that the partition contains data that is
> within the range of the tombstone but is newer than the tombstone and
> therefore it might be still be returned. Scanning through deleted data can
> be avoided by reading the partition in reverse (if all the deleted data is
> in the beginning of the partition). Eventually you will still end up
> reading a lot of tombstones but you will get a lot of live data first and
> the implicit query limit of 1 probably is reached before you get to 
> the
> tombstones. Therefore you will get an immediate answer.
> >
> > Does it make sense?
> >
> > Hannu
> >
> >> On 16 May 2017, at 16:33, Stefano Ortolani 
> wrote:
> >>
> >> Hi all,
> >>
> >> I am seeing inconsistencies when mixing range tombstones, wide
> partitions, and reverse iterators.
> >> I still have to understand if the behaviour is to be expected hence
> the message on the mailing list.
> >>
> >> The situation is conceptually simple. I am using a table defined as
> follows:
> >>
> >> CREATE TABLE test_cql.test_cf (
> >>  hash blob,
> >>  timeid timeuuid,
> >>  PRIMARY KEY (hash, timeid)
> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
> >>
> >> I then proceed by loading 2/3GB from 3 sstables which I know
> contain a really wide partition (> 512 MB) for `hash = x`. I then delete
> the oldest _half_ of that partition by executing the query below, and
> restart the node:
> >>
> >> DELETE
> >> FROM test_cql.test_cf
> >> WHERE hash = x AND timeid < y;
> >>
> >> If I keep 

Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi Jeff,

I ran nodetool truncatehints on all nodes. It's been running for more than 30
minutes now, and nodetool compactionstats still reports the same status.

pending tasks: 1
   compaction type   keyspace   table       completed          total    unit   progress
        Compaction     system   hints     11189118129   851658989612   bytes      1.31%
Active compaction remaining time :   0h26m43s

Does truncatehints take time to complete? I could not see anything related to
truncatehints in the system logs.

Please let me know if anything else can be checked here.
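
For reference, the sequence suggested earlier amounts to roughly this on each
node (the keyspace name below is just a placeholder):

nodetool truncatehints       # drop the stored hints on this node
nodetool compactionstats     # check whether the system.hints compaction is still listed
nodetool repair my_keyspace  # afterwards, repair to re-sync the data the hints would have delivered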

Regards,
Varun Saluja



On 16 May 2017 at 20:58, varun saluja  wrote:

> Thanks a lot Jeff.
>
> You have explaned very well here. We have consitency as local quorum. Will
> follow truncate hints and repair therafter.
>
> I hope this brings cluster in stable state
>
> Thanks again.
>
> Regards,
> Varun Saluja
>
> Sent from my iPhone
>
> > On 16-May-2017, at 8:42 PM, Jeff Jirsa  wrote:
> >
> >
> > In Cassandra versions up to 3.0, hints are stored within a table, where
> the partition key is the host ID of the server for which the hints are
> stored.
> >
> > In such a data model, accumulating 800GB of hints is almost certain to
> cause very wide rows, which will in turn cause GC pressure when you attempt
> to read the hints for delivery. This will cause GC pauses, which will cause
> hints to fail to be delivered, which will cause more hints to be stored.
> This is bad.
> >
> > In 3.0, hints were rewritten to work around this design flaw. In 2.1,
> your most likely corrective course is to use 'nodetool truncatehints' on
> all servers, followed by 'nodetool repair' to deliver the data you lost by
> truncating the hints.
> >
> > NOTE: this is ONLY safe if you wrote with a consistency level stronger
> than CL:ANY. If you wrote this data with CL:ANY, you may lose data if you
> truncate hints.
> >
> > - Jeff
> >
> >> On 2017-05-16 06:50 (-0700), varun saluja  wrote:
> >> Thanks for update.
> >> I could see lot of io waits. This causing  Gc and mutation drops .
> >> But as i mentioned we do not have high load for now. Hint replays are
> creating such high disk I/O.
> >> compactionstats show very high hint bytes like 780gb around. Is this
> normal?
> >>
> >> Just mentioning we are using flash disks.
> >>
> >> In such case, if i run truncatehints , will it remove or decrease size
> of hints bytes in compaction stats. I can trigger repair therafter.
> >> Please let me know if any recommendation on same.
> >>
> >> Also , table which we dumped from kafka which created this much hints
> and compaction pendings is also dropped today. Because we have to redump
> table again once cluster is stable.
> >>
> >> Regards,
> >> Varun
> >>
> >> Sent from my iPhone
> >>
> >>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
> >>>
> >>> Yes but it means data has to be replicated using repair.
> >>>
> >>> Hints are out come of unhealthy nodes, focus on finding why you have
> mutation drops, is it node, io or network etc. ideally you shouldn't see
> increasing hints all the time.
> >>>
> >>> Sent from my iPhone
> >>>
>  On May 16, 2017, at 7:58 AM, varun saluja  wrote:
> 
>  Hi Nitan,
> 
>  Thanks for response.
> 
>  Yes, I could see mutation drops and increase count in system.hints.
> Is there any way , i can proceed to truncate hints like using nodetool
> truncatehints.
> 
> 
>  Regards,
>  Varun Saluja
> 
> > On 16 May 2017 at 17:52, Nitan Kainth  wrote:
> > Do you see mutation drops?
> > Select count from system.hints; is it increasing?
> >
> > Sent from my iPhone
> >
> >> On May 16, 2017, at 5:52 AM, varun saluja 
> wrote:
> >>
> >> Hi Experts,
> >>
> >> We are facing issue on production cluster. Compaction on
> system.hint table is running from last 2 days.
> >>
> >>
> >> pending tasks: 1
> >>   compaction type   keyspace   table completed  total
> unit   progress
> >>  Compaction system   hints   20623021829
>  877874092407   bytes  2.35%
> >> Active compaction remaining time :   0h27m15s
> >>
> >>
> >> Active compaction remaining time shows in minutes.  But, this is
> job is running like indefinitely.
> >>
> >> We have 3 node cluster V 2.1.7. And we ran  write intensive job
> last week on particular table.
> >> Compaction on this table finished but hint table size is growing
> continuously.
> >>
> >> Can someone Please help me.
> >>
> >>
> >> Thanks & Regards,
> >> Varun Saluja
> >>
> 
> >>
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Yes, I agree. I would say it cannot skip those cells because it doesn’t check 
the max timestamp of the cells of the sstable and therefore scans them one by 
one.
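
For what it’s worth, those per-sstable timestamp statistics do exist on disk and
can be inspected offline with the sstablemetadata tool (the path below is just a
placeholder); whether the read path actually consults them for this case is the
open question:

sstablemetadata /path/to/test_cf-sstable-Data.db | grep -i timestamp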

Hannu
 
> On 16 May 2017, at 19:48, Stefano Ortolani  wrote:
> 
> But it should skip those records since they are sorted. My understanding 
> would be something like:
> 
> 1) read sstable 2
> 2) read the range tombstone
> 3) skip records from sstable2 and sstable1 within the range boundaries
> 4) read remaining records from sstable1
> 5) no records, return
> 
> On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger  > wrote:
> This is a bit of guessing but it probably reads sstables in some sort of 
> sequence, so even if sstable 2 contains the tombstone, it still scans through 
> the sstable 1 for possible data to be read.
> 
> BR,
> Hannu
> 
>> On 16 May 2017, at 19:40, Stefano Ortolani > > wrote:
>> 
>> Little update: also the following query timeouts, which is weird since the 
>> range tombstone should have been read by then...
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> AND timeid < the_oldest_deleted_timeid
>> ORDER BY timeid DESC;
>> 
>> 
>> 
>> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani > > wrote:
>> Yes, that was my intention but I wanted to cross-check with the ML and the 
>> devs keeping an eye on it first.
>> 
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger > > wrote:
>> Well,
>> 
>> sstables contain some statistics about the cell timestamps and using that 
>> information and the tombstone timestamp it might be possible to skip some 
>> data but I’m not sure that Cassandra currently does that. Maybe it would be 
>> worth a JIRA ticket and see what the devs think about it. If optimizing this 
>> case would make sense.
>> 
>> Hannu
>> 
>>> On 16 May 2017, at 18:03, Stefano Ortolani >> > wrote:
>>> 
>>> Hi Hannu,
>>> 
>>> the piece of data in question is older. In my example the tombstone is the 
>>> newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges, and 
>>> the data is clustering key sorted, I would expect a linear scan not to be 
>>> necessary.
>>> 
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger >> > wrote:
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>>> bigger regions of deleted data based on range tombstone. If some piece of 
>>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>>> Therefore some partition level statistics of cell ages would need to be 
>>> kept in the column index for the skipping and that is probably not there.
>>> 
>>> Hannu 
>>> 
 On 16 May 2017, at 17:33, Stefano Ortolani > wrote:
 
 That is another way to see the question: are reverse iterators range 
 tombstone aware? Yes.
 That is why I am puzzled by this afore-mentioned behavior. 
 I would expect them to handle this case more gracefully.
 
 Cheers,
 Stefano
 
 On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth > wrote:
 Hannu,
 
 How can you read a partition in reverse?
 
 Sent from my iPhone
 
 > On May 16, 2017, at 9:20 AM, Hannu Kröger  > wrote:
 >
 > Well, I’m guessing that Cassandra doesn't really know if the range 
 > tombstone is useful for this or not.
 >
 > In many cases it might be that the partition contains data that is 
 > within the range of the tombstone but is newer than the tombstone and 
 > therefore it might be still be returned. Scanning through deleted data 
 > can be avoided by reading the partition in reverse (if all the deleted 
 > data is in the beginning of the partition). Eventually you will still 
 > end up reading a lot of tombstones but you will get a lot of live data 
 > first and the implicit query limit of 1 probably is reached before 
 > you get to the tombstones. Therefore you will get an immediate answer.
 >
 > Does it make sense?
 >
 > Hannu
 >
 >> On 16 May 2017, at 16:33, Stefano Ortolani > > wrote:
 >>
 >> Hi all,
 >>
 >> I am seeing inconsistencies when mixing range tombstones, wide 
 >> partitions, and reverse iterators.
 >> I still have to understand if the behaviour is to be expected hence the 
 >> message on the mailing list.
 >>
 >> The situation is conceptually simple. I am using a table defined as 
 >> follows:
 >>
 >> CREATE TABLE 

RE: Decommissioned node cluster shows as down

2017-05-16 Thread Mark Furlong
I thought the same, that the decommission would complete the removal of the 
node. I have heard something about a 72-hour window; I'm not sure whether that 
pertains to this version.

Thanks
Mark
801-705-7115 office

From: Hannu Kröger [mailto:hkro...@gmail.com]
Sent: Tuesday, May 16, 2017 10:09 AM
To: suraj pasuparthy 
Cc: Mark Furlong ; user@cassandra.apache.org
Subject: Re: Decommissioned node cluster shows as down

That’s weird. I thought decommission would ultimately remove the node from the 
cluster because the token(s) should be removed from the ring and data should be 
streamed to new owners. “DN” is IMHO not a state where the node should end up 
in.

Hannu

On 16 May 2017, at 19:05, suraj pasuparthy 
> wrote:

Yes, you have to run a nodetool removenode to decomission completely.. this 
will also allow another node with the same ip different HashId to join the 
cluster..

Thanks
-suraj
On Tue, May 16, 2017 at 9:01 AM Mark Furlong 
> wrote:

I have a node I decommissioned on a large ring using 2.1.12. The node completed
the decommission process and is no longer communicating with the rest of the
cluster. However when I run a nodetool status on any node in the cluster it
shows the node as ‘DN’. Why is this and should I just run a removenode now?

Thanks,
Mark Furlong
Sr. Database Administrator
mfurl...@ancestry.com
M: 801-859-7427
O: 801-705-7115
1300 W Traverse Pkwy
Lehi, UT 84043


Re: RE: Decommissioned node cluster shows as down

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 09:28 (-0700), Mark Furlong  wrote: 
> I thought the same that the decommission would complete the removal of a 
> node. I have heard something said about a 72 hour window, I’m not sure if 
> that pertains to this version.
> 

We keep a record of it in gossip for 72 hours (in a special status to indicate 
that the node left the ring), just in case we had a host that was offline 
during the decommission come back to life and still have a record of that 
now-removed host in its saved system tables. It'll be in gossip, but it 
shouldn't be in 'nodetool ring' or 'nodetool status' output.
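
If you want to double-check that, something along these lines on a live node
should show the old host only in gossip (typically with a LEFT status) and
absent from the status output; the IP below is a placeholder:

nodetool status                              # the decommissioned host should no longer be listed
nodetool gossipinfo | grep -A 5 '/10.0.0.1'  # expect a STATUS:LEFT,... entry for the old host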





-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Read timeouts

2017-05-16 Thread Nitan Kainth
Thank you Jeff.

We are on Cassandra 3.0.10.

We will look into upgrading or enabling slow-query logging in the driver.

> On May 16, 2017, at 11:44 AM, Jeff Jirsa  wrote:
> 
> 
> 
> On 2017-05-16 08:53 (-0700), Nitan Kainth  wrote: 
>> Hi,
>> 
>> We see read timeouts intermittently. Mostly after they have occurred. 
>> Timeouts are not consistent and does not occur in 100s at a moment. 
>> 
>> 1. Does read timeout considered as Dropped Mutation?
> 
> No, a dropped mutation is a failed write, not a failed read.
> 
>> 2. What is best way to nail down exact issue of scattered timeouts?
>> 
> 
> First, be aware that tombstone overwhelming exceptions also get propagated as 
> read timeouts - you should check your logs for warnings about tombstone 
> problems.
> 
> Second, you need to identify the slow queries somehow. You have a few options:
> 
> 1) If you happen to be running 3.10 or newer , turn on the slow query log ( 
> https://issues.apache.org/jira/browse/CASSANDRA-12403 ) . 3.10 is the newest 
> release, and may not be fully stable, so you probably don't want to upgrade 
> to 3.10 JUST to get this feature. But if you're already on that version, 
> definitely use that tool.
> 
> 2) Some drivers have a log-slow-queries feature. Consider turning that on, 
> and let the application side log the slow queries. It's possible that you 
> have a bad partition or two, and you may see patterns there.
> 
> 3) Probabilistic tracing - you can tell cassandra to trace 1% of your 
> queries, and hope you catch a timeout. It'll be unpleasant to track alone - 
> this is really a last-resort type option, because you'll need to dig through 
> that trace table to find the outliers after the fact.
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Little update: the following query also times out, which is weird since the
range tombstone should have been read by then...

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
AND timeid < the_oldest_deleted_timeid
ORDER BY timeid DESC;



On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani 
wrote:

> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:
>
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth a JIRA ticket and see what the devs think about it. If optimizing
>> this case would make sense.
>>
>> Hannu
>>
>> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is
>> the newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>>
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>> of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition level statistics of cell ages would need
>>> to be kept in the column index for the skipping and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>>>
 Hannu,

 How can you read a partition in reverse?

 Sent from my iPhone

 > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
 >
 > Well, I’m guessing that Cassandra doesn't really know if the range
 tombstone is useful for this or not.
 >
 > In many cases it might be that the partition contains data that is
 within the range of the tombstone but is newer than the tombstone and
 therefore it might be still be returned. Scanning through deleted data can
 be avoided by reading the partition in reverse (if all the deleted data is
 in the beginning of the partition). Eventually you will still end up
 reading a lot of tombstones but you will get a lot of live data first and
 the implicit query limit of 1 probably is reached before you get to the
 tombstones. Therefore you will get an immediate answer.
 >
 > Does it make sense?
 >
 > Hannu
 >
 >> On 16 May 2017, at 16:33, Stefano Ortolani 
 wrote:
 >>
 >> Hi all,
 >>
 >> I am seeing inconsistencies when mixing range tombstones, wide
 partitions, and reverse iterators.
 >> I still have to understand if the behaviour is to be expected hence
 the message on the mailing list.
 >>
 >> The situation is conceptually simple. I am using a table defined as
 follows:
 >>
 >> CREATE TABLE test_cql.test_cf (
 >>  hash blob,
 >>  timeid timeuuid,
 >>  PRIMARY KEY (hash, timeid)
 >> ) WITH CLUSTERING ORDER BY (timeid ASC)
 >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
 >>
 >> I then proceed by loading 2/3GB from 3 sstables which I know contain
 a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
 _half_ of that partition by executing the query below, and restart the 
 node:
 >>
 >> DELETE
 >> FROM test_cql.test_cf
 >> WHERE hash = x AND timeid < y;
 >>
 >> If I keep compactions disabled the following query timeouts (takes
 more than 10 seconds to
 >> succeed):
 >>
 >> SELECT *
 >> FROM test_cql.test_cf
 >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
 >> ORDER BY timeid ASC;
 >>
 >> While the following returns immediately (obviously because no
 deleted data is ever read):
 >>
 >> SELECT *
 >> FROM test_cql.test_cf
 >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
 >> ORDER BY timeid DESC;
 >>
 >> If I force a compaction the problem is gone, but I presume just
 because the data is rearranged.
 >>
 >> It seems to me that reading by ASC does not make use of the range
 tombstone until C* reads the
 >> last sstables (which actually 

Re: Read timeouts

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 08:53 (-0700), Nitan Kainth  wrote: 
> Hi,
> 
> We see read timeouts intermittently. Mostly after they have occurred. 
> Timeouts are not consistent and does not occur in 100s at a moment. 
> 
> 1. Does read timeout considered as Dropped Mutation?

No, a dropped mutation is a failed write, not a failed read.

> 2. What is best way to nail down exact issue of scattered timeouts?
> 

First, be aware that tombstone overwhelming exceptions also get propagated as 
read timeouts - you should check your logs for warnings about tombstone 
problems.

Second, you need to identify the slow queries somehow. You have a few options:

1) If you happen to be running 3.10 or newer , turn on the slow query log ( 
https://issues.apache.org/jira/browse/CASSANDRA-12403 ) . 3.10 is the newest 
release, and may not be fully stable, so you probably don't want to upgrade to 
3.10 JUST to get this feature. But if you're already on that version, 
definitely use that tool.

2) Some drivers have a log-slow-queries feature. Consider turning that on, and 
let the application side log the slow queries. It's possible that you have a 
bad partition or two, and you may see patterns there.

3) Probabilistic tracing - you can tell cassandra to trace 1% of your queries, 
and hope you catch a timeout. It'll be unpleasant to track alone - this is 
really a last-resort type option, because you'll need to dig through that trace 
table to find the outliers after the fact.
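
For option 3, the knob is exposed through nodetool; something along these lines,
where 0.01 (1% of requests) is just an example value:

nodetool settraceprobability 0.01   # start sampling traces
# ...wait for a few timeouts to occur, then turn sampling back off:
nodetool settraceprobability 0
# the sampled traces land in the system_traces keyspace; the sessions table has
# a duration column (in microseconds) you can use to spot the slow outliers.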



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Decommissioned node cluster shows as down

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 09:08 (-0700), Hannu Kröger  wrote: 
> That’s weird. I thought decommission would ultimately remove the node from 
> the cluster because the token(s) should be removed from the ring and data 
> should be streamed to new owners. “DN” is IMHO not a state where the node 
> should end up in. 
> 

Decommission should remove the node from the cluster, it should not be in "DN" 
if decommission finished successfully.
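
If it does still show up as DN even though the decommission completed, the usual
cleanup is removenode with the host ID of the dead entry (the host ID below is a
placeholder):

nodetool status              # note the Host ID of the 'DN' entry
nodetool removenode 11111111-2222-3333-4444-555555555555
nodetool removenode status   # check progress if the removal takes a while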



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Thanks for the update.
I can see a lot of I/O waits, and this is causing the GC pauses and mutation drops.
But as I mentioned, we do not have a high load right now; the hint replays are 
creating this high disk I/O.
compactionstats shows a very large hints total, around 780 GB. Is this normal?

Just to mention, we are using flash disks.

In that case, if I run truncatehints, will it remove or reduce the hints bytes 
shown in compactionstats? I can trigger a repair thereafter.
Please let me know if you have any recommendation on this.
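
If it helps to quantify that, the drops and the disk waits can be watched with
standard tools while the hint replay is running:

nodetool tpstats   # the dropped-message section at the bottom shows the MUTATION drop counter
iostat -x 5 3      # watch await and %util on the data disks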

Also, the table we dumped from Kafka, the one that created all these hints and 
pending compactions, was dropped today, because we will have to re-dump the 
table once the cluster is stable.

Regards,
Varun

Sent from my iPhone

> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
> 
> Yes but it means data has to be replicated using repair.
> 
> Hints are out come of unhealthy nodes, focus on finding why you have mutation 
> drops, is it node, io or network etc. ideally you shouldn't see increasing 
> hints all the time.
> 
> Sent from my iPhone
> 
>> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
>> 
>> Hi Nitan,
>> 
>> Thanks for response.
>> 
>> Yes, I could see mutation drops and increase count in system.hints. Is there 
>> any way , i can proceed to truncate hints like using nodetool truncatehints.
>> 
>> 
>> Regards,
>> Varun Saluja
>> 
>>> On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>>> Do you see mutation drops?
>>> Select count from system.hints; is it increasing?
>>> 
>>> Sent from my iPhone
>>> 
 On May 16, 2017, at 5:52 AM, varun saluja  wrote:
 
 Hi Experts,
 
 We are facing issue on production cluster. Compaction on system.hint table 
 is running from last 2 days.
 
 
 pending tasks: 1
compaction type   keyspace   table completed  total 
  unit   progress
   Compaction system   hints   20623021829   877874092407   
 bytes  2.35%
 Active compaction remaining time :   0h27m15s
 
 
 Active compaction remaining time shows in minutes.  But, this is job is 
 running like indefinitely.
 
 We have 3 node cluster V 2.1.7. And we ran  write intensive job last week 
 on particular table.
 Compaction on this table finished but hint table size is growing 
 continuously.
 
 Can someone Please help me.
 
 
 Thanks & Regards,
 Varun Saluja
 
>> 


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
If the data is stored in ASC order and the query asks for DESC, then wouldn't it 
read the whole partition first and then pick the data in reverse order?


> On May 16, 2017, at 10:03 AM, Stefano Ortolani  wrote:
> 
> Hi Hannu,
> 
> the piece of data in question is older. In my example the tombstone is the 
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the 
> data is clustering key sorted, I would expect a linear scan not to be 
> necessary.
> 
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  > wrote:
> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
> bigger regions of deleted data based on range tombstone. If some piece of 
> data in a partition is newer than the tombstone, then it cannot be skipped. 
> Therefore some partition level statistics of cell ages would need to be kept 
> in the column index for the skipping and that is probably not there.
> 
> Hannu 
> 
>> On 16 May 2017, at 17:33, Stefano Ortolani > > wrote:
>> 
>> That is another way to see the question: are reverse iterators range 
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior. 
>> I would expect them to handle this case more gracefully.
>> 
>> Cheers,
>> Stefano
>> 
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth > > wrote:
>> Hannu,
>> 
>> How can you read a partition in reverse?
>> 
>> Sent from my iPhone
>> 
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger > > > wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>> > tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is within 
>> > the range of the tombstone but is newer than the tombstone and therefore 
>> > it might be still be returned. Scanning through deleted data can be 
>> > avoided by reading the partition in reverse (if all the deleted data is in 
>> > the beginning of the partition). Eventually you will still end up reading 
>> > a lot of tombstones but you will get a lot of live data first and the 
>> > implicit query limit of 1 probably is reached before you get to the 
>> > tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani > >> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>> >> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence the 
>> >> message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as 
>> >> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>> >> _half_ of that partition by executing the query below, and restart the 
>> >> node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes more 
>> >> than 10 seconds to
>> >> succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted 
>> >> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just because 
>> >> the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range 
>> >> tombstone until C* reads the
>> >> last sstables (which actually contains the range tombstone and is flushed 
>> >> at node restart), and it wastes time reading all rows that are actually 
>> >> not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
>> > 
>> > For additional commands, e-mail: user-h...@cassandra.apache.org 
>> > 
>> >
>> 
> 
> 



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
No, because C* has reverse iterators.

On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth  wrote:

> If the data is stored in ASC order and query asks for DESC, then wouldn’t
> it read whole partition in first and then pick data from reverse order?
>
>
> On May 16, 2017, at 10:03 AM, Stefano Ortolani  wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>> skip bigger regions of deleted data based on range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>>
>> Hannu
>>
>> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>>
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> Sent from my iPhone
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might be still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 1 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani 
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes
>>> more than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just
>>> because the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is
>>> flushed at node restart), and it wastes time reading all rows that are
>>> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>>> >
>>>
>>
>>
>>
>
>


Re: Decommissioned node cluster shows as down

2017-05-16 Thread suraj pasuparthy
Yes, you have to run nodetool removenode to decommission completely. This
will also allow another node with the same IP but a different host ID to join
the cluster.

Thanks
-suraj
On Tue, May 16, 2017 at 9:01 AM Mark Furlong  wrote:

> I have a node I decommissioned on a large ring using 2.1.12. The node
> completed the decommission process and is no longer communicating with the
> rest of the cluster. However when I run a nodetool status on any node in
> the cluster it shows the node as ‘DN’. Why is this and should I just run a
> removenode now?
>
> Thanks,
> Mark Furlong
> Sr. Database Administrator
> mfurl...@ancestry.com
> M: 801-859-7427
> O: 801-705-7115
> 1300 W Traverse Pkwy
> Lehi, UT 84043


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well,

sstables contain some statistics about the cell timestamps, and using that 
information together with the tombstone timestamp it might be possible to skip 
some data, but I’m not sure that Cassandra currently does that. Maybe it would 
be worth a JIRA ticket to see what the devs think about it, and whether 
optimizing this case would make sense.

Hannu

> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
> 
> Hi Hannu,
> 
> the piece of data in question is older. In my example the tombstone is the 
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the 
> data is clustering key sorted, I would expect a linear scan not to be 
> necessary.
> 
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  > wrote:
> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
> bigger regions of deleted data based on range tombstone. If some piece of 
> data in a partition is newer than the tombstone, then it cannot be skipped. 
> Therefore some partition level statistics of cell ages would need to be kept 
> in the column index for the skipping and that is probably not there.
> 
> Hannu 
> 
>> On 16 May 2017, at 17:33, Stefano Ortolani > > wrote:
>> 
>> That is another way to see the question: are reverse iterators range 
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior. 
>> I would expect them to handle this case more gracefully.
>> 
>> Cheers,
>> Stefano
>> 
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth > > wrote:
>> Hannu,
>> 
>> How can you read a partition in reverse?
>> 
>> Sent from my iPhone
>> 
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger > > > wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>> > tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is within 
>> > the range of the tombstone but is newer than the tombstone and therefore 
>> > it might be still be returned. Scanning through deleted data can be 
>> > avoided by reading the partition in reverse (if all the deleted data is in 
>> > the beginning of the partition). Eventually you will still end up reading 
>> > a lot of tombstones but you will get a lot of live data first and the 
>> > implicit query limit of 1 probably is reached before you get to the 
>> > tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani > >> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>> >> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence the 
>> >> message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as 
>> >> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>> >> _half_ of that partition by executing the query below, and restart the 
>> >> node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes more 
>> >> than 10 seconds to
>> >> succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted 
>> >> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just because 
>> >> the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range 
>> >> tombstone until C* reads the
>> >> last sstables (which actually contains the range tombstone and is flushed 
>> >> at node restart), and it wastes time reading all rows that are actually 
>> >> not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
>> > 
>> > For additional commands, e-mail: 

Re: Long running compaction on huge hint table.

2017-05-16 Thread Nitan Kainth
Do you see mutation drops?
Select count from system.hints; is it increasing?

Sent from my iPhone

> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
> 
> Hi Experts,
> 
> We are facing issue on production cluster. Compaction on system.hint table is 
> running from last 2 days.
> 
> 
> pending tasks: 1
>compaction type   keyspace   table completed  total
>   unit   progress
>   Compaction system   hints   20623021829   877874092407   
> bytes  2.35%
> Active compaction remaining time :   0h27m15s
> 
> 
> Active compaction remaining time shows in minutes.  But, this is job is 
> running like indefinitely.
> 
> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week on 
> particular table.
> Compaction on this table finished but hint table size is growing continuously.
> 
> Can someone Please help me.
> 
> 
> Thanks & Regards,
> Varun Saluja


Re: Reg:- DSE 5.1.0 Issue

2017-05-16 Thread DuyHai Doan
Nandan

Since you have asked questions about DSE on this OSS mailing list many times,
I suggest you contact DataStax directly if you're using their enterprise
edition. Every DataStax customer has access to their support. If you're a
sub-contractor for a final customer that is using DSE, ask your customer to
get you this support access. On this OSS mailing list we cannot answer
questions related to a commercial product.



On Tue, May 16, 2017 at 1:07 PM, Hannu Kröger  wrote:

> Hello,
>
> DataStax is probably more than happy answer your particaly DataStax
> Enterprise related questions here (I don’t know if that is 100% right place
> but…):
> https://support.datastax.com/hc/en-us
>
> This mailing list is for open source Cassandra and DSE issues are mostly
> out of the scope here. HADOOP is one of DSE-only features.
>
> Cheers,
> Hannu
>
> On 16 May 2017, at 14:01, @Nandan@  wrote:
>
> Hi ,
> Sorry in Advance if I am posting here .
>
> I stuck in some particular steps.
>
> I was using DSE 4.8 on Single DC with 3 nodes. Today I upgraded my all 3
> nodes to DSE 5.1
> Issue is when I am trying to start SERVICE DSE RESTART i am getting error
> message as
>
> Hadoop functionality has been removed from DSE.
> Please try again without the HADOOP_ENABLED set in /etc/default/dse.
>
> Even in /etc/default//dse file , HADOOP_ENABLED is set as 0 .
>
> For testing ,Once I changed my HADOOP_ENABLED = 1 ,
>
> I  am getting error as
>
> Found multiple DSE core jar files in /usr/share/dse/lib
> /usr/share/dse/resources/dse/lib /usr/share/dse /usr/share/dse/common .
> Please make sure there is only one.
>
> I searched so many article , but till now not able to find the solution.
> Please help me to get out of this mess.
>
> Thanks and Best Regards,
> Nandan Priyadarshi.
>
>
>


Bootstraping a Node With a Newer Version

2017-05-16 Thread Shalom Sagges
Hi All,

Hypothetically speaking, let's say I want to upgrade my Cassandra cluster,
but I also want to perform a major upgrade to the kernel of all nodes.
In order to upgrade the kernel, I need to reinstall the server, hence lose
all data on the node.

My question is this: after reinstalling the server with the new kernel, can I
first install the upgraded Cassandra version and then bootstrap it into the
cluster?

Since there's already no data on the node, I wish to skip the agonizing
sstable upgrade process.

Does anyone know if this is doable?

Thanks!



Shalom Sagges
DBA
T: +972-74-700-4035
 
 We Create Meaningful Connections

-- 
This message may contain confidential and/or privileged information. 
If you are not the addressee or authorized to receive this on behalf of the 
addressee you must not use, copy, disclose or take action based on this 
message or any information herein. 
If you have received this message in error, please advise the sender 
immediately by reply email and delete this message. Thank you.


Re: Long running compaction on huge hint table.

2017-05-16 Thread Jason Brown
Varun,

This is a message better suited for the user@ ML.

Thanks,

-Jason

On Tue, May 16, 2017 at 3:41 AM, varun saluja  wrote:

> Hi Experts,
>
> We are facing issue on production cluster. Compaction on system.hint table
> is running from last 2 days.
>
>
> pending tasks: 1
>compaction type   keyspace   table completed  total
>   unit   progress
>   Compaction system   hints   20623021829   877874092407
>  bytes  2.35%
> Active compaction remaining time :   0h27m15s
>
>
> Active compaction remaining time shows in minutes.  But, this is job is
> running like indefinitely.
>
> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week
> on particular table.
> Compaction on this table finished but hint table size is growing
> continuously.
>
> Can someone Please help me.
>
>
> Thanks & Regards,
> Varun Saluja
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Bootstraping a Node With a Newer Version

2017-05-16 Thread Mateusz Korniak
On Tuesday 16 of May 2017 15:27:11 Shalom Sagges wrote:
> My question is this, after reinstalling the server with the new kernel, can
> I first install the upgraded Cassandra version and then bootstrap it to the
> cluster?
No.

Bootstrap and repair may not (and generally will not) work between nodes 
running different major versions.


Regards,
-- 
Mateusz Korniak
"(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
krótko mówiąc - podpora społeczeństwa."
Nikos Kazantzakis - "Grek Zorba"


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi Experts,

We are facing an issue on our production cluster. A compaction on the
system.hints table has been running for the last 2 days.


pending tasks: 1
   compaction type   keyspace   table       completed            total    unit   progress
        Compaction     system   hints     20623021829   *877874092407*   bytes      2.35%
Active compaction remaining time :   0h27m15s


Active compaction remaining time shows minutes, but this job seems to be
running indefinitely.

We have a 3-node cluster on version 2.1.7, and we ran a write-intensive job
on a particular table last week.
Compaction on that table finished, but the hints table size is growing
continuously.

Can someone please help me?


Thanks & Regards,
Varun Saluja


Reg:- DSE 5.1.0 Issue

2017-05-16 Thread @Nandan@
Hi,
Sorry in advance if this is the wrong place to post.

I am stuck at a particular step.

I was using DSE 4.8 on a single DC with 3 nodes. Today I upgraded all 3
nodes to DSE 5.1.
The issue is that when I try to run 'service dse restart', I get the error
message:

Hadoop functionality has been removed from DSE.
Please try again without the HADOOP_ENABLED set in /etc/default/dse.

Even in the /etc/default/dse file, HADOOP_ENABLED is set to 0.

For testing, once I changed HADOOP_ENABLED to 1,

I got this error:

Found multiple DSE core jar files in /usr/share/dse/lib
/usr/share/dse/resources/dse/lib /usr/share/dse /usr/share/dse/common .
Please make sure there is only one.

I have searched many articles, but so far I have not been able to find a
solution. Please help me get out of this mess.

Thanks and Best Regards,
Nandan Priyadarshi.


Re: Reg:- DSE 5.1.0 Issue

2017-05-16 Thread Hannu Kröger
Hello,

DataStax is probably more than happy to answer your DataStax Enterprise 
related questions here (I don’t know if that is 100% the right place, but…):
https://support.datastax.com/hc/en-us 

This mailing list is for open-source Cassandra, and DSE issues are mostly out 
of scope here. The Hadoop integration is one of the DSE-only features.

Cheers,
Hannu

> On 16 May 2017, at 14:01, @Nandan@  wrote:
> 
> Hi ,
> Sorry in Advance if I am posting here .
> 
> I stuck in some particular steps. 
> 
> I was using DSE 4.8 on Single DC with 3 nodes. Today I upgraded my all 3 
> nodes to DSE 5.1
> Issue is when I am trying to start SERVICE DSE RESTART i am getting error 
> message as 
> 
> Hadoop functionality has been removed from DSE.
> Please try again without the HADOOP_ENABLED set in /etc/default/dse.
> 
> Even in /etc/default//dse file , HADOOP_ENABLED is set as 0 . 
> 
> For testing ,Once I changed my HADOOP_ENABLED = 1 , 
> 
> I  am getting error as 
> 
> Found multiple DSE core jar files in /usr/share/dse/lib 
> /usr/share/dse/resources/dse/lib /usr/share/dse /usr/share/dse/common . 
> Please make sure there is only one.
> 
> I searched so many article , but till now not able to find the solution. 
> Please help me to get out of this mess. 
> 
> Thanks and Best Regards,
> Nandan Priyadarshi.



Re: LCS, range tombstones, and eviction

2017-05-16 Thread Stefano Ortolani
That makes sense.
I am seeing some unexpected performance numbers in my test, however, so I will
start another thread for that.

Thanks again!
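
As a side note, the thresholds discussed below are per-table compaction
sub-properties, so they can be tuned with a plain ALTER TABLE; the table name
and values here are only illustrative, not recommendations:

-- tombstone_threshold: droppable-tombstone ratio that makes a single-sstable compaction worthwhile
-- tombstone_compaction_interval: minimum sstable age (in seconds) before such a compaction is considered
-- unchecked_tombstone_compaction: if 'true', run it even without checking overlaps with other sstables
ALTER TABLE test_cql.test_cf
WITH compaction = {'class' : 'LeveledCompactionStrategy',
                   'tombstone_threshold' : '0.2',
                   'tombstone_compaction_interval' : '86400',
                   'unchecked_tombstone_compaction' : 'false'};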

On Fri, May 12, 2017 at 6:56 PM, Blake Eggleston 
wrote:

> The start and end points of a range tombstone are basically stored as
> special purpose rows alongside the normal data in an sstable. As part of a
> read, they're reconciled with the data from the other sstables into a
> single partition, just like the other rows. The only difference is that
> they don't contain any 'real' data, and, of course, they prevent 'deleted'
> data from being returned in the read. It's a bit more complicated than
> that, but that's the general idea.
>
>
> On May 12, 2017 at 6:23:01 AM, Stefano Ortolani (ostef...@gmail.com)
> wrote:
>
> Thanks a lot Blake, that definitely helps!
>
> I actually found a ticket re range tombstones and how they are accounted
> for: https://issues.apache.org/jira/browse/CASSANDRA-8527
>
> I am wondering now what happens when a node receives a read request. Are
> the range tombstones read before scanning the SStables? More interestingly,
> given that a single partition might be split across different levels, and
> that some range tombstones might be in L0 while all the rest of the data in
> L1, are all the tombstones prefetched from _all_ the involved SStables
> before doing any table scan?
>
> Regards,
> Stefano
>
> On Thu, May 11, 2017 at 7:58 PM, Blake Eggleston 
> wrote:
>
>> Hi Stefano,
>>
>> Based on what I understood reading the docs, if the ratio of garbage
>> collectable tomstones exceeds the "tombstone_threshold", C* should start
>> compacting and evicting.
>>
>>
>> If there are no other normal compaction tasks to be run, LCS will attempt
>> to compact the sstables it estimates it will be able to drop the most
>> tombstones from. It does this by estimating the number of tombstones an
>> sstable has that have passed the gc grace period. Whether or not a
>> tombstone will actually be evicted is more complicated. Even if a tombstone
>> has passed gc grace, it can't be dropped if the data it's deleting still
>> exists in another sstable, otherwise the data would appear to return. So, a
>> tombstone won't be dropped if there is data for the same partition in other
>> sstables that is older than the tombstone being evaluated for eviction.
>>
>> I am quite puzzled however by what might happen when dealing with range
>> tombstones. In that case a single tombstone might actually stand for an
>> arbitrary number of normal tombstones. In other words, do range
>> tombstones
>> contribute to the "tombstone_threshold"? If so, how?
>>
>>
>> From what I can tell, each end of the range tombstone is counted as a
>> single tombstone tombstone. So a range tombstone effectively contributes
>> '2' to the count of tombstones for an sstable. I'm not 100% sure, but I
>> haven't seen any sstable writing logic that tracks open tombstones and
>> counts covered cells as tombstones. So, it's likely that the effect of
>> range tombstones covering many rows are under represented in the droppable
>> tombstone estimate.
>>
>> I am also a bit confused by the "tombstone_compaction_interval". If I am
>> dealing with a big partition in LCS which is receiving new records every
>> day,
>> and a weekly incremental repair job continously anticompacting the data
>> and
>> thus creating SStables, what is the likelhood of the default interval
>> (10 days) to be actually hit?
>>
>>
>> It will be hit, but probably only in the repaired data. Once the data is
>> marked repaired, it shouldn't be anticompacted again, and should get old
>> enough to pass the compaction interval. That shouldn't be an issue though,
>> because you should be running repair often enough that data is repaired
>> before it can ever get past the gc grace period. Otherwise you'll have
>> other problems. Also, keep in mind that tombstone eviction is a part of all
>> compactions, it's just that occasionally a compaction is run specifically
>> for that purpose. Finally, you probably shouldn't run incremental repair on
>> data that is deleted. There is a design flaw in the incremental repair used
>> in pre-4.0 of cassandra that can cause consistency issues. It can also
>> cause a *lot* of over streaming, so you might want to take a look at how
>> much streaming your cluster is doing with full repairs, and incremental
>> repairs. It might actually be more efficient to run full repairs.
>>
>> Hope that helps,
>>
>> Blake
>>
>> On May 11, 2017 at 7:16:26 AM, Stefano Ortolani (ostef...@gmail.com)
>> wrote:
>>
>> Hi all,
>>
>> I am trying to wrap my head around how C* evicts tombstones when using
>> LCS.
>> Based on what I understood reading the docs, if the ratio of garbage
>> collectable tomstones exceeds the "tombstone_threshold", C* should start
>> compacting and evicting.
>>
>> I am quite puzzled however by what might happen when dealing with range
>> tombstones. In that case a single 

Re: Long running compaction on huge hint table.

2017-05-16 Thread Nitan Kainth
If the target table is dropped then you can remove its hints, but there could be 
more hints for other tables. If they include tables you care about, then I won't 
comment on truncating hints.

The size of the hints depends on the Kafka load; it looks like you overloaded the cluster 
during the data load and now the hints are just recovering from it. I would say wait 
until the cluster comes back to a normal state. Maybe some other expert can suggest an 
alternative.

Sent from my iPhone

> On May 16, 2017, at 8:50 AM, varun saluja  wrote:
> 
> Thanks for update.
> I could see lot of io waits. This causing  Gc and mutation drops .
> But as i mentioned we do not have high load for now. Hint replays are 
> creating such high disk I/O.
> compactionstats show very high hint bytes like 780gb around. Is this normal?
> 
> Just mentioning we are using flash disks.
> 
> In such case, if i run truncatehints , will it remove or decrease size of 
> hints bytes in compaction stats. I can trigger repair therafter.
> Please let me know if any recommendation on same.
> 
> Also , table which we dumped from kafka which created this much hints and 
> compaction pendings is also dropped today. Because we have to redump table 
> again once cluster is stable.
> 
> Regards,
> Varun
> 
> Sent from my iPhone
> 
>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
>> 
>> Yes but it means data has to be replicated using repair.
>> 
>> Hints are out come of unhealthy nodes, focus on finding why you have 
>> mutation drops, is it node, io or network etc. ideally you shouldn't see 
>> increasing hints all the time.
>> 
>> Sent from my iPhone
>> 
>>> On May 16, 2017, at 7:58 AM, varun saluja  wrote:
>>> 
>>> Hi Nitan,
>>> 
>>> Thanks for response.
>>> 
>>> Yes, I could see mutation drops and increase count in system.hints. Is 
>>> there any way , i can proceed to truncate hints like using nodetool 
>>> truncatehints.
>>> 
>>> 
>>> Regards,
>>> Varun Saluja
>>> 
 On 16 May 2017 at 17:52, Nitan Kainth  wrote:
 Do you see mutation drops?
 Select count from system.hints; is it increasing?
 
 Sent from my iPhone
 
> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
> 
> Hi Experts,
> 
> We are facing issue on production cluster. Compaction on system.hint 
> table is running from last 2 days.
> 
> 
> pending tasks: 1
>compaction type   keyspace   table completed  total
>   unit   progress
>   Compaction system   hints   20623021829   877874092407  
>  bytes  2.35%
> Active compaction remaining time :   0h27m15s
> 
> 
> Active compaction remaining time shows in minutes.  But, this is job is 
> running like indefinitely.
> 
> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week 
> on particular table.
> Compaction on this table finished but hint table size is growing 
> continuously.
> 
> Can someone Please help me.
> 
> 
> Thanks & Regards,
> Varun Saluja
> 
>>> 


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well, I’m guessing that Cassandra doesn't really know if the range tombstone is 
useful for this or not. 

In many cases it might be that the partition contains data that is within the
range of the tombstone but is newer than the tombstone and therefore it might
still be returned. Scanning through deleted data can be avoided by reading
the partition in reverse (if all the deleted data is in the beginning of the
partition). Eventually you will still end up reading a lot of tombstones, but
you will get a lot of live data first, and the implicit query limit of 10,000
is probably reached before you get to the tombstones. Therefore you will get an
immediate answer.

Does it make sense?
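
(If you want to verify what each direction is actually reading, tracing the query in cqlsh
shows which sstables are touched and how many live/tombstone cells each read returns. A rough
sketch, reusing the table and partition key from your mail:

TRACING ON;
SELECT * FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
ORDER BY timeid ASC LIMIT 100;
TRACING OFF;

Running the same statement with DESC for comparison should make the difference visible in the
trace output, both in the cells read and in where the time goes.)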

Hannu

> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
> 
> Hi all,
> 
> I am seeing inconsistencies when mixing range tombstones, wide partitions, 
> and reverse iterators.
> I still have to understand if the behaviour is to be expected hence the 
> message on the mailing list.
> 
> The situation is conceptually simple. I am using a table defined as follows:
> 
> CREATE TABLE test_cql.test_cf (
>   hash blob,
>   timeid timeuuid,
>   PRIMARY KEY (hash, timeid)
> ) WITH CLUSTERING ORDER BY (timeid ASC)
>   AND compaction = {'class' : 'LeveledCompactionStrategy'};
> 
> I then proceed by loading 2/3GB from 3 sstables which I know contain a really 
> wide partition (> 512 MB) for `hash = x`. I then delete the oldest _half_ of 
> that partition by executing the query below, and restart the node:
> 
> DELETE 
> FROM test_cql.test_cf 
> WHERE hash = x AND timeid < y;
> 
> If I keep compactions disabled the following query timeouts (takes more than 
> 10 seconds to 
> succeed):
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> ORDER BY timeid ASC;
> 
> While the following returns immediately (obviously because no deleted data is 
> ever read):
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> ORDER BY timeid DESC;
> 
> If I force a compaction the problem is gone, but I presume just because the 
> data is rearranged.
> 
> It seems to me that reading by ASC does not make use of the range tombstone 
> until C* reads the
> last sstables (which actually contains the range tombstone and is flushed at 
> node restart), and it wastes time reading all rows that are actually not live 
> anymore. 
> 
> Is this expected? Should the range tombstone actually help in these cases?
> 
> Thanks a lot!
> Stefano


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Replication issue with Multi DC setup in cassandra

2017-05-16 Thread suraj pasuparthy
Hello,
I am trying to find a way to PREVENT just one of my keyspaces from syncing to
the other datacenter.

I have 2 datacenters setup this way :

Datacenter: DC:4.4.4.4

==

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address  Load   Tokens   Owns (effective)  Host ID
  Rack

UN  4.4.4.4  189.42 KiB  32   100.0%
939f5965-f9d5-4673-b1a4-29fa5ecae0f9  rack1

Datacenter: DC:4.4.4.5

==

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address  Load   Tokens   Owns (effective)  Host ID
  Rack

UN  4.4.4.5  218.29 KiB  32   100.0%
c0d1d859-7ae9-4ce9-a50f-ea316963dbb1  rack1

all my keyspaces have 1 copy in each DC and it works like a charm.

However, I have ONE keyspace that I do not want to sync, and I define the
keyspace this way:

CREATE KEYSPACE nosync WITH replication = {'class':
'NetworkTopologyStrategy', 'DC:4.4.4.4': '1', 'DC:4.4.4.5': '0'}  AND
durable_writes = true;
and I still see that keyspace show up in DC:4.4.4.5.
I even tried:

CREATE KEYSPACE nosync WITH replication = {'class':
'NetworkTopologyStrategy', 'DC:4.4.4.4': '1'}  AND durable_writes = true;

Same issue: I still see the keyspace show up in DC:4.4.4.5.
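
Is there a better way to verify where the replicas actually live than something like the
following? (The table name t1 and the key value are just placeholders.)

DESCRIBE KEYSPACE nosync;

which shows the replication options the cluster currently has for the keyspace, and

nodetool getendpoints nosync t1 somekey

which lists the nodes that own replicas for that partition key.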

Could anyone help me figure this out?

Cheers

-Suraj


Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi Nitan,

Thanks for response.

Yes, I could see mutation drops and increase count in system.hints. Is
there any way , i can proceed to truncate hints like using nodetool
truncatehints.


Regards,
Varun Saluja

On 16 May 2017 at 17:52, Nitan Kainth  wrote:

> Do you see mutation drops?
> Select count from system.hints; is it increasing?
>
> Sent from my iPhone
>
> On May 16, 2017, at 5:52 AM, varun saluja  wrote:
>
> Hi Experts,
>
> We are facing issue on production cluster. Compaction on system.hint table
> is running from last 2 days.
>
>
> pending tasks: 1
>compaction type   keyspace   table completed  total
>   unit   progress
>   Compaction system   hints   20623021829   *877874092407*
>  bytes  2.35%
> Active compaction remaining time :   0h27m15s
>
>
> Active compaction remaining time shows in minutes.  But, this is job is
> running like indefinitely.
>
> We have 3 node cluster V 2.1.7. And we ran  write intensive job last week
> on particular table.
> Compaction on this table finished but hint table size is growing
> continuously.
>
> Can someone Please help me.
>
>
> Thanks & Regards,
> Varun Saluja
>
>


Reg:- Data Modelling Concepts

2017-05-16 Thread @Nandan@
The requirement is to create a DB in which we have to keep the updated
values as well as which user updated the particular book's details and what
they updated.

We would like to create a schema which stores book info as well as the
history of updates made to book_title, author, publisher, and price.
We want to store both the old data and the new data that was written, and
also to see which user made each change, because if some change was not made
correctly they can review the changes and revert to the old values.
We are trying to make a USER-based schema.

For example:-
id:- 1
Name: - Harry Poter
Author : - JK Rolling

New Update Done by user_id 2:-
id :- 1
Name:- Harry Pottor
Author:- J.K. Rolls

Update history also need to store as :-
User_id :- 2
Old Author :- JK Rolling
New Author :- J.K. Rolls

So I need to update the details of a book, which is done by UPSERT. But I also
have to keep details like which user updated it and what was updated.


One thing that helps define the schema is knowing what queries will be made
to the database up front.
A few queries that the database needs to answer:
What are the current details of a book?
What is the most recent update to a particular book?
What are the updates that have been made to a particular book?
What are the details for a particular update?


Update frequency: updates will happen to fields like title, author, price,
and publisher, so updates will not be very frequent.
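
A rough sketch of the kind of schema I have in mind (I am not sure this is the right
approach; the table and column names below are just my guesses):

CREATE TABLE book (
    book_id uuid PRIMARY KEY,
    title text,
    author text,
    publisher text,
    price decimal
);

CREATE TABLE book_update_history (
    book_id uuid,
    update_time timeuuid,
    field_name text,   -- e.g. 'author'
    updated_by int,    -- user_id of the user who made the change
    old_value text,
    new_value text,
    PRIMARY KEY (book_id, update_time, field_name)
) WITH CLUSTERING ORDER BY (update_time DESC, field_name ASC);

The first table would hold the current details of a book, and the second one the update
history with the latest change first, but please correct me if this is the wrong direction.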

Best Regards,
Nandan


Re: Reg:- Data Modelling Concepts

2017-05-16 Thread Jonathan Haddad
I don't understand why you need to store the old value a second time.  If
you know that the value went from A -> B -> C, just store the new value,
not the old.  You can see that it changed from A->B->C without storing it
twice.

On Tue, May 16, 2017 at 6:36 PM @Nandan@ 
wrote:

> The requirement is to create DB in which we have to keep data of Updated
> values as well as which user update the particular book details and what
> they update.
>
> We are like to create a schema which store book info, as well as the
> history of the update, made based on book_title, author, publisher, price
> changed.
> Like we want to store what was old data and what new data updated.. and
> also want to check which user updated the relevant change. Because suppose
> if some changes not made correctly then they can check changes and revert
> based on old values.
> We are trying to make a USER based Schema.
>
> For example:-
> id:- 1
> Name: - Harry Poter
> Author : - JK Rolling
>
> New Update Done by user_id 2:-
> id :- 1
> Name:- Harry Pottor
> Author:- J.K. Rolls
>
> Update history also need to store as :-
> User_id :- 2
> Old Author :- JK Rolling
> New Author :- J.K. Rolls
>
> So I need to update the details of Book which is done by UPSERT. But also
> I have to keep details like which user updated and what updated.
>
>
> One thing that helps define the schema is knowing what queries will be
> made to the database up front.
> Few queries that the database needs to answer.
> What are the current details of a book?
> What is the most recent update to a particular book?
> What are the updates that have been made to a particular book?
> What are the details for a particular update?
>
>
> Update frequently will be like Update will happen based on Title, name,
> Author, price , publisher like. So not very high frequently.
>
> Best Regards,
> Nandan
>


Re: Reg:- Data Modelling Concepts

2017-05-16 Thread Jonathan Haddad
Sorry, I hit return a little early.  What you want is called "event
sourcing": https://martinfowler.com/eaaDev/EventSourcing.html

Think of it as time series applied to state (instead of mutable state)

CREATE TABLE book (
name text,
ts timeuuid,
author text,
primary key (name, ts)
);

for example, if you insert the record:

insert into book (name, ts, author) values ('jon talks data modeling',
now(), 'jon haddad');

and then you find out that my first name is actually jonathan:
insert into book (name, ts, author) values ('jon talks data modeling',
now(), 'jonathan haddad');

now you've got 2 records for book, with a full history of the changes.  The
last change has the current record.
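
And to read it back (a quick sketch against the table above):

-- current details (the most recent record wins)
SELECT * FROM book
WHERE name = 'jon talks data modeling'
ORDER BY ts DESC LIMIT 1;

-- full history of changes, oldest first
SELECT * FROM book
WHERE name = 'jon talks data modeling';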

Jon

On Tue, May 16, 2017 at 6:44 PM Jonathan Haddad  wrote:

> I don't understand why you need to store the old value a second time.  If
> you know that the value went from A -> B -> C, just store the new value,
> not the old.  You can see that it changed from A->B->C without storing it
> twice.
>
> On Tue, May 16, 2017 at 6:36 PM @Nandan@ 
> wrote:
>
>> The requirement is to create DB in which we have to keep data of Updated
>> values as well as which user update the particular book details and what
>> they update.
>>
>> We are like to create a schema which store book info, as well as the
>> history of the update, made based on book_title, author, publisher, price
>> changed.
>> Like we want to store what was old data and what new data updated.. and
>> also want to check which user updated the relevant change. Because suppose
>> if some changes not made correctly then they can check changes and revert
>> based on old values.
>> We are trying to make a USER based Schema.
>>
>> For example:-
>> id:- 1
>> Name: - Harry Poter
>> Author : - JK Rolling
>>
>> New Update Done by user_id 2:-
>> id :- 1
>> Name:- Harry Pottor
>> Author:- J.K. Rolls
>>
>> Update history also need to store as :-
>> User_id :- 2
>> Old Author :- JK Rolling
>> New Author :- J.K. Rolls
>>
>> So I need to update the details of Book which is done by UPSERT. But also
>> I have to keep details like which user updated and what updated.
>>
>>
>> One thing that helps define the schema is knowing what queries will be
>> made to the database up front.
>> Few queries that the database needs to answer.
>> What are the current details of a book?
>> What is the most recent update to a particular book?
>> What are the updates that have been made to a particular book?
>> What are the details for a particular update?
>>
>>
>> Update frequently will be like Update will happen based on Title, name,
>> Author, price , publisher like. So not very high frequently.
>>
>> Best Regards,
>> Nandan
>>
>


Re: Long running compaction on huge hint table.

2017-05-16 Thread varun saluja
Hi,


Truncatehints has been running on the nodes for more than 7 hours now. Nothing
is mentioned about it in the system logs either.

And compaction stats report an increase in the hints total bytes.

pending tasks: 1
   compaction type   keyspace   table   completed     total          unit    progress
        Compaction     system   hints   12152557998   869257869352   bytes   1.40%
Active compaction remaining time :   0h27m14s

Can anything else be checked here? Will manually deleting the system.hints files
and restarting the node fix this?
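
Would checking something like the following help to tell whether the hints compaction is
progressing at all, or is there a better way? (Just a rough sketch.)

nodetool compactionstats    # is the 'completed' byte count for the hints compaction moving?
nodetool tpstats            # HintedHandoff pool active/pending/completed tasks
nodetool netstats           # dropped message counts and any streaming in progress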



Regards,
Varun Saluja

On 16 May 2017 at 23:29, varun saluja  wrote:

> Hi Jeff,
>
> I ran nodetool truncatehints  on all nodes. Its running for more than 30
> mins now. Status for compactstats reports same.
>
> pending tasks: 1
>compaction type   keyspace   table completed  totalunit
>   progress
> Compaction system   hints   11189118129   851658989612   bytes
>  1.31%
> Active compaction remaining time :   0h26m43s
>
> Will truncatehints takes time for completion? Could not see anything
> related truncatehints in system logs.
>
> Please let me know if anything else can be checked here.
>
> Regards,
> Varun Saluja
>
>
>
> On 16 May 2017 at 20:58, varun saluja  wrote:
>
>> Thanks a lot Jeff.
>>
>> You have explaned very well here. We have consitency as local quorum.
>> Will follow truncate hints and repair therafter.
>>
>> I hope this brings cluster in stable state
>>
>> Thanks again.
>>
>> Regards,
>> Varun Saluja
>>
>> Sent from my iPhone
>>
>> > On 16-May-2017, at 8:42 PM, Jeff Jirsa  wrote:
>> >
>> >
>> > In Cassandra versions up to 3.0, hints are stored within a table, where
>> the partition key is the host ID of the server for which the hints are
>> stored.
>> >
>> > In such a data model, accumulating 800GB of hints is almost certain to
>> cause very wide rows, which will in turn cause GC pressure when you attempt
>> to read the hints for delivery. This will cause GC pauses, which will cause
>> hints to fail to be delivered, which will cause more hints to be stored.
>> This is bad.
>> >
>> > In 3.0, hints were rewritten to work around this design flaw. In 2.1,
>> your most likely corrective course is to use 'nodetool truncatehints' on
>> all servers, followed by 'nodetool repair' to deliver the data you lost by
>> truncating the hints.
>> >
>> > NOTE: this is ONLY safe if you wrote with a consistency level stronger
>> than CL:ANY. If you wrote this data with CL:ANY, you may lose data if you
>> truncate hints.
>> >
>> > - Jeff
>> >
>> >> On 2017-05-16 06:50 (-0700), varun saluja  wrote:
>> >> Thanks for update.
>> >> I could see lot of io waits. This causing  Gc and mutation drops .
>> >> But as i mentioned we do not have high load for now. Hint replays are
>> creating such high disk I/O.
>> >> compactionstats show very high hint bytes like 780gb around. Is this
>> normal?
>> >>
>> >> Just mentioning we are using flash disks.
>> >>
>> >> In such case, if i run truncatehints , will it remove or decrease size
>> of hints bytes in compaction stats. I can trigger repair therafter.
>> >> Please let me know if any recommendation on same.
>> >>
>> >> Also , table which we dumped from kafka which created this much hints
>> and compaction pendings is also dropped today. Because we have to redump
>> table again once cluster is stable.
>> >>
>> >> Regards,
>> >> Varun
>> >>
>> >> Sent from my iPhone
>> >>
>> >>> On 16-May-2017, at 6:59 PM, Nitan Kainth  wrote:
>> >>>
>> >>> Yes but it means data has to be replicated using repair.
>> >>>
>> >>> Hints are out come of unhealthy nodes, focus on finding why you have
>> mutation drops, is it node, io or network etc. ideally you shouldn't see
>> increasing hints all the time.
>> >>>
>> >>> Sent from my iPhone
>> >>>
>>  On May 16, 2017, at 7:58 AM, varun saluja 
>> wrote:
>> 
>>  Hi Nitan,
>> 
>>  Thanks for response.
>> 
>>  Yes, I could see mutation drops and increase count in system.hints.
>> Is there any way , i can proceed to truncate hints like using nodetool
>> truncatehints.
>> 
>> 
>>  Regards,
>>  Varun Saluja
>> 
>> > On 16 May 2017 at 17:52, Nitan Kainth  wrote:
>> > Do you see mutation drops?
>> > Select count from system.hints; is it increasing?
>> >
>> > Sent from my iPhone
>> >
>> >> On May 16, 2017, at 5:52 AM, varun saluja 
>> wrote:
>> >>
>> >> Hi Experts,
>> >>
>> >> We are facing issue on production cluster. Compaction on
>> system.hint table is running from last 2 days.
>> >>
>> >>
>> >> pending tasks: 1
>> >>   compaction type   keyspace   table completed  total
>> unit   progress
>> >>  Compaction system   hints   20623021829
>>  877874092407   bytes  2.35%
>> >> Active compaction 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well, as mentioned, Cassandra probably doesn’t have the logic and data needed to skip 
bigger regions of deleted data based on a range tombstone. If some piece of data 
in a partition is newer than the tombstone, then it cannot be skipped. 
Some partition-level statistics of cell ages would therefore need to be kept in 
the column index to enable that skipping, and that is probably not there.

Hannu 

> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
> 
> That is another way to see the question: are reverse iterators range 
> tombstone aware? Yes.
> That is why I am puzzled by this afore-mentioned behavior. 
> I would expect them to handle this case more gracefully.
> 
> Cheers,
> Stefano
> 
> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  > wrote:
> Hannu,
> 
> How can you read a partition in reverse?
> 
> Sent from my iPhone
> 
> > On May 16, 2017, at 9:20 AM, Hannu Kröger  > > wrote:
> >
> > Well, I’m guessing that Cassandra doesn't really know if the range 
> > tombstone is useful for this or not.
> >
> > In many cases it might be that the partition contains data that is within 
> > the range of the tombstone but is newer than the tombstone and therefore it 
> > might be still be returned. Scanning through deleted data can be avoided by 
> > reading the partition in reverse (if all the deleted data is in the 
> > beginning of the partition). Eventually you will still end up reading a lot 
> > of tombstones but you will get a lot of live data first and the implicit 
> > query limit of 1 probably is reached before you get to the tombstones. 
> > Therefore you will get an immediate answer.
> >
> > Does it make sense?
> >
> > Hannu
> >
> >> On 16 May 2017, at 16:33, Stefano Ortolani  >> > wrote:
> >>
> >> Hi all,
> >>
> >> I am seeing inconsistencies when mixing range tombstones, wide partitions, 
> >> and reverse iterators.
> >> I still have to understand if the behaviour is to be expected hence the 
> >> message on the mailing list.
> >>
> >> The situation is conceptually simple. I am using a table defined as 
> >> follows:
> >>
> >> CREATE TABLE test_cql.test_cf (
> >>  hash blob,
> >>  timeid timeuuid,
> >>  PRIMARY KEY (hash, timeid)
> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
> >>
> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
> >> _half_ of that partition by executing the query below, and restart the 
> >> node:
> >>
> >> DELETE
> >> FROM test_cql.test_cf
> >> WHERE hash = x AND timeid < y;
> >>
> >> If I keep compactions disabled the following query timeouts (takes more 
> >> than 10 seconds to
> >> succeed):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid ASC;
> >>
> >> While the following returns immediately (obviously because no deleted data 
> >> is ever read):
> >>
> >> SELECT *
> >> FROM test_cql.test_cf
> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> >> ORDER BY timeid DESC;
> >>
> >> If I force a compaction the problem is gone, but I presume just because 
> >> the data is rearranged.
> >>
> >> It seems to me that reading by ASC does not make use of the range 
> >> tombstone until C* reads the
> >> last sstables (which actually contains the range tombstone and is flushed 
> >> at node restart), and it wastes time reading all rows that are actually 
> >> not live anymore.
> >>
> >> Is this expected? Should the range tombstone actually help in these cases?
> >>
> >> Thanks a lot!
> >> Stefano
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> > 
> > For additional commands, e-mail: user-h...@cassandra.apache.org 
> > 
> >
> 



Re: Non-zero nodes are marked as down after restarting cassandra process

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 07:07 (-0700), Andrew Jorgensen  
wrote: 
> Thanks for the info!
> 
> When you say "overall stability problems due to some bugs", can you
> elaborate on if those were bugs in cassandra that were fixed due to an
> upgrade or bugs in your own code and how you used cassandra. If the latter
> would  it be possible to highlight what the most impactful fix was from the
> usage side.

For what it's worth, there have been HUNDREDS of bugs fixed in 3.0 since your 
3.0.3 release, many of which are fairly important - while it's unlikely to fix 
the behavior you describe, upgrading to latest 3.0 is probably a good idea.

Anecdotally, the behavior you describe is similar to a condition I saw once at 
a previous employer on a very different (much older) version of cassandra, and 
it was accompanied by a few thousand bytes in a tcp send queue that lasted long 
after I'd have expected it to be closed. Never really investigated, but if you 
see it happen again, capturing the output of 'netstat -n' and 'lsof' on the 
servers involved would help understand what's going on (open a jira, upload the 
output).
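
(For example, something along these lines, assuming the default inter-node storage port 7000;
adjust the port and IPs to your configuration:

netstat -n | grep 7000
lsof -nP -iTCP | grep 7000

The first shows per-connection Send-Q/Recv-Q to the affected peers, the second shows which
process owns those sockets and their state.)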




-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Decommissioned node cluster shows as down

2017-05-16 Thread Mark Furlong
I have a node I decommissioned on a large ring using 2.1.12. The node completed 
the decommission process and is no longer communicating with the rest of the 
cluster. However when I run a nodetool status on any node in the cluster it 
shows the node as ‘DN’. Why is this and should I just run a removenode now?
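
(That is, something like:

nodetool removenode <host-id>

using the host ID that nodetool status still reports for the decommissioned node, or is that
the wrong tool here?)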

Thanks,
Mark Furlong

Sr. Database Administrator

mfurl...@ancestry.com
M: 801-859-7427
O: 801-705-7115
1300 W Traverse Pkwy
Lehi, UT 84043




Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
Thank you Stefano
> On May 16, 2017, at 10:56 AM, Stefano Ortolani  wrote:
> 
> No, because C* has reverse iterators.
> 
> On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth  > wrote:
> If the data is stored in ASC order and query asks for DESC, then wouldn’t it 
> read whole partition in first and then pick data from reverse order?
> 
> 
>> On May 16, 2017, at 10:03 AM, Stefano Ortolani > > wrote:
>> 
>> Hi Hannu,
>> 
>> the piece of data in question is older. In my example the tombstone is the 
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and 
>> the data is clustering key sorted, I would expect a linear scan not to be 
>> necessary.
>> 
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger > > wrote:
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>> bigger regions of deleted data based on range tombstone. If some piece of 
>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>> Therefore some partition level statistics of cell ages would need to be kept 
>> in the column index for the skipping and that is probably not there.
>> 
>> Hannu 
>> 
>>> On 16 May 2017, at 17:33, Stefano Ortolani >> > wrote:
>>> 
>>> That is another way to see the question: are reverse iterators range 
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior. 
>>> I would expect them to handle this case more gracefully.
>>> 
>>> Cheers,
>>> Stefano
>>> 
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth >> > wrote:
>>> Hannu,
>>> 
>>> How can you read a partition in reverse?
>>> 
>>> Sent from my iPhone
>>> 
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger >> > > wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is within 
>>> > the range of the tombstone but is newer than the tombstone and therefore 
>>> > it might be still be returned. Scanning through deleted data can be 
>>> > avoided by reading the partition in reverse (if all the deleted data is 
>>> > in the beginning of the partition). Eventually you will still end up 
>>> > reading a lot of tombstones but you will get a lot of live data first and 
>>> > the implicit query limit of 1 probably is reached before you get to 
>>> > the tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani >> >> > wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence the 
>>> >> message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as 
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>> >> oldest _half_ of that partition by executing the query below, and 
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes more 
>>> >> than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted 
>>> >> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just because 
>>> >> the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range 
>>> >> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is 
>>> >> flushed at node restart), and it wastes time reading all rows that are 
>>> >> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these 
>>> >> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > 

Re: Replication issue with Multi DC setup in cassandra

2017-05-16 Thread suraj pasuparthy
So I thought the same.
I see the data via cqlsh in both the datacenters. Consistency is set to
LQ.

thanks
-Suraj

On Tue, May 16, 2017 at 2:19 PM, Nitan Kainth  wrote:

> Do you see data on other DC or just directory structure? Directory
> structure would populate because it is DDL but inserts shouldn’t populate,
> ideally.
>
> On May 16, 2017, at 3:19 PM, suraj pasuparthy 
> wrote:
>
> elp me fig
>
>
>


-- 
Suraj Pasuparthy

cisco systems
Software Engineer
San Jose CA


Re: Replication issue with Multi DC setup in cassandra

2017-05-16 Thread Nitan Kainth
Check for data files on the filesystem in both DCs.

> On May 16, 2017, at 4:42 PM, suraj pasuparthy  
> wrote:
> 
> So i though the same,
> I see the data via the CQLSH in both the datacenters. consistency is set to LQ
> 
> thanks
> -Suraj
> 
> On Tue, May 16, 2017 at 2:19 PM, Nitan Kainth  > wrote:
> Do you see data on other DC or just directory structure? Directory structure 
> would populate because it is DDL but inserts shouldn’t populate, ideally.
> 
>> On May 16, 2017, at 3:19 PM, suraj pasuparthy > > wrote:
>> 
>> elp me fig
> 
> 
> 
> 
> -- 
> Suraj Pasuparthy
> 
> cisco systems
> Software Engineer
> San Jose CA



Re: Replication issue with Multi DC setup in cassandra

2017-05-16 Thread suraj pasuparthy
Yes, I see them in the datacenter's data directories. In fact I see them
even after I bring down the interface between the 2 DCs, which further
confirms that a local copy is maintained in the DC that was not configured
in the strategy.
It's quite important that we block the data for this keyspace from
replicating :( .. not sure why this does not work.

Thanks
Suraj

On Tue, May 16, 2017 at 3:06 PM Nitan Kainth  wrote:

> check for datafiles on filesystem in both DCs.
>
> On May 16, 2017, at 4:42 PM, suraj pasuparthy 
> wrote:
>
> So i though the same,
> I see the data via the CQLSH in both the datacenters. consistency is set
> to LQ
>
> thanks
> -Suraj
>
> On Tue, May 16, 2017 at 2:19 PM, Nitan Kainth  wrote:
>
>> Do you see data on other DC or just directory structure? Directory
>> structure would populate because it is DDL but inserts shouldn’t populate,
>> ideally.
>>
>> On May 16, 2017, at 3:19 PM, suraj pasuparthy 
>> wrote:
>>
>> elp me fig
>>
>>
>>
>
>
> --
> Suraj Pasuparthy
>
> cisco systems
> Software Engineer
> San Jose CA
>
>
>
>
>
>


Re: Replication issue with Multi DC setup in cassandra

2017-05-16 Thread Nitan Kainth
Do you see data on the other DC or just the directory structure? The directory structure 
would populate because it is DDL, but inserts shouldn’t populate it, ideally.

> On May 16, 2017, at 3:19 PM, suraj pasuparthy  
> wrote:
> 
> elp me fig