Re: Truncate data from a single node

2017-07-12 Thread Kevin O'Connor
Thanks for the suggestions! Could altering the RF from 2 to 1 cause any
issues, or will it basically just change the coordinator's write paths and
guide future repairs/cleanups?

On Wed, Jul 12, 2017 at 22:29 Jeff Jirsa  wrote:

>
>
> On 2017-07-11 20:09 (-0700), "Kevin O'Connor" 
> wrote:
> > This might be an interesting question - but is there a way to truncate
> > data from just a single node or two as a test, instead of truncating from
> > the entire cluster? We have time series data that we don't really mind
> > having gaps in, but it's taking up a huge amount of space and we're
> > looking to clear some. I'm worried that if we run a truncate on this huge
> > CF it'll end up locking up the cluster, but I don't care so much if it
> > just kills a single node.
> >
>
> IF YOU CAN TOLERATE DATA INCONSISTENCIES, you can stop a node, delete some
> sstables, and start it again. The risk in deleting arbitrary sstables is
> that you may remove a tombstone and bring data back to life, or remove the
> only replica of a write if you write at CL:ONE, but if you're OK with
> data going missing, you won't hurt much as long as you stop Cassandra
> before you go killing sstables.
>
> TWCS does make this easier, because you can use sstablemetadata to
> identify timestamps/tombstone %s, and then nuke sstables that are
> old/mostly-expired first.
>
>
> > Is doing something like deleting SSTables from disk possible? If I alter
> > this keyspace from an RF of 2 down to 1 and then delete them, they won't
> > be able to be repaired, if I'm thinking this through right.
> >
>
> If you drop RF from 2 to 1, you can just run cleanup and delete half the
> data (though it'll rewrite sstables to do it, which will be a short-term
> increase in disk usage).
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Truncate data from a single node

2017-07-12 Thread Jeff Jirsa


On 2017-07-11 20:09 (-0700), "Kevin O'Connor"  wrote: 
> This might be an interesting question - but is there a way to truncate data
> from just a single node or two as a test, instead of truncating from the
> entire cluster? We have time series data that we don't really mind having
> gaps in, but it's taking up a huge amount of space and we're looking to
> clear some. I'm worried that if we run a truncate on this huge CF it'll
> end up locking up the cluster, but I don't care so much if it just kills
> a single node.
> 

IF YOU CAN TOLERATE DATA INCONSISTENCIES, you can stop a node, delete some 
sstables, and start it again. The risk in deleting arbitrary sstables is that 
you may remove a tombstone and bring data back to life, or remove the only 
replica of a write if you write at CL:ONE, but if you're OK with data going 
missing, you won't hurt much as long as you stop Cassandra before you go 
killing sstables.

TWCS does make this easier, because you can use sstablemetadata to identify 
timestamps/tombstone %s, and then nuke sstables that are old/mostly-expired 
first.
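If it helps, the selection step can be scripted. Below is a Python sketch run
against a made-up fragment of sstablemetadata output; the exact field names and
file names vary by Cassandra version, so treat the "Maximum timestamp:" format
and the sstable names as assumptions and check your own output first.

```python
# Rank sstables for deletion, oldest maximum write timestamp first.
# SAMPLE_OUTPUT stands in for per-file sstablemetadata output (hypothetical).
SAMPLE_OUTPUT = {
    "mc-101-big-Data.db": "Maximum timestamp: 1468000000000000",
    "mc-150-big-Data.db": "Maximum timestamp: 1480000000000000",
    "mc-205-big-Data.db": "Maximum timestamp: 1499800000000000",
}

def max_timestamp(metadata_text):
    """Pull the 'Maximum timestamp' value (microseconds) out of one file's output."""
    for line in metadata_text.splitlines():
        if line.strip().startswith("Maximum timestamp:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Maximum timestamp' line found")

def deletion_candidates(outputs):
    """Oldest sstables first: the safest candidates to remove under TWCS."""
    return sorted(outputs, key=lambda name: max_timestamp(outputs[name]))

print(deletion_candidates(SAMPLE_OUTPUT))
```

In practice you would populate the dictionary from sstablemetadata run against
each Data.db file while the node is stopped.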


> Is doing something like deleting SSTables from disk possible? If I alter
> this keyspace from an RF of 2 down to 1 and then delete them, they won't be
> able to be repaired if I'm thinking this through right.
> 

If you drop RF from 2 to 1, you can just run cleanup and delete half the data 
(though it'll rewrite sstables to do it, which will be a short-term increase in 
disk usage).
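As a sanity check on the "half the data" claim, here is a toy replica-placement
model in Python. The ring, hash, and node names are hypothetical and far
simpler than Cassandra's real token allocation; it only illustrates why
dropping RF from 2 to 1 plus cleanup removes roughly half of each node's rows.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical 4-node ring

def replicas(key, rf):
    """Toy placement: hash the key to a ring position, take the next rf nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(rf)]

keys = ["row-%d" % i for i in range(10000)]
held_rf2 = sum(1 for k in keys if "node1" in replicas(k, 2))  # before the drop
held_rf1 = sum(1 for k in keys if "node1" in replicas(k, 1))  # after drop + cleanup
print(held_rf2, held_rf1)
```

With RF=2 each node holds about half the keys; after the drop to RF=1 and a
cleanup it keeps only the ranges it owns as primary (about a quarter here), so
roughly half of what it previously held gets deleted.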





Re: index_interval

2017-07-12 Thread Jeff Jirsa


On 2017-07-12 12:03 (-0700), Fay Hou [Storage Service]
wrote: 
> First, a big thank-you to Jeff, who spends endless time helping this mailing
> list. Agreed that we should tune the key cache. In my case, my key cache hit
> rate is about 20%, mainly because we do random reads. We're just going to
> leave the index_interval as is for now.
> 

That's pretty painful. If you can up that a bit, it'll probably help you out. 
You can adjust the index intervals, too, but I'd significantly increase key 
cache size first if it were my cluster.
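For anyone following along, the index_interval trade-off discussed in this
thread can be sketched with a toy model: the index summary keeps every Nth
partition key, so a lookup binary-searches the sample and then scans at most N
entries. This is only an illustration of the sampling idea, not Cassandra's
actual implementation.

```python
from bisect import bisect_right

def build_summary(sorted_keys, interval):
    """Keep every `interval`-th key with its position, like the index summary."""
    return [(k, i) for i, k in enumerate(sorted_keys) if i % interval == 0]

def lookup(sorted_keys, summary, target):
    """Binary-search the summary, then scan at most `interval` keys on 'disk'."""
    pos = bisect_right([k for k, _ in summary], target) - 1
    if pos < 0:
        return None
    _, start = summary[pos]
    for i in range(start, len(sorted_keys)):
        if sorted_keys[i] == target:
            return i  # found; the scan cost was i - start + 1 entries
        if sorted_keys[i] > target:
            break
    return None

keys = ["key%05d" % i for i in range(1000)]
coarse = build_summary(keys, 128)  # default min_index_interval
fine = build_summary(keys, 64)     # halves the worst-case scan, doubles the summary
print(len(coarse), len(fine), lookup(keys, coarse, "key00777"))
```

Halving the interval doubles the summary's memory footprint but halves the
worst-case scan, which matches the Wikimedia observation quoted below about
lowering 128/2048 to 64/512.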





Re: index_interval

2017-07-12 Thread Fay Hou [Storage Service]
First, a big thank-you to Jeff, who spends endless time helping this mailing
list. Agreed that we should tune the key cache. In my case, my key cache hit
rate is about 20%, mainly because we do random reads. We're just going to
leave the index_interval as is for now.

On Mon, Jul 10, 2017 at 8:47 PM, Jeff Jirsa  wrote:

>
>
> > On 2017-07-10 15:09 (-0700), Fay Hou [Storage Service] <
> fay...@coupang.com> wrote:
> > BY defaults:
> >
> > AND max_index_interval = 2048
> > AND memtable_flush_period_in_ms = 0
> > AND min_index_interval = 128
> >
> > "Cassandra maintains index offsets per partition to speed up the lookup
> > process in the case of key cache misses (see cassandra read path overview
> >  dml_about_reads_c.html>).
> > By default it samples a subset of keys, somewhat similar to a skip list.
> > The sampling interval is configurable with min_index_interval and
> > max_index_interval CQL schema attributes (see describe table). For
> > relatively large blobs like HTML pages we seem to get better read
> > latencies
> > by lowering the sampling interval from 128 min / 2048 max to 64 min / 512
> > max. For large tables like parsoid HTML with ~500G load per node this
> > change adds a modest ~25mb off-heap memory."
> >
> > I wonder if anyone has experience working with max and min
> > index_interval
> > to increase the read speed.
>
> It's usually more efficient to try to tune the key cache, and hope you
> never have to hit the partition index at all. Do you have reason to believe
> you're spending an inordinate amount of IO scanning the partition index? Do
> you know what your key cache hit rate is?
>
>
>
>


RE: SASI and secondary index simultaneously

2017-07-12 Thread Jacques-Henri Berthemet
Hi,

According to SASI source code (3.11.0) it will always have priority over 
regular secondary index:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/SASIIndex.java#L234






    public long getEstimatedResultRows()
    {
        // this is temporary (until proper QueryPlan is integrated into Cassandra)
        // and allows us to priority SASI indexes if any in the query since they
        // are going to be more efficient, to query and intersect, than built-in indexes.
        return Long.MIN_VALUE;
    }


I see that index building progress is reported as a CompactionInfo task, so you 
should be able to monitor progress using ‘nodetool compactionstats’. Last 
point: from the moment the SASI index is created it will be used instead of the 
regular index, so I think you could drop the regular one as soon as the SASI 
index is created; it will make no difference. It also means that you may miss 
results until the SASI index is fully built.
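For illustration, the effect of that return value is that whichever candidate
index claims the fewest estimated result rows gets chosen, and Long.MIN_VALUE
always wins. A hypothetical Python model of the selection follows; the regular
index's row estimate here is invented.

```python
LONG_MIN = -2**63  # Java Long.MIN_VALUE, what SASIIndex.getEstimatedResultRows returns

# Candidate indexes on the same column; the regular 2i estimate is made up.
candidates = {
    "tb_name_idx": 40000,        # regular secondary index
    "tb_name_idx_1": LONG_MIN,   # SASI always claims the fewest rows
}

def choose_index(estimates):
    """Pick the index claiming the fewest estimated result rows (lowest wins)."""
    return min(estimates, key=estimates.get)

print(choose_index(candidates))
```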

Note that I may be wrong; I’m just reading the sources, as I’m working on a 
custom index.

--
Jacques-Henri Berthemet

From: Vlad [mailto:qa23d-...@yahoo.com.INVALID]
Sent: Wednesday, 12 July 2017 08:56
To: User cassandra.apache.org 
Subject: SASI and secondary index simultaneously

Hi,

it's possible to create both regular secondary index and SASI on the same 
column:

CREATE TABLE ks.tb (id int PRIMARY KEY,  name text);
CREATE CUSTOM INDEX tb_name_idx_1 ON ks.tb (name) USING 
'org.apache.cassandra.index.sasi.SASIIndex';
CREATE INDEX tb_name_idx ON ks.tb (name);
But which one is used for SELECT? Assuming we have a regular index and would
like to migrate to SASI, can we first create the SASI index, then drop the
regular one? And how can we check when the index build is completed?

Thanks.




Re: SASI and secondary index simultaneously

2017-07-12 Thread DuyHai Doan
In the original source code, SASI will be chosen instead of the secondary index.

On 12 Jul 2017 09:13, "Vlad"  wrote:

> Hi,
>
> it's possible to create both regular secondary index and SASI on the same
> column:
>
>
> CREATE TABLE ks.tb (id int PRIMARY KEY, name text);
> CREATE CUSTOM INDEX tb_name_idx_1 ON ks.tb (name)
> USING 'org.apache.cassandra.index.sasi.SASIIndex';
> CREATE INDEX tb_name_idx ON ks.tb (name);
> But which one is used for SELECT? Assuming we have a regular index and would
> like to migrate to SASI, can we first create the SASI index, then drop the
> regular one? And how can we check when the index build is completed?
>
> Thanks.
>
>
>


Re: reduced num_token = improved performance ??

2017-07-12 Thread Chris Lohfink
Probably worth mentioning that some operational procedures, like repairs and
bootstrapping, are helped massively by using fewer tokens. Incremental
repair is one of the things I would say is most impacted by it, since
fewer tokens mean fewer local ranges to iterate through and less
anticompaction. I would highly recommend using far fewer than 256 in 3.x.
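The hot-spot effect of purely random token allocation, and why more tokens per
node smooth it out, can be simulated in a few lines. This is a toy model of
random placement on a unit ring, not Cassandra's allocator.

```python
import random

def ownership_spread(num_nodes, tokens_per_node, seed=42):
    """Place tokens randomly on a unit ring; return max/min ownership ratio,
    a rough measure of hot spots (1.0 would be perfectly even)."""
    random.seed(seed)
    ring = sorted(
        (random.random(), node)
        for node in range(num_nodes)
        for _ in range(tokens_per_node)
    )
    owned = [0.0] * num_nodes
    for i, (token, node) in enumerate(ring):
        # Each token owns the range back to the previous token (wrapping).
        prev = ring[i - 1][0] if i > 0 else ring[-1][0] - 1.0
        owned[node] += token - prev
    return max(owned) / min(owned)

spread_16 = ownership_spread(12, 16)    # fewer vnodes: lumpier ownership
spread_256 = ownership_spread(12, 256)  # more vnodes: randomness averages out
print(round(spread_16, 2), round(spread_256, 2))
```

This is why fewer vnodes need the smarter allocation from CASSANDRA-7032
(mentioned below) to avoid uneven load, while 256 random tokens even out on
their own at the cost of the operational overhead described above.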

Chris

On Tue, Jul 11, 2017 at 8:36 PM, Justin Cameron 
wrote:

> Hi,
>
> Using fewer vnodes means you'll have a higher chance of hot spots in your
> cluster. Hot spots in Cassandra are nodes that, by random chance, are
> responsible for a higher percentage of the token space than others. This
> means they will receive more data and also more traffic/load than other
> nodes in the cluster.
>
> CASSANDRA-7032 goes a long way towards addressing this issue by allocating
> vnode tokens more intelligently, rather than just randomly assigning them.
> If you're using a version of Cassandra that contains this feature (3.0+),
> you can use a smaller number of vnodes in your cluster.
>
> A high number of vnodes won't affect performance for most Cassandra
> workloads, but if you're running tasks that need to do token-range scans
> (such as Spark), there is usually a significant performance hit.
>
> If you're on C* 3.0+ and are using Spark (or similar workloads - cassandra
> lucene index plugin is also affected) then I'd recommend using fewer vnodes
> - 16 would be ok. You'll probably still see some variance in token-space
> ownership between nodes, but the trade-off for better Spark performance
> will likely be worth it.
>
> Justin
>
> On Wed, 12 Jul 2017 at 00:34 ZAIDI, ASAD A  wrote:
>
>> Hi Folks,
>>
>>
>>
>> Pardon me if I’m missing  something obvious.  I’m still using
>> apache-cassandra 2.2 and planning for upgrade to  3.x.
>>
>> I came across this jira
>> [https://issues.apache.org/jira/browse/CASSANDRA-7032], which suggests that
>> reducing num_tokens may improve the general performance of Cassandra;
>> having num_tokens=16 instead of 256 may help!
>>
>>
>>
>> Can you please suggest whether having a lower num_tokens would provide real
>> performance benefits, or if it comes with any downsides that we should also
>> consider? I'll much appreciate your insights.
>>
>>
>>
>> Thank you
>>
>> Asad
>>
> --
>
>
> *Justin Cameron*Senior Software Engineer
>
>
> 
>
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>


SASI and secondary index simultaneously

2017-07-12 Thread Vlad
Hi,
it's possible to create both regular secondary index and SASI on the same 
column:
CREATE TABLE ks.tb (id int PRIMARY KEY,  name text);
CREATE CUSTOM INDEX tb_name_idx_1 ON ks.tb (name) USING 
'org.apache.cassandra.index.sasi.SASIIndex';
CREATE INDEX tb_name_idx ON ks.tb (name);

But which one is used for SELECT? Assuming we have a regular index and would
like to migrate to SASI, can we first create the SASI index, then drop the
regular one? And how can we check when the index build is completed?
Thanks.




Re: c* updates not getting reflected.

2017-07-12 Thread techpyaasa .
Hi Carlos Rolo

Using LOCAL_QUORUM for both writes & reads.
I see there is a time difference of 2 minutes among the nodes; I think that
could be the reason.
Anyways thanks for replying Carlos Rolo...
Have a nice day... :)
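For what it's worth, a 2-minute skew is enough: Cassandra resolves conflicting
cell versions by last-write-wins on the write timestamp, so an UPDATE issued
through a node (or client) whose clock lags can be silently shadowed by an
earlier write. A minimal sketch of that resolution rule:

```python
MICROS = 1000000

def resolve(*cells):
    """Return the winning (write_timestamp_micros, value) cell; highest timestamp wins."""
    return max(cells, key=lambda cell: cell[0])

t_first = 1500000000 * MICROS      # first UPDATE, from a node with a correct clock
t_second = t_first - 120 * MICROS  # later UPDATE, but its clock is 2 minutes behind

winner = resolve((t_first, "status=0"), (t_second, "status=1"))
print(winner)  # the chronologically later UPDATE loses: its timestamp is older
```

Keeping clocks tightly synced with NTP, or setting client-side timestamps from
a single source, avoids this; you can also check which write won with
SELECT WRITETIME(status) in cqlsh.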

On Wed, Jul 12, 2017 at 12:45 AM, Carlos Rolo  wrote:

> What consistency are you using on those queries?
>
> On 11 Jul 2017 19:09, "techpyaasa ."  wrote:
>
>> Hi,
>>
>> We have a table with following schema:
>>
>> CREATE TABLE ks1.cf1 ( pid bigint, cid bigint, resp_json text, status
>> int, PRIMARY KEY (pid, cid) ) WITH CLUSTERING ORDER BY (cid ASC) with LCS
>> compaction strategy.
>>
>> We make very frequent updates to this table with queries like:
>>
>> UPDATE ks1.cf1 SET status = 0 where pid=1 and cid=1;
>> UPDATE ks1.cf1 SET resp_json='' where uid=1 and mid=1;
>>
>>
>> Now we are seeing a strange issue where sometimes the status column or
>> resp_json column value is not getting updated when we query with a SELECT.
>>
>> We are not seeing any exceptions during UPDATE query executions, though.
>> Also, is there any way to make sure that the last UPDATE was a success?
>>
>> We are using c* - 2.1.17 , datastax java driver 2.1.18.
>>
>> Can someone point out what the issue is, or has anybody faced such a
>> strange issue?
>>
>> Any help is appreciated.
>>
>> Thanks in advance
>> TechPyaasa
>>
>
> --
>
>
>
>


Re: Data Model Suggestion Required

2017-07-12 Thread Siddharth Prakash Singh
Thanks, Jeff, for the suggestions.

On Mon, Jul 10, 2017 at 9:50 PM Jeff Jirsa  wrote:

>
>
> On 2017-07-10 07:13 (-0700), Siddharth Prakash Singh 
> wrote:
> > I am planning to build a user activity timeline. Users on our system
> > generates different kind of activity. For example - Search some product,
> > Calling our sales team, Marking favourite etc.
> > Now I would like to generate timeline based on these activities. Timeline
> > could be for all events, filtered on specific set of events, filtered on
> > time interval, filtered on specific set of events between time intervals.
> > Composite column keys looks like a viable solution.
> >
> > Any other thoughts here?
> >
>
> You probably want to take advantage of multiple/compound clustering keys,
> at least one of which being a timeuuid to give yourself ordering, and one
> giving you a 'type' of event.
>
> CREATE TABLE whatever (
> product_id uuid ,
> event_type text,
> event_id timeuuid,
> event_action text,
> event_data text,
> PRIMARY KEY(product_id, event_id, event_type, event_action, event_data));
>
> This will let you do "SELECT * FROM whatever WHERE product_id=?" and get
> all of the events, sorted by time, then by type, then you can have another
> unique "action", and finally a data field where you can shove your blob of
> whatever it is.  This would let you do time slices by specifying "event_id
> >= X and event_id < Y", but you'd need (want) to filter event_type client
> side.
>
> Alternatively, PRIMARY KEY(product_id, event_type, event_id, event_action,
> event_data) would let you do event_type=X and event_id >= Y and event_id <
> Z, which is all events of a given type within a slice.
>
> "product_id" may not be the natural partition key, feel free to use a
> compound partition key as well (may be "PRIMARY KEY((product_id,
> office_id), event_type, event_id, event_action, event_data)" to make a
> partition-per-office, as a silly example.
>
>
>
>
>
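The time-slice queries described in this thread can also be sketched client
side. The toy rows below stand in for one partition's clustering-ordered cells
under Jeff's first layout, with plain integers substituting for timeuuid event
IDs; the event names and payloads are made up.

```python
from bisect import bisect_left

# Rows ordered by clustering key (event_id first), as Cassandra would return them.
rows = sorted([
    (1001, "search", "view", "{...}"),
    (1005, "call",   "made", "{...}"),
    (1009, "search", "view", "{...}"),
    (1012, "fav",    "add",  "{...}"),
])

def time_slice(rows, lo, hi, event_type=None):
    """event_id >= lo AND event_id < hi, then optional client-side type filter."""
    ids = [r[0] for r in rows]
    picked = rows[bisect_left(ids, lo):bisect_left(ids, hi)]
    return [r for r in picked if event_type is None or r[1] == event_type]

print(time_slice(rows, 1001, 1010))            # all events in the window
print(time_slice(rows, 1001, 1010, "search"))  # type filtered client side
```

With the alternative layout (event_type before event_id in the clustering
order), the type filter moves server side and the time slice applies within
each type, as the reply above describes.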