RE: range queries on partition key supported?

2018-01-31 Thread Tyagi, Preetika
Thank you, Kurt. Just one more clarification.

> > And, then the entire partition on each node will be searched based on the
> > clustering key (i.e. "time" in this case).
>
> No, it will skip to the section of the partition with time = '12:00'.
> Cassandra should be smart enough to avoid reading the whole partition.

Yeah, that seems to be correct. I probably didn't phrase it correctly.

Now let's assume a specific node is selected based on the token range, and we
need to look up the data with time = '12:00' within a partition that falls
within that token range.
On this node, there may be more than one partition (let's take two partitions
as an example) that qualifies for this token range. In that case, both
partitions will need to be looked up to get the data with the given
time = '12:00'.
So I'm wondering how these two partitions will be looked up on this node. What
would the request look like on this node to get these partitions?
Does that make sense? Do you think I'm missing something?

Thanks,
Preetika

-----Original Message-----
From: kurt greaves [mailto:k...@instaclustr.com] 
Sent: Wednesday, January 31, 2018 9:46 PM
To: dev@cassandra.apache.org
Subject: Re: range queries on partition key supported?

>
> So that means more than one node can be selected to fulfill a range
> query based on the token, correct?


Yes. When doing a token range query Cassandra will need to send requests to any 
node that owns part of the token range requested. This could be just one set of 
replicas or more, depending on how your token ring is arranged.
You could avoid querying multiple nodes by limiting the token() calls to be 
within one token range.

> And, then the entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).

No, it will skip to the section of the partition with time = '12:00'.
Cassandra should be smart enough to avoid reading the whole partition.


On 31 January 2018 at 06:57, Tyagi, Preetika wrote:

> So that means more than one node can be selected to fulfill a range
> query based on the token, correct?
>
> I was looking at this link:
> https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
>
> In the example query,
> SELECT * FROM numberOfRequests
> WHERE token(cluster, date) > token('cluster1', '2015-06-03')
> AND token(cluster, date) <= token('cluster1', '2015-06-05')
> AND time = '12:00'
>
> More than one node might get picked for this token-based range query.
> And, then the entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).
> Is my understanding correct?
>
> Thanks,
> Preetika
>
> -----Original Message-----
> From: J. D. Jordan [mailto:jeremiah.jor...@gmail.com]
> Sent: Tuesday, January 30, 2018 10:13 AM
> To: dev@cassandra.apache.org
> Subject: Re: range queries on partition key supported?
>
> A range query can be performed on the token of a partition key, not on 
> the value.
>
> -Jeremiah
>
> > On Jan 30, 2018, at 12:21 PM, Tyagi, Preetika wrote:
> >
> > Hi All,
> >
> > I have a quick question on Cassandra's behavior in the case of partition
> > keys. I know that range queries are allowed in general; however, are they
> > also allowed on partition keys? The partition key is used as an input to
> > determine a node in a cluster, so I'm wondering how one can possibly
> > perform a range query on that.
> >
> > Thanks,
> > Preetika
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>
>


RE: create branch in my github account

2018-01-31 Thread Tyagi, Preetika
Thank you Michael. I was able to create the branch and push my changes! :)

Preetika

-----Original Message-----
From: Michael Shuler [mailto:mshu...@pbandjelly.org] On Behalf Of Michael Shuler
Sent: Tuesday, January 30, 2018 2:04 PM
To: dev@cassandra.apache.org
Subject: Re: create branch in my github account

On 01/30/2018 03:47 PM, Tyagi, Preetika wrote:
> Hi all,
> 
> I'm working on the JIRA ticket CASSANDRA-13981 and pushed a patch
> yesterday; however, it was suggested that I create a branch in my
> github account and push all changes into that. The patch is too
> big, hence this seems to be a better approach. I haven't done it before,
> so I wanted to ensure I do it correctly without messing things up :)
> 
> 
> 1.  On Cassandra GitHub: https://github.com/apache/cassandra,
> click on "Fork" to create my own copy in my account.
> 
> 2.  Git clone on the forked branch above

s/branch/repository/ - this is a new forked repo, not a branch

> 3.  Git checkout 

git checkout trunk
  # since 13981 appears to be for 4.0 (trunk)
  # if you worked off some random sha, you may need to rebase on
  # trunk HEAD, otherwise it may not cleanly merge and that will be
  # the first patch review request.

git checkout -b CASSANDRA-13981
  # create a new branch

> 4.  Apply my patch
> 
> 5.  Git commit -m ""
> 
> 6.  Git push origin trunk

git push origin CASSANDRA-13981  # push a new branch to your fork

> Please let me know if you notice any issues. Thanks for your help!

You could do this in your fork on the trunk branch, but it's probably better
to create a new branch, so you can fetch changes from the upstream trunk
branch and rebase your branch, if that is needed. It is very common to have a
number of remotes configured in your local repository: one for your fork, one
for the apache upstream, others for other users' forks, etc. If you do your
work directly in your trunk branch, you'll have conflicts when pulling in new
commits from apache/cassandra trunk, for example.
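
For reference, the corrected steps above can be collected into one sequence.
The following is only a sketch: the GitHub user name "your-github-user" and
the remote name "upstream" are assumptions made for illustration, and the
script only builds and prints the commands rather than running them.

```python
import shlex

# The apache repository named in the thread; used here as the "upstream" remote.
UPSTREAM_URL = "https://github.com/apache/cassandra.git"


def fork_workflow(github_user, ticket, base="trunk"):
    """Return the git commands, in order, for working on `ticket` in a fork.

    `github_user` is a hypothetical account name; `git -C cassandra` runs each
    command inside the directory created by the clone.
    """
    fork_url = f"https://github.com/{github_user}/cassandra.git"
    in_repo = ["git", "-C", "cassandra"]
    return [
        ["git", "clone", fork_url],                        # origin = your fork
        in_repo + ["remote", "add", "upstream", UPSTREAM_URL],
        in_repo + ["checkout", base],                      # start from trunk
        in_repo + ["checkout", "-b", ticket],              # new branch for the work
        # ... apply the patch and commit on the new branch here ...
        in_repo + ["fetch", "upstream"],                   # pick up new upstream commits
        in_repo + ["rebase", f"upstream/{base}"],          # replay your work on top of them
        in_repo + ["push", "origin", ticket],              # push the branch to your fork
    ]


commands = fork_workflow("your-github-user", "CASSANDRA-13981")
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the work on its own branch means the local trunk stays identical to
upstream trunk, so the fetch and rebase steps do not have to untangle your
commits from new upstream ones.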

--
Michael




Re: range queries on partition key supported?

2018-01-31 Thread kurt greaves
>
> So that means more than one node can be selected to fulfill a range query
> based on the token, correct?


Yes. When doing a token range query Cassandra will need to send requests to
any node that owns part of the token range requested. This could be just
one set of replicas or more, depending on how your token ring is arranged.
You could avoid querying multiple nodes by limiting the token() calls to be
within one token range.

> And, then the entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).

No, it will skip to the section of the partition with time = '12:00'.
Cassandra should be smart enough to avoid reading the whole partition.
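
To make the two lookups concrete, here is a toy model of what one replica does
for the query in this thread. It is a simplified sketch, not Cassandra's
actual read path: the token values, the Partition class and the function
names are all made up for illustration. The only assumptions carried over
from Cassandra are that a node's partitions are ordered by token and that
rows inside a partition are ordered by clustering key.

```python
from bisect import bisect_left, bisect_right


class Partition:
    """Hypothetical in-memory partition: rows kept sorted by clustering key."""

    def __init__(self, key, rows):
        self.key = key
        ordered = sorted(rows)
        self.times = [t for t, _ in ordered]    # clustering key column ("time")
        self.values = [v for _, v in ordered]

    def slice_at(self, time):
        # Binary search on the clustering key: jump to the matching row
        # instead of scanning the whole partition.
        i = bisect_left(self.times, time)
        if i < len(self.times) and self.times[i] == time:
            return [(time, self.values[i])]
        return []


def token_range_scan(tokens, partitions, start, end, time):
    """Return rows with the given time from partitions whose token is in (start, end].

    `tokens` is sorted ascending and `partitions[i]` owns `tokens[i]`.
    """
    lo = bisect_right(tokens, start)   # start token is exclusive (>)
    hi = bisect_right(tokens, end)     # end token is inclusive (<=)
    hits = []
    for partition in partitions[lo:hi]:
        for t, value in partition.slice_at(time):
            hits.append((partition.key, t, value))
    return hits


# Made-up tokens: token order has nothing to do with partition key order.
tokens = [-70, -10, 25, 90]
partitions = [
    Partition(("cluster1", "2015-06-03"), [("12:00", 4)]),
    Partition(("cluster1", "2015-06-04"), [("11:00", 5), ("12:00", 8)]),
    Partition(("cluster2", "2015-06-01"), [("12:00", 1), ("13:00", 4)]),
    Partition(("cluster1", "2015-06-05"), [("12:00", 9)]),
]

result = token_range_scan(tokens, partitions, start=-70, end=25, time="12:00")
print(result)
```

In this model, every partition whose token falls inside the requested range is
visited, and within each one only the slice for time = '12:00' is read. Note
that the partitions returned are whichever ones hash into the token range,
which is why a token range does not correspond to a range of key values.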


On 31 January 2018 at 06:57, Tyagi, Preetika wrote:

> So that means more than one node can be selected to fulfill a range query
> based on the token, correct?
>
> I was looking at this link:
> https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
>
> In the example query,
> SELECT * FROM numberOfRequests
> WHERE token(cluster, date) > token('cluster1', '2015-06-03')
> AND token(cluster, date) <= token('cluster1', '2015-06-05')
> AND time = '12:00'
>
> More than one node might get picked for this token-based range query.
> And, then the entire partition on each node will be searched based on the
> clustering key (i.e. "time" in this case).
> Is my understanding correct?
>
> Thanks,
> Preetika
>
> -----Original Message-----
> From: J. D. Jordan [mailto:jeremiah.jor...@gmail.com]
> Sent: Tuesday, January 30, 2018 10:13 AM
> To: dev@cassandra.apache.org
> Subject: Re: range queries on partition key supported?
>
> A range query can be performed on the token of a partition key, not on the
> value.
>
> -Jeremiah
>
> > On Jan 30, 2018, at 12:21 PM, Tyagi, Preetika wrote:
> >
> > Hi All,
> >
> > I have a quick question on Cassandra's behavior in the case of partition
> > keys. I know that range queries are allowed in general; however, are they
> > also allowed on partition keys? The partition key is used as an input to
> > determine a node in a cluster, so I'm wondering how one can possibly
> > perform a range query on that.
> >
> > Thanks,
> > Preetika
> >
>
>
>


Cassandra Monthly Dev Roundup: Jan 2018 Edition

2018-01-31 Thread Jeff Jirsa
Happy 2018 Cassandra Developers,

I hope you all had a good holiday season. In going through some of the
tickets/emails, I'm pretty happy - we had some contributions from some big
and interesting companies I didn't even realize were using Cassandra, and
that's always fun to see [1].

If you haven't had time to keep up with hot issues this month, there are a
few topics that will cause us to issue a release in the very near
future:

1) https://issues.apache.org/jira/browse/CASSANDRA-14092
- We store expiration times (write time plus TTL) as 32 bit ints, and we cap
users at 20 year TTLs. If you set a TTL to 20 years, that started to overflow
the 32 bit int not long ago. That's bad. Different versions have different
impact, from annoying to very bad. We'll probably cut a release as soon as
this is done. There's some active conversation on the list and on that JIRA -
you should read it if you care about how we handle data when we find a
negative timestamp on disk (read: there's some disagreement, if you have an
opinion, chime in).

2) https://issues.apache.org/jira/browse/CASSANDRA-14173
- The JMX auth stuff used some JDK internals. Those JDK internals changed
with JDK8u161. Sam has a new patch, ready to commit. This will probably get
more and more attention as more and more people upgrade to the newest JDK
and find out Cassandra doesn't start.
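
For anyone wondering why the overflow in item 1 showed up this month, the
arithmetic can be checked with a few lines. This is a sketch under stated
assumptions: the expiration time is taken to be write time plus TTL in
seconds since the epoch, the 20 year cap is taken to be 20 * 365 days, and
the helper names are made up for illustration.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1                      # 2038-01-19T03:14:07Z as epoch seconds
MAX_TTL_SECONDS = 20 * 365 * 24 * 60 * 60  # assumed 20 year cap (365-day years)


def to_int32(n):
    """Wrap an integer the way a signed 32 bit int would."""
    return (n + 2**31) % 2**32 - 2**31


def expiration(write_time, ttl_seconds):
    """Expiration time as it would look after being stored in a 32 bit int."""
    return to_int32(int(write_time.timestamp()) + ttl_seconds)


# Latest write time for which a maximum TTL still fits in 32 bits.
last_safe_write = datetime.fromtimestamp(INT32_MAX - MAX_TTL_SECONDS, tz=timezone.utc)
print(last_safe_write.isoformat())

before = expiration(datetime(2018, 1, 1, tzinfo=timezone.utc), MAX_TTL_SECONDS)
after = expiration(datetime(2018, 1, 31, tzinfo=timezone.utc), MAX_TTL_SECONDS)
print(before, after)
```

Under these assumptions the cutoff lands on 2018-01-24: a write with the
maximum TTL before that date still yields a positive expiration time, and a
write after it wraps around to a negative one.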

In terms of big / interesting commits that landed since the last email:

CASSANDRA-7544 Configurable storage port per node. Huge patch, you probably
care about this if you ever tried to run multiple instances of cassandra on
one IP (like on a laptop), or on different ports in a given cluster (port
7000 on some hosts, and 7001 on others), or similar.

CASSANDRA-14134 upgraded dtests to python3, getting rid of old dependencies
on pycassa (unmaintained), an ancient version of thrift, etc. Another huge
patch, if you're developing locally and running dtests yourself, you now
need python3. Some extra good news - docs are now much improved.

CASSANDRA-14190 is a patch from a new contributor that did something most
operators probably really wish existed 10 years ago - "nodetool
reloadseeds". Really should have existed long ago.

CASSANDRA-9067 speed up bloom filter serialization by 3-7x

CASSANDRA-13867 isn't flashy, but is another step in making more things
immutable for safety - huge patch for PartitionUpdate and Mutation, for
those of you who pay attention to the deep, dark internals.


On the mailing list, a user asked about plans for CDC. If you have an
opinion, it's not too late to chime in:
https://lists.apache.org/thread.html/aaa82c7dab534c3a35cfd1c4a082cb3a8f6bbf97e3efe960fa2342d0@%3Cdev.cassandra.apache.org%3E

Patches that could use reviews:
- https://issues.apache.org/jira/browse/CASSANDRA-14205 (Missing CQL
reserved keywords)
- https://issues.apache.org/jira/browse/CASSANDRA-14201 (new options to
nodetool verify)
- https://issues.apache.org/jira/browse/CASSANDRA-14204 (nodetool
garbagecollect assertion error)
- https://issues.apache.org/jira/browse/CASSANDRA-13981 (changes for
running on systems with persistent memory)
- https://issues.apache.org/jira/browse/CASSANDRA-14197 (more automatic
upgradesstables)
- https://issues.apache.org/jira/browse/CASSANDRA-14176 (2 line python fix
for making COPY work)
- https://issues.apache.org/jira/browse/CASSANDRA-14102 (transparent data
encryption)
- https://issues.apache.org/jira/browse/CASSANDRA-14107 (key rotation for
transparent data encryption)
- https://issues.apache.org/jira/browse/CASSANDRA-14160 (speeding up
compaction by keeping overlapping sstables ordered by time)
- https://issues.apache.org/jira/browse/CASSANDRA-12763 (make compaction
much faster for cases with lots of sstables)
- https://issues.apache.org/jira/browse/CASSANDRA-14126 (fixing javascript
UDFs)
- https://issues.apache.org/jira/browse/CASSANDRA-14070 (exposing primary
key column values in a different way)

I'd like to pretend that that's all the patch-available-needing-review
tickets, but I'd be lying - there's a LOT of patches waiting for reviews.
If you're able, please review a ticket this week. I'll personally buy you a
drink next time I bump into you if you do it and remind me about it.

Until February,
- Jeff



Footnote 1: I'm super tempted to name them, but I know some companies don't
like the attention, and I don't want everyone to feel like they have to
post with personal emails.


Re: CDC usability and future development

2018-01-31 Thread Josh McKenzie
>
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.


Philosophically, my first pass at the feature prioritized minimizing impact
to node performance first and usability second. That meant punting a lot to
future work on the feature: the de-duplication and RbW (read-before-write)
implications of having full column values, or materializing stuff off-heap
for consumption from a user and flagging it as persisted to disk, etc. I
don't personally have any time to devote to moving the feature forward now,
but as Jeff indicates, Jay and Simon are both active in the space and taking
up the torch.
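
To illustrate the de-duplication problem: every replica of a partition logs
its own copy of a mutation, so a consumer reading from several nodes sees the
same update more than once. Below is a minimal sketch of one way a consumer
could unify the streams. It is not part of Cassandra's CDC; the tuple layout
(partition key, write timestamp, payload) and the function names are
hypothetical.

```python
import hashlib


def mutation_id(partition_key, write_ts, payload):
    """Stable identifier so copies of one mutation from different replicas match."""
    digest = hashlib.sha256()
    digest.update(repr((partition_key, write_ts)).encode("utf-8"))
    digest.update(payload)
    return digest.hexdigest()


def unify(streams):
    """Merge per-node streams, keeping the first copy of each mutation."""
    seen = set()   # grows without bound here; a real consumer would expire entries
    unified = []
    for node in sorted(streams):
        for mutation in streams[node]:
            mid = mutation_id(*mutation)
            if mid not in seen:
                seen.add(mid)
                unified.append(mutation)
    unified.sort(key=lambda m: m[1])   # order by write timestamp
    return unified


# Three nodes with overlapping replicas: each mutation appears on two of them.
streams = {
    "node1": [("k1", 100, b"a=1"), ("k2", 105, b"b=2")],
    "node2": [("k1", 100, b"a=1"), ("k3", 103, b"c=3")],
    "node3": [("k2", 105, b"b=2"), ("k3", 103, b"c=3")],
}
unified = unify(streams)
print(unified)
```

Even this toy version shows where the cost goes: the consumer has to remember
what it has already seen across all nodes, which is the kind of work that was
deliberately left out of the first pass.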


On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa wrote:

> Here's a deck of some proposed additions, discussed at one of the NGCC
> sessions last fall:
>
> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>
>
>
> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme wrote:
>
> > Hi all,
> >
> > We are currently designing a system that allows our Cassandra clusters to
> > produce a stream of data updates. Naturally, we have been evaluating
> > whether CDC can aid in this endeavor. We have found several challenges in
> > using CDC for this purpose.
> >
> > CDC provides only the mutation as opposed to the full column value, which
> > tends to be of limited use for us. Applications might want to know the
> > full column value, without having to issue a read back. We also see value
> > in being able to publish the full column value both before and after the
> > update. This is especially true when deleting a column since this stream
> > may be joined with others, or consumers may require other fields to
> > properly process the delete.
> >
> > Additionally, there is some difficulty with processing CDC itself, such
> > as:
> > - Updates not being immediately available (addressed by CASSANDRA-12148)
> > - Each node providing an independent stream of updates that must be
> > unified and deduplicated
> >
> > Our question is, what is the vision for CDC development? The current
> > implementation could work for some use cases, but is a ways from a
> > general streaming solution. I understand that the nature of Cassandra
> > makes this quite complicated, but are there any thoughts or desires on
> > the future direction of CDC?
> >
> > Thanks
> >
> >
>