Re: Issues, understanding how CQL works

2020-04-24 Thread Marc Richter

Hi List,

I just wanted to share what I learned due to my investigations in the 
meantime here, including the solution to my initial question:


  1)
  "I need to know the latest "fromtime" line in the whole DB"

  2)
  "I need a way to learn all values that are in the "monthyear
   and "signalid" columns ..."

To answer 2) first: This turns out to be one single Query:


SELECT
  DISTINCT signalid,monthyear
FROM
  central


This provides a list consisting of just these two value pairs, but all 
of them.


With these lists available, it is also possible to find the "highest" 
fromtime - lines per pair with the following query:



SELECT
  max(fromtime) AS maxfromtime
FROM central
WHERE signalid={signalid} AND monthyear={monthyear}


Wrapping this in a simple "for"-loop in any programming language, this 
results in a - comparatively - short list, from which finding the 
"highest" fromtime - integer is easy.


Thanks for everyone who helped me!

Best regards,
Marc Richter




On 22.04.20 17:57, Marc Richter wrote:

Seems as if sstable2json is deprecated; see [1] and [2].

So, dsbulk [3] it is, I guess.

I downloaded it and crafted the following commandline from the docs [4] 
for my use case:


$ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
   --driver.advanced.auth-provider.class PlainTextAuthProvider \
   -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
   --connector.json.compression gzip -url /path/to/big/storage

This seems to result in multiple JSON files compressed with GZIP; seems 
to be exactly what I needed to help me in this case!


There's only one thing that I do not really understand what it means:
Besides the GZIP archives, it also creates two logfiles. One of them 
(unload-errors.log) contains some Java stacks. I do not understand what 
those lines are supposed to say:


(Added it to pastebin to not render the mail unreadable):

https://pastebin.com/WpYvqxAA

What are those lines supposed to tell me?

Best regards,
Marc Richter




[1] 
https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html 


[2] https://issues.apache.org/jira/browse/CASSANDRA-9618
[3] https://downloads.datastax.com/#bulk-loader
[4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

On 22.04.20 16:15, Marc Richter wrote:

This sounds like a promising way; thank you for bringing this up!

I will see if I can manage it with this approach.

Best regards,
Marc Richter



On 22.04.20 15:38, Durity, Sean R wrote:
I thought this might be a single-time use case request. I think my 
first approach would be to use something like dsbulk to unload the 
data and then reload it into a table designed for the query you want 
to do (as long as you have adequate disk space). I think like a 
DBA/admin first. Dsbulk creates csv files, so you could move that 
data to any kind of database, if you chose.


An alternative approach would be to use a driver that supports paging 
(I think this would be most of them) and write a program to walk the 
data set and output what you need in whatever format you need.


Or, since this is a single node scenario, you could try sstable2json 
to export the sstables (files on disk) into JSON, if that is a more 
workable format for you.


Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

  From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single

Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
not directly related, but you can try to use zstd as compression - in my
tests it performed faster offload, with slightly worse compression ratio

Marc Richter  at "Wed, 22 Apr 2020 17:57:44 +0200" wrote:
 MR> Seems as if sstable2json is deprecated; see [1] and [2].

 MR> So, dsbulk [3] it is, I guess.

 MR> I downloaded it and crafted the following commandline from the docs [4] 
for my use case:

 MR> $ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
 MR>   --driver.advanced.auth-provider.class PlainTextAuthProvider \
 MR>   -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
 MR>   --connector.json.compression gzip -url /path/to/big/storage

 MR> This seems to result in multiple JSON files compressed with GZIP; seems to 
be exactly what
 MR> I needed to help me in this case!

 MR> There's only one thing that I do not really understand what it means:
 MR> Besides the GZIP archives, it also creates two logfiles. One of them 
(unload-errors.log)
 MR> contains some Java stacks. I do not understand what those lines are 
supposed to say:

 MR> (Added it to pastebin to not render the mail unreadable):

 MR> https://pastebin.com/WpYvqxAA

 MR> What are those lines supposed to tell me?
 MR> Marc Richter

 MR> [1] 
https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html
 MR> [2] https://issues.apache.org/jira/browse/CASSANDRA-9618
 MR> [3] https://downloads.datastax.com/#bulk-loader
 MR> [4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

 MR> On 22.04.20 16:15, Marc Richter wrote:
 >> This sounds like a promising way; thank you for bringing this up!
 >>
 >> I will see if I can manage it with this approach.
 >>
 >> Best regards,
 >> Marc Richter
 >>
 >>
 >>
 >> On 22.04.20 15:38, Durity, Sean R wrote:
 >>> I thought this might be a single-time use case request. I think my first 
 >>> approach would
 >>> be to use something like dsbulk to unload the data and then reload it into 
 >>> a table
 >>> designed for the query you want to do (as long as you have adequate disk 
 >>> space). I
 >>> think like a DBA/admin first. Dsbulk creates csv files, so you could move 
 >>> that data to
 >>> any kind of database, if you chose.
 >>>
 >>> An alternative approach would be to use a driver that supports paging (I 
 >>> think this
 >>> would be most of them) and write a program to walk the data set and output 
 >>> what you
 >>> need in whatever format you need.
 >>>
 >>> Or, since this is a single node scenario, you could try sstable2json to 
 >>> export the
 >>> sstables (files on disk) into JSON, if that is a more workable format for 
 >>> you.
 >>>
 >>> Sean Durity – Staff Systems Engineer, Cassandra
 >>>
 >>> -Original Message-
 >>> From: Marc Richter 
 >>> Sent: Wednesday, April 22, 2020 6:22 AM
 >>> To: user@cassandra.apache.org
 >>> Subject: [EXTERNAL] Re: Issues, understanding how CQL works
 >>>
 >>> Hi Jeff,
 >>>
 >>> thank you for your exhaustive and verbose answer!
 >>> Also, a very big "Thank you!" to all the other replyers; I hope you
 >>> understand that I summarize all your feedback in this single answer.
 >>>
 >>>   From what I understand from your answers, Cassandra seems to be
 >>> optimized to store (and read) data in only exactly that way that the
 >>> data structure has been designed for. That makes it very inflexible, but
 >>> allows it to do that single job very effectively for a trade-off.
 >>>
 >>> I also understand, the more I dig into Cassandra, that the team I am
 >>> supporting is using Cassandra kind of wrong; they for example do have
 >>> only one node and so do not use neither the load-balancing, nor the
 >>> redundancy-capabilities Cassandra offers.
 >>> Thus, maybe relevant side-note: All the data resides on just one single
 >>> node; maybe that info is important, because we know on which node the
 >>> data is (I know that Cassandra internally is applying the same Hashing -
 >>> Voodoo as if there were 1k nodes, but maybe this is important anyways).
 >>>
 >>> Anyways: I do not really care if a query or effort to find this
 >>> information is sub-optimal or very "expensive" in means of effectivity
 >>> or system-load, since this isn't something that I need to extract on a
 >>> regular basis, but only once. Due to that, it doesn't need to be optimal
 >>> or effective; I also 

Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

Seems as if sstable2json is deprecated; see [1] and [2].

So, dsbulk [3] it is, I guess.

I downloaded it and crafted the following commandline from the docs [4] 
for my use case:


$ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
  --driver.advanced.auth-provider.class PlainTextAuthProvider \
  -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
  --connector.json.compression gzip -url /path/to/big/storage

This seems to result in multiple JSON files compressed with GZIP; seems 
to be exactly what I needed to help me in this case!


There's only one thing that I do not really understand what it means:
Besides the GZIP archives, it also creates two logfiles. One of them 
(unload-errors.log) contains some Java stacks. I do not understand what 
those lines are supposed to say:


(Added it to pastebin to not render the mail unreadable):

https://pastebin.com/WpYvqxAA

What are those lines supposed to tell me?

Best regards,
Marc Richter




[1] 
https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html

[2] https://issues.apache.org/jira/browse/CASSANDRA-9618
[3] https://downloads.datastax.com/#bulk-loader
[4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

On 22.04.20 16:15, Marc Richter wrote:

This sounds like a promising way; thank you for bringing this up!

I will see if I can manage it with this approach.

Best regards,
Marc Richter



On 22.04.20 15:38, Durity, Sean R wrote:
I thought this might be a single-time use case request. I think my 
first approach would be to use something like dsbulk to unload the 
data and then reload it into a table designed for the query you want 
to do (as long as you have adequate disk space). I think like a 
DBA/admin first. Dsbulk creates csv files, so you could move that data 
to any kind of database, if you chose.


An alternative approach would be to use a driver that supports paging 
(I think this would be most of them) and write a program to walk the 
data set and output what you need in whatever format you need.


Or, since this is a single node scenario, you could try sstable2json 
to export the sstables (files on disk) into JSON, if that is a more 
workable format for you.


Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

  From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is, considering this, a way to do that?

  > Because you didnt provide a signalid and monthyear, it doesn't know
  > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How to learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data -
move, but I do not know about what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:



On Tue, Apr 21, 2020 at 6:20 AM Marc Richter mailto:m...@marc-richter.info>> wrote:

 Hi everyone,

 I'm very new to Cassandra. I have, however, some experience with 
SQL.



The biggest thing to remember is that Cassandra is designed to scale out

Re: Issues, understanding how CQL works

2020-04-22 Thread Aakash Pandhi
Marc,
In DSE CQL offers option called CAPTURE, which can save output of query to a 
directed file. May be you can use that option to save all values you need in 
that file to see all signalids or whichever columns you need. File may grow big 
based on your dataset, so I am not sure what limit it imposes on file size. But 
if you are selecting 1 or 2 columns it should be fine I assume. 
Here is syntax
CAPTURE

| 
| 
|  | 
CAPTURE

Appends query results to a file.
 |

 |

 |


Sincerely,

Aakash Pandhi
 

On Wednesday, April 22, 2020, 08:38:38 AM CDT, Durity, Sean R 
 wrote:  
 
 I thought this might be a single-time use case request. I think my first 
approach would be to use something like dsbulk to unload the data and then 
reload it into a table designed for the query you want to do (as long as you 
have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv 
files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think 
this would be most of them) and write a program to walk the data set and output 
what you need in whatever format you need.

Or, since this is a single node scenario, you could try sstable2json to export 
the sstables (files on disk) into JSON, if that is a more workable format for 
you.

Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

 From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is, considering this, a way to do that?

 > Because you didnt provide a signalid and monthyear, it doesn't know
 > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How to learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data -
move, but I do not know about what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:
>
>
> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter  <mailto:m...@marc-richter.info>> wrote:
>
>    Hi everyone,
>
>    I'm very new to Cassandra. I have, however, some experience with SQL.
>
>
> The biggest thing to remember is that Cassandra is designed to scale out
> to massive clusters - like thousands of instances. To do that, you can't
> assume it's ever ok to read all of the data, because that doesn't scale.
> So cassandra takes shortcuts / optimizations to make it possible to
> ADDRESS all of that data, but not SCAN it.
>
>
>    I need to extract some information from a Cassandra database that has
>    the following table definition:
>
>    CREATE TABLE tagdata.central (
>    signalid int,
>    monthyear int,
>    fromtime bigint,
>    totime bigint,
>    avg decimal,
>    insertdate bigint,
>    max decimal,
>    min decimal,
>    readings text,
>    PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
>    )
>
>
> What your primary key REALLY MEANS is:
>
> The database on reads and writes will hash(signalid+monthy

Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

This sounds like a promising way; thank you for bringing this up!

I will see if I can manage it with this approach.

Best regards,
Marc Richter



On 22.04.20 15:38, Durity, Sean R wrote:

I thought this might be a single-time use case request. I think my first 
approach would be to use something like dsbulk to unload the data and then 
reload it into a table designed for the query you want to do (as long as you 
have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv 
files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think 
this would be most of them) and write a program to walk the data set and output 
what you need in whatever format you need.

Or, since this is a single node scenario, you could try sstable2json to export 
the sstables (files on disk) into JSON, if that is a more workable format for 
you.

Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

  From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is, considering this, a way to do that?

  > Because you didnt provide a signalid and monthyear, it doesn't know
  > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How to learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data -
move, but I do not know about what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:



On Tue, Apr 21, 2020 at 6:20 AM Marc Richter mailto:m...@marc-richter.info>> wrote:

 Hi everyone,

 I'm very new to Cassandra. I have, however, some experience with SQL.


The biggest thing to remember is that Cassandra is designed to scale out
to massive clusters - like thousands of instances. To do that, you can't
assume it's ever ok to read all of the data, because that doesn't scale.
So cassandra takes shortcuts / optimizations to make it possible to
ADDRESS all of that data, but not SCAN it.


 I need to extract some information from a Cassandra database that has
 the following table definition:

 CREATE TABLE tagdata.central (
 signalid int,
 monthyear int,
 fromtime bigint,
 totime bigint,
 avg decimal,
 insertdate bigint,
 max decimal,
 min decimal,
 readings text,
 PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
 )


What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find
which hosts have the data, then

In each data file, the data for a given (signalid,monthyear) is stored
sorted by fromtime and totime

 The database is already of round about 260 GB in size.
 I now need to know what is the most recent entry in it; the correct
 column to learn this would be "insertdate".

 In SQL I would do something like this:

 SELECT insertdate FROM tagdata.central
 ORDER BY insertdate DESC LIMIT 1;

 In CQL, however, I just can't get it to work.

 

Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
DSBulk also works with JSON...
if transformations of data are complex, I would go with Spark running in
local mode, and process data...

On Wed, Apr 22, 2020 at 3:38 PM Durity, Sean R 
wrote:

> I thought this might be a single-time use case request. I think my first
> approach would be to use something like dsbulk to unload the data and then
> reload it into a table designed for the query you want to do (as long as
> you have adequate disk space). I think like a DBA/admin first. Dsbulk
> creates csv files, so you could move that data to any kind of database, if
> you chose.
>
> An alternative approach would be to use a driver that supports paging (I
> think this would be most of them) and write a program to walk the data set
> and output what you need in whatever format you need.
>
> Or, since this is a single node scenario, you could try sstable2json to
> export the sstables (files on disk) into JSON, if that is a more workable
> format for you.
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> -Original Message-
> From: Marc Richter 
> Sent: Wednesday, April 22, 2020 6:22 AM
> To: user@cassandra.apache.org
> Subject: [EXTERNAL] Re: Issues, understanding how CQL works
>
> Hi Jeff,
>
> thank you for your exhaustive and verbose answer!
> Also, a very big "Thank you!" to all the other replyers; I hope you
> understand that I summarize all your feedback in this single answer.
>
>  From what I understand from your answers, Cassandra seems to be
> optimized to store (and read) data in only exactly that way that the
> data structure has been designed for. That makes it very inflexible, but
> allows it to do that single job very effectively for a trade-off.
>
> I also understand, the more I dig into Cassandra, that the team I am
> supporting is using Cassandra kind of wrong; they for example do have
> only one node and so do not use neither the load-balancing, nor the
> redundancy-capabilities Cassandra offers.
> Thus, maybe relevant side-note: All the data resides on just one single
> node; maybe that info is important, because we know on which node the
> data is (I know that Cassandra internally is applying the same Hashing -
> Voodoo as if there were 1k nodes, but maybe this is important anyways).
>
> Anyways: I do not really care if a query or effort to find this
> information is sub-optimal or very "expensive" in means of effectivity
> or system-load, since this isn't something that I need to extract on a
> regular basis, but only once. Due to that, it doesn't need to be optimal
> or effective; I also do not care if it blocks the node for several
> hours, since Cassandra is only working on this single request. I really
> need this info (most recent "insertdate") only once.
> Is, considering this, a way to do that?
>
>  > Because you didnt provide a signalid and monthyear, it doesn't know
>  > which machine in your cluster to use to start the query.
>
> I know this already; thanks for confirming that I got this correct! But
> what do I do then if I do not know all "signalid"s? How to learn them?
>
> Is it maybe possible to get a full list of all "signalid"s? Or is it
> possible to "re-arrange" the data in the cluster or something that
> enables me to learn what's the most recent "insertdate"?
> I really do not care if I need to do some expensive copy-all-data -
> move, but I do not know about what is possible and how to do that.
>
> Best regards,
> Marc Richter
>
> On 21.04.20 19:20, Jeff Jirsa wrote:
> >
> >
> > On Tue, Apr 21, 2020 at 6:20 AM Marc Richter  > <mailto:m...@marc-richter.info>> wrote:
> >
> > Hi everyone,
> >
> > I'm very new to Cassandra. I have, however, some experience with SQL.
> >
> >
> > The biggest thing to remember is that Cassandra is designed to scale out
> > to massive clusters - like thousands of instances. To do that, you can't
> > assume it's ever ok to read all of the data, because that doesn't scale.
> > So cassandra takes shortcuts / optimizations to make it possible to
> > ADDRESS all of that data, but not SCAN it.
> >
> >
> > I need to extract some information from a Cassandra database that has
> > the following table definition:
> >
> > CREATE TABLE tagdata.central (
> > signalid int,
> > monthyear int,
> > fromtime bigint,
> > totime bigint,
> > avg decimal,
> > insertdate bigint,
> > max decimal,
> > min decimal,
> > readings text,
> > PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> > )
> >
> >
> 

RE: Issues, understanding how CQL works

2020-04-22 Thread Durity, Sean R
I thought this might be a single-time use case request. I think my first 
approach would be to use something like dsbulk to unload the data and then 
reload it into a table designed for the query you want to do (as long as you 
have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv 
files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think 
this would be most of them) and write a program to walk the data set and output 
what you need in whatever format you need.

Or, since this is a single node scenario, you could try sstable2json to export 
the sstables (files on disk) into JSON, if that is a more workable format for 
you.

Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

 From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they for example do have
only one node and so do not use neither the load-balancing, nor the
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in means of effectivity
or system-load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or effective; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is, considering this, a way to do that?

 > Because you didnt provide a signalid and monthyear, it doesn't know
 > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How to learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data -
move, but I do not know about what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:
>
>
> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter  <mailto:m...@marc-richter.info>> wrote:
>
> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
>
> The biggest thing to remember is that Cassandra is designed to scale out
> to massive clusters - like thousands of instances. To do that, you can't
> assume it's ever ok to read all of the data, because that doesn't scale.
> So cassandra takes shortcuts / optimizations to make it possible to
> ADDRESS all of that data, but not SCAN it.
>
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
>
> What your primary key REALLY MEANS is:
>
> The database on reads and writes will hash(signalid+monthyear) to find
> which hosts have the data, then
>
> In each data file, the data for a given (signalid,monthyear) is stored
> sorted by fromtime and totime
>
> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get i

Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you 
understand that I summarize all your feedback in this single answer.


From what I understand from your answers, Cassandra seems to be 
optimized to store (and read) data in only exactly that way that the 
data structure has been designed for. That makes it very inflexible, but 
allows it to do that single job very effectively for a trade-off.


I also understand, the more I dig into Cassandra, that the team I am 
supporting is using Cassandra kind of wrong; they for example do have 
only one node and so do not use neither the load-balancing, nor the 
redundancy-capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single 
node; maybe that info is important, because we know on which node the 
data is (I know that Cassandra internally is applying the same Hashing - 
Voodoo as if there were 1k nodes, but maybe this is important anyways).


Anyways: I do not really care if a query or effort to find this 
information is sub-optimal or very "expensive" in means of effectivity 
or system-load, since this isn't something that I need to extract on a 
regular basis, but only once. Due to that, it doesn't need to be optimal 
or effective; I also do not care if it blocks the node for several 
hours, since Cassandra is only working on this single request. I really 
need this info (most recent "insertdate") only once.

Is, considering this, a way to do that?

> Because you didnt provide a signalid and monthyear, it doesn't know
> which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But 
what do I do then if I do not know all "signalid"s? How to learn them?


Is it maybe possible to get a full list of all "signalid"s? Or is it 
possible to "re-arrange" the data in the cluster or something that 
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data - 
move, but I do not know about what is possible and how to do that.


Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:



On Tue, Apr 21, 2020 at 6:20 AM Marc Richter > wrote:


Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.


The biggest thing to remember is that Cassandra is designed to scale out 
to massive clusters - like thousands of instances. To do that, you can't 
assume it's ever ok to read all of the data, because that doesn't scale. 
So cassandra takes shortcuts / optimizations to make it possible to 
ADDRESS all of that data, but not SCAN it.



I need to extract some information from a Cassandra database that has
the following table definition:

CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)


What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find 
which hosts have the data, then


In each data file, the data for a given (signalid,monthyear) is stored 
sorted by fromtime and totime


The database is already of round about 260 GB in size.
I now need to know what is the most recent entry in it; the correct
column to learn this would be "insertdate".

In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;


Because you didnt provide a signalid and monthyear, it doesn't know 
which machine in your cluster to use to start the query.



But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted
by an EQ or an IN.


Because it's designed for potentially petabytes of data per cluster, it 
doesn't believe you really want to walk all the data and order ALL of 
it. Instead, it assumes that when you need to use an ORDER BY, you're 
going to have some very small piece of data - confined to a single 
signalid/monthyear pair. And even then, the ORDER is going to assume 
that you're ordering it by the ordering keys you've defined - fromtime  
first, and then totime.


So you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

  SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY 
fromtime DESC


But you have to work within the boundaries of how the data is stored. 
It's stored grouped by signalid+monthyear, and then sorted by fromtime, 
and 

Re: Issues, understanding how CQL works

2020-04-22 Thread Pekka Enberg
Hi Marc,

On Tue, Apr 21, 2020 at 4:20 PM Marc Richter  wrote:

> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>

As others have already pointed out, you need to design your data model to
support the queries you need. CQL is not SQL and you cannot query the data
in arbitrary ways if your data model does not support it, at least not
efficiently.

Although the context is DynamoDB, I have found the following presentation
by Rick Houlihan to be pretty good on this topic and more or less
applicable to Cassandra. The part about "NoSQL data modeling" starts at
22:45 mark:

https://www.youtube.com/watch?v=HaEPXoXVf2k=youtu.be=1363

Hope this helps!

- Pekka


Re: Issues, understanding how CQL works

2020-04-21 Thread Voytek Jarnot
As I learned the hard way (and has already been implied), design your
tables to support your queries.

We have, for example, 9 tables storing the same data, because users wish to
query in different ways. Could be several more tables (if one was being a
purist), but indexes get us the rest of the way there.

On Tue, Apr 21, 2020 at 8:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>
> So, after some trial and error and a lot of Googling, I learned that I
> must include all rows from the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" column, I crafted like a fool by incrementing the date
> one month after another until no results could be found anymore.
> The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> query results. But these can't be as easily guessed as the "monthyear"
> values could.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go. I think there is
> something more mature in modern IT systems. Can anyone tell me what is a
> better way to get these informations?
>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said: The DB is round
> about 260 GB which makes it next to impossible to just "have a look" at
> the output of "SELECT *"..
>
> Thanks for your help!
>
> Best regards,
> Marc Richter
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Issues, understanding how CQL works

2020-04-21 Thread Jeff Jirsa
On Tue, Apr 21, 2020 at 6:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>

The biggest thing to remember is that Cassandra is designed to scale out to
massive clusters - like thousands of instances. To do that, you can't
assume it's ever ok to read all of the data, because that doesn't scale. So
cassandra takes shortcuts / optimizations to make it possible to ADDRESS
all of that data, but not SCAN it.


>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
>
What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find
which hosts have the data, then

In each data file, the data for a given (signalid,monthyear) is stored
sorted by fromtime and totime



> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>

Because you didnt provide a signalid and monthyear, it doesn't know which
machine in your cluster to use to start the query.


>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>

Because it's designed for potentially petabytes of data per cluster, it
doesn't believe you really want to walk all the data and order ALL of it.
Instead, it assumes that when you need to use an ORDER BY, you're going to
have some very small piece of data - confined to a single
signalid/monthyear pair. And even then, the ORDER is going to assume that
you're ordering it by the ordering keys you've defined - fromtime  first,
and then totime.

So you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

 SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
fromtime DESC

But you have to work within the boundaries of how the data is stored. It's
stored grouped by signalid+monthyear, and then sorted by fromtime, and then
sorted by totime.



> So, after some trial and error and a lot of Googling, I learned that I
> must include all rows from the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" column, I crafted like a fool by incrementing the date
> one month after another until no results could be found anymore.
> The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> query results. But these can't be as easily guessed as the "monthyear"
> values could.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go. I think there is
> something more mature in modern IT systems. Can anyone tell me what is a
> better way to get these informations?
>

You can denormalize. Because cassandra allows you to have very large
clusters, you can make multiple tables sorted in different ways to enable
the queries you need to run. Normal data modeling is to build tables based
on the SELECT statements you need to do (unless you're very advanced, in
which case you do it based on the transaction semantics of the
INSERT/UPDATE statements, but that's probably not you).

Or you can use a more flexible database.


>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said: The DB is round
> about 260 GB which makes it next to impossible to just "have a look" at
> the output of "SELECT *"..
>

You probably want to keep another table of monthyear + signalid pairs.


Re: Issues, understanding how CQL works

2020-04-21 Thread Reid Pinchback
Marc, have you had any exposure to DynamoDB at all?  The API approach is 
different, but the fundamental concepts are similar.  That’s actually a better 
reference point to have than an RDBMS, because really it’s a small subset of 
usage patterns that would overlap with CQL.  If you were, for example, dealing 
with databases that did a lot of table partitions and supported apps that 
focused bulk loads and analytics on a partition level, then you would be in a 
space somewhat similar to C*.

C* is at its best when your common usage pattern, at least on reads, is 
effectively “I want a bunch of stuff, so you may as well give it to me by the 
bunch… what I do with the bunch after is my problem”.  That’s very different 
from an RDBMS, which historically has always tried to find some balance between 
minimizing disk I/O and network I/O… but if it takes developers a lot more head 
scratching to get there, it was considered an acceptable investment to help 
scale the usage of an expensive resource.

As a result, language features for the two cases are quite different.

From: Elliott Sims 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, April 21, 2020 at 12:13 PM
To: "user@cassandra.apache.org" 
Subject: Re: Issues, understanding how CQL works

Message from External Sender
The short answer is that CQL isn't SQL.  It looks a bit like it, but the 
structure of the data is totally different.  Essentially (ignoring secondary 
indexes, which have some issues in practice and I think are generally not 
recommended) the only way to look the data up is by the partition key.  
Anything else is a full-table scan and if you need more querying flexibility 
Cassandra is probably not your best option.   With only 260GB, I think I'd lean 
towards suggesting PostgreSQL or MySQL.

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter 
mailto:m...@marc-richter.info>> wrote:
Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.

I need to extract some information from a Cassandra database that has
the following table definition:

CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)

The database is already of round about 260 GB in size.
I now need to know what is the most recent entry in it; the correct
column to learn this would be "insertdate".

In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;

But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted
by an EQ or an IN.

So, after some trial and error and a lot of Googling, I learned that I
must include all rows from the PRIMARY KEY from left to right in my
query. Thus, this is the "best" I can get to work:


SELECT
*
FROM
"tagdata.central"
WHERE
"signalid" = 4002
AND "monthyear" = 201908
ORDER BY
"fromtime" DESC
LIMIT 10;


The "monthyear" column, I crafted like a fool by incrementing the date
one month after another until no results could be found anymore.
The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
query results. But these can't be as easily guessed as the "monthyear"
values could.

This is where I'm stuck!

1. This does not really feel like the ideal way to go. I think there is
something more mature in modern IT systems. Can anyone tell me what is a
better way to get these informations?

2. I need a way to learn all values that are in the "monthyear" and
"signalid" columns in order to be able to craft that query.
How can I achieve that in a reasonable way? As I said: The DB is round
about 260 GB which makes it next to impossible to just "have a look" at
the output of "SELECT *"..

Thanks for your help!

Best regards,
Marc Richter


-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org<mailto:user-unsubscr...@cassandra.apache.org>
For additional commands, e-mail: 
user-h...@cassandra.apache.org<mailto:user-h...@cassandra.apache.org>


Re: Issues, understanding how CQL works

2020-04-21 Thread Elliott Sims
The short answer is that CQL isn't SQL.  It looks a bit like it, but the
structure of the data is totally different.  Essentially (ignoring
secondary indexes, which have some issues in practice and I think are
generally not recommended) the only way to look the data up is by the
partition key.  Anything else is a full-table scan and if you need more
querying flexibility Cassandra is probably not your best option.   With
only 260GB, I think I'd lean towards suggesting PostgreSQL or MySQL.

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>
> So, after some trial and error and a lot of Googling, I learned that I
> must include all rows from the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" column, I crafted like a fool by incrementing the date
> one month after another until no results could be found anymore.
> The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> query results. But these can't be as easily guessed as the "monthyear"
> values could.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go. I think there is
> something more mature in modern IT systems. Can anyone tell me what is a
> better way to get these informations?
>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said: The DB is round
> about 260 GB which makes it next to impossible to just "have a look" at
> the output of "SELECT *"..
>
> Thanks for your help!
>
> Best regards,
> Marc Richter
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>