Re: Timeout error in fetching million rows as results using clustering keys
The rendering tool renders a portion of a very large image. It may fetch different data each time from billions of rows, so I don't think I can cache such large results, since the same results will rarely be fetched again. Also, do you know how I can do 2D range queries using Cassandra? Some other users suggested using Solr, but is there any way I can achieve that without using any other technology?

On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar ali.rac...@gmail.com wrote:

Sorry, meant to say that way when you have to render, you can just display the latest cache.

On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar ali.rac...@gmail.com wrote:

I would probably do this in a background thread and cache the results; that way when you have to render, you can just cache the latest results. I don't know why Cassandra can't seem to fetch large batch sizes; I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me.

On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

We have a UI which needs this data for rendering, so the efficiency of pulling this data matters a lot; it should be fetched within a minute. Is there a way to achieve such efficiency?

On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar ali.rac...@gmail.com wrote:

Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while?

On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Yes, it works for 1000 but not more than that. How can I fetch all rows using this efficiently?

On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote:

Have you tried a smaller fetch size, such as 5k - 2k?

On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Hi Jens,

I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra can handle a million rows easily.
Is there any mistake in the way I am defining the keys or querying them?

Thanks,
Mehak

On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote:

Hi,

Try setting fetchsize before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it.

Cheers,
Jens
– Sent from Mailbox https://www.dropbox.com/mailbox

On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Hi,

I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is:

    CREATE TABLE images.results (
        uuid uuid,
        analysis_execution_id varchar,
        analysis_execution_uuid uuid,
        x double,
        y double,
        loc varchar,
        w double,
        h double,
        normalized varchar,
        type varchar,
        filehost varchar,
        filename varchar,
        image_uuid uuid,
        image_uri varchar,
        image_caseid varchar,
        image_mpp_x double,
        image_mpp_y double,
        image_width double,
        image_height double,
        objective double,
        cancer_type varchar,
        Area float,
        submit_date timestamp,
        points list<double>,
        PRIMARY KEY ((image_caseid), Area, uuid)
    );

Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the Java DataStax API to fetch the results.
But the query is taking a lot of time, resulting in timeout errors:

    Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
        at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
        at QueryDB.queryArea(TestQuery.java:59)
        at TestQuery.main(TestQuery.java:35)
    Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
        at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
        at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
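For reference on the "I can change the timeout settings" remark above: in Cassandra 2.x these server-side timeouts live in cassandra.yaml (the values shown are the 2.1 defaults; check your own version's file, since raising them only masks the cost of pulling a huge result in one request):

```yaml
# cassandra.yaml -- server-side request timeouts (Cassandra 2.x defaults)
read_request_timeout_in_ms: 5000     # single-partition reads
range_request_timeout_in_ms: 10000   # range scans
request_timeout_in_ms: 10000         # catch-all for other operations
```

Note that the DataStax Java driver also applies its own per-request read timeout on the client side (12 seconds by default, configurable via SocketOptions), so a timeout can fire on either end; as Jens's fetchsize advice suggests, paging the result is usually the real fix rather than raising either limit.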
Re: Timeout error in fetching million rows as results using clustering keys
Data won't change much, but the queries will be different. I am not working on the rendering tool myself, so I don't know many details about it. Also, as suggested by you, I tried to fetch data in sizes of 500 or 1000 with the Java driver's auto pagination. It fails when the number of records is high (around 10) with the following error:

    Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))

On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar ali.rac...@gmail.com wrote:

How often does the data change? I would still recommend caching of some kind, but without knowing more details (how often the data is changing, what you're doing with the 1m rows after getting them, etc.) I can't recommend a solution. I did see your other thread. I would also vote for Elasticsearch / Solr; they are more suited to the kind of analytics you seem to be doing. Cassandra is more for storing data; it isn't all that great for complex queries / analytics. If you want to stick with Cassandra, you might have better luck if you made your range columns part of the primary key, so something like PRIMARY KEY(caseId, x, y).
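Regarding Ali's PRIMARY KEY(caseId, x, y) suggestion, a sketch may help show what CQL does and doesn't allow (the table below is hypothetical, not the poster's actual schema): with clustering columns (x, y), a range restriction is only permitted on the first clustering column not already fixed by equality, so a 2D bounding box cannot be expressed in a single predicate; the y bounds must be filtered client-side or encoded into bucketed keys.

```sql
-- Hypothetical table for the PRIMARY KEY (caseId, x, y) idea
CREATE TABLE images.results_by_xy (
    image_caseid varchar,
    x double,
    y double,
    uuid uuid,
    PRIMARY KEY ((image_caseid), x, y, uuid)
);

-- Allowed: equality on the partition key plus a range on x.
SELECT * FROM images.results_by_xy
WHERE image_caseid = 'case-1' AND x >= 100 AND x < 200;

-- Not expressible in one query: ranges on both x and y.
-- Once x carries a range restriction, y cannot also be
-- restricted; filter the y bounds on the client instead.
-- SELECT * FROM images.results_by_xy
-- WHERE image_caseid = 'case-1'
--   AND x >= 100 AND x < 200
--   AND y >= 50 AND y < 80;   -- rejected by CQL
```

This is why others in the thread point at Solr/Elasticsearch for true 2D range queries: Cassandra alone only gives an efficient 1D slice per partition.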
Re: schema generation in cassandra
Why are you creating new tables dynamically? I would try to use a static schema and use a collection (list / map / set) for storing arbitrary data.

On Wed, Mar 18, 2015 at 2:52 PM, Ankit Agarwal agarwalankit.k...@gmail.com wrote:

Hi,

I am new to Cassandra. We are planning to use Cassandra for a cloud-based application in our development environment, so I am looking for the best strategies to sync the schema for micro-services while deploying the application on Cloud Foundry. One way I could use is an Accessor interface with the DataStax mapper and the cassandra-core driver.

1.) I have created a keyspace using the core driver, generated on initialization of a servlet:

    public void init() throws ServletException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        String keySpace = "sampletest";
        session.execute("CREATE KEYSPACE IF NOT EXISTS " + keySpace
            + " WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }");
        ...
    }

2.) This is my Accessor interface, which I used to generate the query for creating the column family:

    @Accessor
    public interface UserAccessor {
        @Query("CREATE TABLE sampletest.emp (id uuid PRIMARY KEY, name text, department text, location text, phone bigint) "
            + "WITH caching = '{ \"keys\" : \"ALL\", \"rows_per_partition\" : \"NONE\" }'")
        ResultSet create_table();
    }

3.) Creating an instance of the Accessor interface to map the query that generates the column family:

    MappingManager mapper = new MappingManager(session);
    UserAccessor ua = mapper.createAccessor(UserAccessor.class);
    ua.create_table();

4.) So far I have created a keyspace with a column family; now I want to map my data using the POJO class below:

    @Table(keyspace = "sampletest", name = "emp")
    public class Employee {
        @PartitionKey
        private UUID id;
        private String name;
        private String department;
        private String location;
        private Long phone;
        // getters and setters
    }

Is there any other better approach to achieve this, especially for a cloud environment?

--
Thanks
Ankit Agarwal
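A minimal sketch of the static-schema-plus-collection approach Ali suggests (the column names here are made up for illustration, not from the thread): one fixed table where a map column absorbs the arbitrary per-record fields, so no DDL has to run at application startup.

```sql
-- Hypothetical static schema: arbitrary attributes go into a map
CREATE TABLE sampletest.emp (
    id uuid PRIMARY KEY,
    name text,
    attrs map<text, text>
);

INSERT INTO sampletest.emp (id, name, attrs)
VALUES (uuid(), 'Jane Doe',
        {'department': 'Billing', 'location': 'Pune', 'phone': '9998887777'});

-- Individual map entries can be added later without any schema change:
UPDATE sampletest.emp SET attrs['team'] = 'Payments' WHERE id = ...;
```

The trade-off is that everything in the map is stored as text (or whatever single value type you pick), so typed queries on those fields need client-side conversion.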
Re: Timeout error in fetching million rows as results using clustering keys
Yeah, it may be that the process is being limited by swap. This page: https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42 — lines 42 - 48 list a few settings that you could try for increasing / reducing the memory limits (assuming you're on Linux). Also, are you using an SSD? If so, make sure the IO scheduler is noop or deadline.

On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Currently the Cassandra java process is taking 1% of CPU (8% total is being used) and 14.3% of memory (out of 4G total). As you can see, there is not much load from other processes. Should I try changing the default memory parameters in the Cassandra settings?

On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar ali.rac...@gmail.com wrote:

What's your memory / CPU usage at? And how much RAM + CPU do you have on this server?

On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Currently there is only a single node, which I am calling directly, with around 15 rows. The full data will be around billions of rows per node. The code is working only for sizes of 100/200, and consecutive fetches are taking around 5-10 secs. I have a parallel script which is inserting data while I am reading it. When I stopped the script, it worked for 500/1000 but not more than that.

On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar ali.rac...@gmail.com wrote:

If even 500-1000 isn't working, then your cassandra node might not be up.

1) Try running nodetool status from a shell on your cassandra server and make sure the nodes are up.
2) Are you calling this on the same server where cassandra is running? It's trying to connect to localhost. If you're running it on a different server, try passing in the direct IP of your cassandra server.
Re: Timeout error in fetching million rows as results using clustering keys
Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while?

On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Yes, it works for 1000 but not more than that. How can I fetch all rows using this efficiently?

On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote:

Have you tried a smaller fetch size, such as 5k - 2k?

On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Hi Jens,

I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra can handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them?

Thanks,
Mehak

On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote:

Hi,

Try setting fetchsize before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it.

Cheers,
Jens
– Sent from Mailbox https://www.dropbox.com/mailbox

On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

Hi,

I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is:

    CREATE TABLE images.results (
        uuid uuid,
        analysis_execution_id varchar,
        analysis_execution_uuid uuid,
        x double,
        y double,
        loc varchar,
        w double,
        h double,
        normalized varchar,
        type varchar,
        filehost varchar,
        filename varchar,
        image_uuid uuid,
        image_uri varchar,
        image_caseid varchar,
        image_mpp_x double,
        image_mpp_y double,
        image_width double,
        image_height double,
        objective double,
        cancer_type varchar,
        Area float,
        submit_date timestamp,
        points list<double>,
        PRIMARY KEY ((image_caseid), Area, uuid)
    );

Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the Java DataStax API to fetch the results. But the query is taking a lot of time, resulting in timeout errors:

    Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
        at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
        at QueryDB.queryArea(TestQuery.java:59)
        at TestQuery.main(TestQuery.java:35)
    Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
        at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
        at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

Also when I try the same query on the console, even while using a limit of 2000 rows:

    cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000;
    errors={}, last_host=127.0.0.1

Thanks and Regards,
Mehak
schema generation in cassandra
Hi,

I am new to Cassandra. We are planning to use Cassandra for a cloud-based application in our development environment, so I am looking for the best strategies to sync the schema for micro-services while deploying the application on Cloud Foundry. One way I could use is an Accessor interface with the DataStax mapper and the cassandra-core driver.

1.) I have created a keyspace using the core driver, generated on initialization of a servlet:

    public void init() throws ServletException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        String keySpace = "sampletest";
        session.execute("CREATE KEYSPACE IF NOT EXISTS " + keySpace
            + " WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }");
        ...
    }

2.) This is my Accessor interface, which I used to generate the query for creating the column family:

    @Accessor
    public interface UserAccessor {
        @Query("CREATE TABLE sampletest.emp (id uuid PRIMARY KEY, name text, department text, location text, phone bigint) "
            + "WITH caching = '{ \"keys\" : \"ALL\", \"rows_per_partition\" : \"NONE\" }'")
        ResultSet create_table();
    }

3.) Creating an instance of the Accessor interface to map the query that generates the column family:

    MappingManager mapper = new MappingManager(session);
    UserAccessor ua = mapper.createAccessor(UserAccessor.class);
    ua.create_table();

4.) So far I have created a keyspace with a column family; now I want to map my data using the POJO class below:

    @Table(keyspace = "sampletest", name = "emp")
    public class Employee {
        @PartitionKey
        private UUID id;
        private String name;
        private String department;
        private String location;
        private Long phone;
        // getters and setters
    }

Is there any other better approach to achieve this, especially for a cloud environment?

--
Thanks
Ankit Agarwal
Re: Timeout error in fetching million rows as results using clustering keys
4g also seems small for the kind of load you are trying to handle (billions of rows, etc). I would also try adding more nodes to the cluster.

On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar ali.rac...@gmail.com wrote:

Yeah, it may be that the process is being limited by swap. This page: https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42 — lines 42 - 48 list a few settings that you could try for increasing / reducing the memory limits (assuming you're on Linux). Also, are you using an SSD? If so, make sure the IO scheduler is noop or deadline.
Re: Timeout error in fetching million rows as results using clustering keys
How often does the data change? I would still recommend caching of some kind, but without knowing more details (how often the data is changing, what you're doing with the 1m rows after getting them, etc.) I can't recommend a solution. I did see your other thread. I would also vote for Elasticsearch / Solr; they are more suited to the kind of analytics you seem to be doing. Cassandra is more for storing data; it isn't all that great for complex queries / analytics. If you want to stick with Cassandra, you might have better luck if you made your range columns part of the primary key, so something like PRIMARY KEY(caseId, x, y).

On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote:

The rendering tool renders a portion of a very large image. It may fetch different data each time from billions of rows, so I don't think I can cache such large results, since the same results will rarely be fetched again. Also, do you know how I can do 2D range queries using Cassandra? Some other users suggested using Solr, but is there any way I can achieve that without using any other technology?
Re: Timeout error in fetching million rows as results using clustering keys
Ya, I have a cluster of 10 nodes in total, but I am just testing with one node currently. The total data across all nodes will exceed 5 billion rows. But I may have more memory on the other nodes. On Wed, Mar 18, 2015 at 6:06 AM, Ali Akhtar ali.rac...@gmail.com wrote: 4g also seems small for the kind of load you are trying to handle (billions of rows), etc. I would also try adding more nodes to the cluster. On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar ali.rac...@gmail.com wrote: Yeah, it may be that the process is being limited by swap. This page: https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42 Lines 42 - 48 list a few settings that you could try out for increasing / reducing the memory limits (assuming you're on linux). Also, are you using an SSD? If so, make sure the IO scheduler is noop or deadline. On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Currently the Cassandra java process is taking 1% of cpu (total 8% is being used) and 14.3% memory (out of 4G total memory). As you can see there is not much load from other processes. Should I try changing the default memory parameters in the Cassandra settings? On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar ali.rac...@gmail.com wrote: What's your memory / CPU usage at? And how much ram + cpu do you have on this server? On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Currently there is only a single node, which I am calling directly, with around 15 rows. The full data will be around billions of rows per node. The code is working only for fetch sizes of 100/200. Also, each consecutive fetch is taking around 5-10 secs. I have a parallel script which is inserting data while I am reading it. When I stopped the script it worked for 500/1000, but not more than that. On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar ali.rac...@gmail.com wrote: If even 500-1000 isn't working, then your cassandra node might not be up.
1) Try running nodetool status from a shell on your cassandra server and make sure the nodes are up. 2) Are you calling this on the same server where cassandra is running? It's trying to connect to localhost. If you're running it on a different server, try passing in the direct ip of your cassandra server. On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: The data won't change much, but the queries will be different. I am not working on the rendering tool myself, so I don't know many details about it. Also, as suggested by you, I tried to fetch data in batches of 500 or 1000 with the java driver's auto pagination. It fails when the number of records is high (around 10) with the following error: Exception in thread main com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar ali.rac...@gmail.com wrote: How often does the data change? I would still recommend caching of some kind, but without knowing more details (how often the data is changing, what you're doing with the 1m rows after getting them, etc.) I can't recommend a solution. I did see your other thread. I would also vote for elasticsearch / solr; they are more suited for the kind of analytics you seem to be doing. Cassandra is more for storing data; it isn't all that great for complex queries / analytics. If you want to stick with cassandra, you might have better luck if you made your range columns part of the primary key, so something like PRIMARY KEY (caseId, x, y) On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: The rendering tool renders a portion of a very large image. It may fetch different data each time from billions of rows, so I don't think I can cache such large results, since the same results will rarely be fetched again.
Also, do you know how I can do 2d range queries using Cassandra? Some other users suggested using Solr, but is there any way I can achieve that without using any other technology? On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar ali.rac...@gmail.com wrote: Sorry, meant to say that that way, when you have to render, you can just display the latest cache. On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar ali.rac...@gmail.com wrote: I would probably do this in a background thread and cache the results; that way when you have to render, you can just use the latest results. I don't know why Cassandra can't seem to be able to fetch large batch sizes; I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me. On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: We have a UI interface which needs this data for rendering, so the efficiency of pulling this data matters a lot. It should be fetched within a minute. Is there a
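The "is the node up" check suggested above, as concrete commands (the client-side host is a placeholder, not from the thread):

```shell
# On the Cassandra server: confirm the node reports itself as Up/Normal (UN).
nodetool status

# From the client machine: confirm the native-protocol port is reachable
# (replace <cassandra-host> with the node's address).
nc -zv <cassandra-host> 9042
```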
Re: Timeout error in fetching million rows as results using clustering keys
We have a UI interface which needs this data for rendering, so the efficiency of pulling this data matters a lot. It should be fetched within a minute. Is there a way to achieve such efficiency? On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar ali.rac...@gmail.com wrote: Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while? On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Yes, it works for 1000 but not more than that. How can I fetch all the rows efficiently? On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote: Have you tried a smaller fetch size, such as 5k - 2k? On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra could handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks, Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it. Cheers, Jens – Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting on clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement.
My table definition is: CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid)); Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the DataStax Java driver to fetch the results, but the query is taking a lot of time, resulting in timeout errors: Exception in thread main com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) at QueryDB.queryArea(TestQuery.java:59) at TestQuery.main(TestQuery.java:35) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108) at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Also, the same query times out in cqlsh even with a limit of 2000 rows: cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000; errors={}, last_host=127.0.0.1 Thanks and Regards, Mehak
Re: Timeout error in fetching million rows as results using clustering keys
I would probably do this in a background thread and cache the results; that way when you have to render, you can just use the latest results. I don't know why Cassandra can't seem to be able to fetch large batch sizes; I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me. On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: We have a UI interface which needs this data for rendering, so the efficiency of pulling this data matters a lot. It should be fetched within a minute. Is there a way to achieve such efficiency? On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar ali.rac...@gmail.com wrote: Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while? On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Yes, it works for 1000 but not more than that. How can I fetch all the rows efficiently? On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote: Have you tried a smaller fetch size, such as 5k - 2k? On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra could handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks, Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it. Cheers, Jens – Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting on clustering columns, so why are the queries taking so long?
I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is: CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid)); Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the DataStax Java driver to fetch the results, but the query is taking a lot of time, resulting in timeout errors: Exception in thread main com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) at QueryDB.queryArea(TestQuery.java:59) at TestQuery.main(TestQuery.java:35) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108) at
com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Also, the same query times out in cqlsh even with a limit of 2000 rows: cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000; errors={}, last_host=127.0.0.1 Thanks and Regards, Mehak
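Ali's background-refresh idea above can be sketched with plain JDK classes; the class and method names here are invented for illustration, and the String payload stands in for whatever row type the fetch produces:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Refreshes query results on a background thread; the render path only
 *  ever reads the latest completed snapshot and never blocks on the database. */
public class ResultCache {
    private final AtomicReference<List<String>> latest =
            new AtomicReference<>(List.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs one refresh now, keeping the old snapshot if the fetch fails. */
    public void refreshNow(Callable<List<String>> fetch) {
        try {
            latest.set(fetch.call());
        } catch (Exception e) {
            // keep serving the stale snapshot if the fetch times out
        }
    }

    /** Schedules periodic refreshes in the background. */
    public void start(Callable<List<String>> fetch, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> refreshNow(fetch),
                0, periodSeconds, TimeUnit.SECONDS);
    }

    public List<String> snapshot() { return latest.get(); }

    public void stop() { scheduler.shutdownNow(); }

    public static void main(String[] args) {
        ResultCache cache = new ResultCache();
        cache.refreshNow(() -> List.of("tile-1", "tile-2"));
        System.out.println(cache.snapshot().size()); // prints 2
        cache.stop();
    }
}
```

The render thread calls snapshot() and is never exposed to driver timeouts; a failed refresh simply leaves the previous result in place.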
Re: Timeout error in fetching million rows as results using clustering keys
Sorry, I meant to say that that way, when you have to render, you can just display the latest cache. On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar ali.rac...@gmail.com wrote: I would probably do this in a background thread and cache the results; that way when you have to render, you can just use the latest results. I don't know why Cassandra can't seem to be able to fetch large batch sizes; I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me. On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: We have a UI interface which needs this data for rendering, so the efficiency of pulling this data matters a lot. It should be fetched within a minute. Is there a way to achieve such efficiency? On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar ali.rac...@gmail.com wrote: Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while? On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Yes, it works for 1000 but not more than that. How can I fetch all the rows efficiently? On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote: Have you tried a smaller fetch size, such as 5k - 2k? On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra could handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks, Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it.
Cheers, Jens – Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting on clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is: CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid)); Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the DataStax Java driver to fetch the results.
But the query is taking a lot of time, resulting in timeout errors: Exception in thread main com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) at QueryDB.queryArea(TestQuery.java:59) at TestQuery.main(TestQuery.java:35) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108) at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Also, the same query times out in cqlsh even with a limit of 2000 rows: cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000; errors={}, last_host=127.0.0.1 Thanks and Regards, Mehak
Re: Timeout error in fetching million rows as results using clustering keys
Cassandra can certainly handle millions and even billions of rows, but... it is a very clear anti-pattern to design a single query to return more than a relatively small number of rows except through paging. How small? Low hundreds is probably a reasonable limit. It is also an anti-pattern to filter or analyze a large number of rows in a single query - that's why there are so many crazy restrictions and the requirement to use ALLOW FILTERING - to reinforce that Cassandra is designed for short and performant queries, not large-scale retrieval of a large number of rows. As a general rule, the use of ALLOW FILTERING is an anti-pattern and a yellow flag that you are doing something wrong. As a minor point, check your partition key - you should try to bucket rows that will tend to be accessed together so that they have locality and can be fetched together. Rather than using a raw x and y coordinate range, consider indexing by a chunk number; then you can query by chunk number for direct access to the partition and row key, without the need for inequality filtering. -- Jack Krupansky On Wed, Mar 18, 2015 at 3:22 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra could handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks, Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it. Cheers, Jens – Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting on clustering columns, so why are the queries taking so long?
I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is: CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid)); Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the DataStax Java driver to fetch the results, but the query is taking a lot of time, resulting in timeout errors: Exception in thread main com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) at QueryDB.queryArea(TestQuery.java:59) at TestQuery.main(TestQuery.java:35) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108) at
com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Also, the same query times out in cqlsh even with a limit of 2000 rows: cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000; errors={}, last_host=127.0.0.1 Thanks and Regards, Mehak
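Jack's chunk-number suggestion amounts to mapping each (x, y) point to a grid cell and making the cell id part of the key. A self-contained sketch of that mapping; the cell size and id encoding here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Maps (x, y) coordinates to grid-cell ids so a 2D range query becomes a
 *  handful of direct partition lookups instead of inequality filtering. */
public class GridBucket {
    static final double CELL = 1000.0;   // cell edge length; tune per image
    static final long AXIS = 1_000_000L; // assumes fewer than 1e6 cells per axis

    static long cellId(double x, double y) {
        return (long) Math.floor(x / CELL) * AXIS + (long) Math.floor(y / CELL);
    }

    /** All cell ids overlapping the rectangle [x1,x2] x [y1,y2]; each id can
     *  be queried as its own partition, in parallel. */
    static List<Long> cellsForRange(double x1, double y1, double x2, double y2) {
        List<Long> cells = new ArrayList<>();
        for (long cx = (long) Math.floor(x1 / CELL);
                cx <= (long) Math.floor(x2 / CELL); cx++) {
            for (long cy = (long) Math.floor(y1 / CELL);
                    cy <= (long) Math.floor(y2 / CELL); cy++) {
                cells.add(cx * AXIS + cy);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // A 1500x1500 viewport at the origin overlaps 2x2 = 4 cells.
        System.out.println(cellsForRange(0, 0, 1500, 1500)); // prints [0, 1, 1000000, 1000001]
    }
}
```

With a schema like PRIMARY KEY ((image_caseid, cell_id), uuid), rendering a viewport means computing cellsForRange for the visible rectangle and issuing one small query per cell.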
Not seeing keyspace in nodetool compactionhistory
When I run nodetool compactionhistory, I'm only seeing the system keyspace and the OpsCenter keyspace in the compactions. I see only one mention of my own keyspace, and it's for the smallest table within that keyspace (containing only about 1k rows). My two other tables, containing 1.1m and 100k rows respectively, are nowhere to be seen. Any reason why that is? I did fill up the data in those two tables within the span of about 4 hours (I ran a script to migrate existing data from legacy rdbms dbs). Could that have something to do with it? I'm using SizeTieredCompactionStrategy for all tables.
Re: Timeout error in fetching million rows as results using clustering keys
From your description, it sounds like you have a single partition key with millions of clustered values on the same partition. That's a very wide partition. You may very likely be causing a lot of memory pressure in your Cassandra node (especially at 4G) while trying to execute the query. Although the hard upper limit is 2 billion values per partition key, the practical limit is much lower, sometimes more like 100k. Also with very wide partitions, you cannot take advantage of Cassandra's distributed nature for reads: only one node will be involved in the read, so one node will perform as well as a million nodes. If bounding by area is a common task, then it might make sense to put area, or at least part of area, into the partition key (bucket by area / 10 or / 100 or something) just to distribute the data around your cluster a little better. It makes your query path a little more involved, but it buys you parallelism (you could execute all area buckets in a given query simultaneously, and if your cluster is large enough, typically only one node is involved for each area bucket). I wonder what your write pattern is like to fill in the data for a given case ID. Are you appending to the same partition key over a long period of time? If so, you may be scattering the data for a given partition key over a large number of SSTables, and slowing down the read dramatically. If you're using size tiered compaction, run nodetool compact on that table and wait for the node to settle down (0 outstanding/pending tasks in nodetool compactionstats), then see if performance improves (you may also be able to use nodetool cfhistograms to see how many sstables are typically involved in a read, but if all your queries are timing out, I'm not sure if that will be an accurate reflection or not). You wrote: "It may fetch different data each time from billions of rows" and "My expectations were that Cassandra can handle a million rows easily."
I have a data set several orders of magnitude larger than what you're talking about WRT your final data size, and with appropriate query and storage patterns, Cassandra can definitely handle this kind of data. One final note: your column names are pretty long. You pay to store each column name each time you store that column. On small data sets it doesn't matter, but at billions of rows it starts to add up. There's negligible (but nonzero) performance cost, but over time you may find that you have to scale out just because you're filling up disks. See http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html On Wed, Mar 18, 2015 at 6:19 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Cassandra can certainly handle millions and even billions of rows, but... it is a very clear anti-pattern to design a single query to return more than a relatively small number of rows except through paging. How small? Low hundreds is probably a reasonable limit. It is also an anti-pattern to filter or analyze a large number of rows in a single query - that's why there are so many crazy restrictions and the requirement to use ALLOW FILTERING - to reinforce that Cassandra is designed for short and performant queries, not large-scale retrieval of a large number of rows. As a general rule, the use of ALLOW FILTERING is an anti-pattern and a yellow flag that you are doing something wrong. As a minor point, check your partition key - you should try to bucket rows that will tend to be accessed together so that they have locality and can be fetched together. Rather than using a raw x and y coordinate range, consider indexing by a chunk number; then you can query by chunk number for direct access to the partition and row key, without the need for inequality filtering.
-- Jack Krupansky On Wed, Mar 18, 2015 at 3:22 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra could handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks, Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it. Cheers, Jens – Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting on clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is: CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w
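The area-bucketing suggestion above, as a hypothetical alternative table; the table name, the bucket width of 10, and the trimmed column list are all illustrative, not from the thread:

```sql
-- One case's rows spread over many partitions: the partition key gains an
-- area_bucket computed client-side, e.g. (int) Math.floor(area / 10).
CREATE TABLE images.results_by_area (
    image_caseid varchar,
    area_bucket  int,
    area         float,
    uuid         uuid,
    x            double,
    y            double,
    -- ...remaining payload columns as in images.results
    PRIMARY KEY ((image_caseid, area_bucket), area, uuid)
);

-- A range such as 20 < area < 100 touches buckets 2..10; each SELECT hits
-- a single partition, and the statements can be issued in parallel.
SELECT * FROM images.results_by_area
 WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1'
   AND area_bucket = 3
   AND area > 20 AND area < 100;
```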
Recommended TTL time for max. performance with DateTieredCompactionStrategy?
I have a table which is going to store temporary search results. The results will be available for a short time (anywhere from 1 to 24 hours) from the time of the search, and then should be deleted to free up disk space. This applies to all the rows within the table. What would be the recommended TTL for this table, so that it works best with DateTieredCompactionStrategy and causes whole SSTables to get deleted rather than keeping tombstones? Thanks.
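A sketch of how a table-level TTL can line up with DateTieredCompactionStrategy (the table name and columns are invented). The specific TTL value likely matters less than its uniformity: when every row carries the same table-level TTL and rows are never updated, the rows in each SSTable expire together, so the whole file can be dropped instead of compacting tombstones:

```sql
CREATE TABLE search_results (
    search_id uuid,
    rank      int,
    result    text,
    PRIMARY KEY (search_id, rank)
) WITH default_time_to_live = 86400   -- 24h, the upper bound mentioned
  AND compaction = {'class': 'DateTieredCompactionStrategy'};
```

Note that default_time_to_live applies only to writes that don't set their own TTL; mixing per-write TTLs from 1h to 24h in the same table means an SSTable is held until its longest-lived row expires, so a single uniform TTL (or separate tables per lifetime) serves this goal best.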
RE: Problems after trying a migration
Hi Fabien, Thank you for the link! That's exactly what we want to do. But before starting this, we need to clean up the mess in order to get a clean cluster. Thanks for your help. Best regards, David CHARBONNIER Sysadmin T : +33 411 934 200 david.charbonn...@rgsystem.com ZAC Aéroport 125 Impasse Adam Smith 34470 Pérols - France www.rgsystem.com From: Fabien Rousseau [mailto:fab...@yakaz.com] Sent: Wednesday, March 18, 2015 17:32 To: user Subject: Re: Problems after trying a migration Hi David, There is an excellent article which describes exactly what you want to do (ie migrate from one DC to another DC): http://planetcassandra.org/blog/cassandra-migration-to-ec2/ 2015-03-18 17:05 GMT+01:00 David CHARBONNIER david.charbonn...@rgsystem.com: Hi, We're using Cassandra through the Datastax Enterprise package in version 4.5.1 (Cassandra version 2.0.8.39) with 7 nodes in a single datacenter. We need to move our Cassandra cluster from France to another country. To do this, we want to add a second 7-node datacenter to our cluster and stream all data between the two countries before dropping the first datacenter. On January 31st, we tried doing so but we had some problems: - The new nodes in the other country were installed like the French nodes except for the Datastax Enterprise version (4.5.1 in France and 4.6.1 in the other country, which means Cassandra version 2.0.8.39 in France and 2.0.12.200 in the other country) - The following procedure was followed: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html but an error occurred during step 3. The new nodes were started before the cassandra-topology.properties file was updated on the original datacenter, so the new nodes appeared in the original datacenter instead of the new one.
- To recover our original cluster, we decommissioned every node of the new datacenter with the nodetool decommission command. On February 9th, the nodes in the second datacenter were restarted and joined the cluster. We had to decommission them just like before. On February 11th, we added disk space on our 7 running French nodes. To achieve this, we restarted the cluster, but the nodes updated their peering information and the nodes from Luxembourg (decommissioned on February 9th) were present. This behaviour is described here: https://issues.apache.org/jira/browse/CASSANDRA-7825. So we cleaned the system.peers table content. On March 11th, we needed to add an 8th node to our existing French cluster. We installed the same Datastax Enterprise version (4.5.1 with Cassandra 2.0.8.39) and tried to add this node to the cluster with this procedure: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html. In OpsCenter, the node was joining the cluster and data streaming got stuck at 100%. After several hours, nodetool status showed us that the node was still joining, but nothing in the logs let us know there was a problem. We restarted the node but it had no effect. Then we cleaned the data and commitlog contents and tried to add the node to the cluster again, but without result. The last try was to add the node with auto_bootstrap: false in order to add the node to the cluster manually, but it messed up the data. So we shut down the node and decommissioned it (with nodetool removenode). The whole cluster has been repaired and we stopped doing anything. Now, our cluster has only 7 French nodes and we can't add any node. The OpsCenter data has disappeared and we work without any information about how our cluster is running. You'll find attached to this email our current configuration and a screenshot of our OpsCenter metrics page.
Do you have some idea on how to clean up the mess and get our cluster running cleanly before we start our migration (France to another country like described in the beginning of this email)? Thank you. Best regards, [cid:image001.png@01D061A4.2E073720] David CHARBONNIER Sysadmin T : +33 411 934 200 david.charbonn...@rgsystem.commailto:david.charbonn...@rgsystem.com ZAC Aéroport 125 Impasse Adam Smith 34470 Pérols - France www.rgsystem.comhttp://www.rgsystem.com/ [cid:image004.png@01D061A4.2E073720] -- Fabien Rousseau [http://www.yakaz.com/img/logo_yakaz_small.png] www.yakaz.comhttp://www.yakaz.com/
Re: Problems after trying a migration
Hi David, There is an excellent article which describes exactly what you want to do (i.e. migrate from one DC to another DC): http://planetcassandra.org/blog/cassandra-migration-to-ec2/ 2015-03-18 17:05 GMT+01:00 David CHARBONNIER david.charbonn...@rgsystem.com: [...] -- Fabien Rousseau aur...@yakaz.com www.yakaz.com
Saving a file using cassandra
Hello, Finally, I have created my ring using Cassandra. Please, I'd like to store a file replicated 2 times in my cluster. Is that possible? Can you please send me a link to a tutorial? Thanks a lot. Best Regards.
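[Editor's note: replication count is a property of the keyspace, not of individual writes, so storing a file in a keyspace with replication_factor 2 keeps it on two nodes. A minimal CQL sketch; the keyspace, table, and column names here are hypothetical, not from any tutorial:

```sql
-- Hypothetical keyspace replicated on 2 nodes
CREATE KEYSPACE IF NOT EXISTS filestore
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };

-- Hypothetical table storing file contents as a blob
CREATE TABLE filestore.files (
    filename text PRIMARY KEY,
    content  blob
);
```

Blob values are read whole into memory, so this suits small files; large files are usually split into chunks across multiple rows.]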
Upgrade from 1.2.19 to 2.0.12 -- seeing lots of SliceQueryFilter messages in system.log
After upgrading a 3 node Cassandra cluster from 1.2.19 to 2.0.12, I have an event storm of SliceQueryFilter messages flooding the Cassandra system.log file.

WARN [ReadStage:1043] 2015-03-18 15:14:12,708 SliceQueryFilter.java (line 231) Read 201 live and 13539 tombstoned cells in KeyspaceMetadata.CF_Folder (see tombstone_warn_threshold). 200 columns was requested, slices=[154184c2-85c1-11e2-b12e-c2ed2ac02b21-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647, ranges=[dc70cafe-ed8a-11e2-a178-5756012ec923-dc70cafe-ed8a-11e2-a178-5756012ec923:!, deletedAt=1424741296925196, localDeletion=1424741340][82bcb57a-ed8c-11e2-8fbd-3fb065c6b097-82bcb57a-ed8c-11e2-8fbd-3fb065c6b097:!, deletedAt=1424741296925196,...

This is the table definition referenced above:

CREATE TABLE CF_Folder (
  key blob,
  column1 timeuuid,
  column2 blob,
  value blob,
  PRIMARY KEY ((key), column1, column2)
) WITH COMPACT STORAGE
  AND bloom_filter_fp_chance=0.10
  AND dclocal_read_repair_chance=0.00
  AND gc_grace_seconds=518400
  AND read_repair_chance=0.10
  AND default_time_to_live=0
  AND speculative_retry='99.0PERCENTILE'
  AND compaction={'sstable_size_in_mb': '160', 'class': 'LeveledCompactionStrategy'}
  AND compression={'sstable_compression': 'SnappyCompressor'};

How can I stop this event storm? Thanks, Rafael Caraballo Time Warner Cable This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. 
If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.
Re: Timeout error in fetching million rows as results using clustering keys
Hi, Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it. Cheers, Jens – Sent from Mailbox On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta meme...@cs.stonybrook.edu wrote: Hi, I have a requirement to fetch a million rows as the result of my query, which is giving timeout errors. I am fetching results by selecting clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster as per my requirement. My table definition is:

CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid));

Here each row is uniquely identified by its unique uuid, but since my data is generally queried by image_caseid, I have made that the partition key. I am currently using the DataStax Java driver to fetch the results.
But the query is taking a lot of time, resulting in timeout errors:

Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
	at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
	at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
	at QueryDB.queryArea(TestQuery.java:59)
	at TestQuery.main(TestQuery.java:35)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
	at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
	at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Also when I try the same query on the console, even while using a limit of 2000 rows:

cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area < 100 and Area > 20 limit 2000;
errors={}, last_host=127.0.0.1

Thanks and Regards, Mehak
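[Editor's note: the fetch-size advice in this thread can be sketched with the DataStax Java driver 2.x. This is a sketch only, assuming a reachable cluster on localhost; the fetch size of 2000 is the value suggested later in the thread, and `process` is a hypothetical handler:

```java
// Sketch: page through a large partition instead of fetching it in one response.
// Assumes the DataStax Java driver 2.x on the classpath.
Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
Session session = cluster.connect();

Statement stmt = new SimpleStatement(
        "SELECT * FROM images.results WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1'");
stmt.setFetchSize(2000);  // rows per page; the server timeout now applies per page

ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    process(row);  // the driver transparently fetches the next page as you iterate
}
```

With paging, each round trip only has to return one page within the server-side timeout, which is why a huge fetch size (or a single-shot million-row query) times out while 1000-2000 does not.]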
Re: Timeout error in fetching million rows as results using clustering keys
Hi Jens, I have tried with a fetch size of 1 and still it's not giving any results. My expectation was that Cassandra can handle a million rows easily. Is there any mistake in the way I am defining the keys or querying them? Thanks Mehak On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil jens.ran...@tink.se wrote: [...]
Re: Timeout error in fetching million rows as results using clustering keys
Have you tried a smaller fetch size, such as 5k - 2k ? On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta meme...@cs.stonybrook.edu wrote: [...]
Re: Timeout error in fetching million rows as results using clustering keys
Yes, it works for 1000 but not more than that. How can I fetch all rows using this efficiently? On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar ali.rac...@gmail.com wrote: [...]
Re: schema generation in cassandra
Thanks a lot for your responses! My question is: what are the best practices for database schema deployment for a microservice in a cloud environment? E.g., should the schema be created along with deployment of the microservice, generated via code, or not generated via code at all and instead managed separately? On Wed, Mar 18, 2015 at 3:29 PM, Ali Akhtar ali.rac...@gmail.com wrote: Why are you creating new tables dynamically? I would try to use a static schema and use a collection (list / map / set) for storing arbitrary data. On Wed, Mar 18, 2015 at 2:52 PM, Ankit Agarwal agarwalankit.k...@gmail.com wrote: Hi, I am new to Cassandra. We are planning to use Cassandra for a cloud-based application in our development environment, so I am looking for the best strategies to sync the schema for micro-services while deploying the application on Cloud Foundry. One way I could use is an Accessor interface with the DataStax mapper and the Cassandra core driver.

1.) I have created a keyspace using the core driver, generated on initialization of a servlet:

public void init() throws ServletException {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();
    String keySpace = "sampletest";
    session.execute("CREATE KEYSPACE IF NOT EXISTS " + keySpace
        + " WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }");
    ...
}

2.) This is my Accessor interface, which I used to generate the query for creating the column family:

@Accessor
public interface UserAccessor {
    @Query("CREATE TABLE sampletest.emp (id uuid PRIMARY KEY, name text, department text, location text, phone bigint) "
        + "WITH caching = '{ \"keys\" : \"ALL\", \"rows_per_partition\" : \"NONE\" }'")
    ResultSet create_table();
}

3.) Creating an instance of the Accessor interface to map the query that generates the column family:

MappingManager mapper = new MappingManager(session);
UserAccessor ua = mapper.createAccessor(UserAccessor.class);
ua.create_table();

4.) So far I have created a keyspace with a column family; now I want to map my data using the POJO class mentioned below:

@Table(keyspace = "sampletest", name = "emp")
public class Employee {
    @PartitionKey
    private UUID id;
    private String name;
    private String department;
    private String location;
    private Long phone;
    // getter/setter methods
    ...
}

Is there any other, better approach to achieve this, especially for a cloud environment? -- Thanks Ankit Agarwal -- Thanks & Regards Ankit Agarwal +91-9953235575
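[Editor's note: Ali's static-schema suggestion above can be sketched in CQL. The `extras` map column is a hypothetical name, not from the original code:

```sql
-- One fixed table instead of dynamically created ones; arbitrary per-row
-- fields go into a map collection (hypothetical 'extras' column).
CREATE TABLE sampletest.emp (
    id uuid PRIMARY KEY,
    name text,
    department text,
    location text,
    phone bigint,
    extras map<text, text>
);

-- Writes can then attach ad-hoc attributes without any schema change:
INSERT INTO sampletest.emp (id, name, extras)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ankit', { 'badge_color' : 'blue' });
```

One caveat: collections are read whole, so a map suits a modest number of small ad-hoc fields, not unbounded data.]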
Re: Problems after trying a migration
Hi David; some input to get back to where you were:
a) Start with the French cluster only and get it working with DSE 4.5.1.
b) The OpsCenter keyspace is by default RF 1; alter the keyspace to RF 3.
c) Take a full snapshot of all your nodes and copy the files to a safe location on all the nodes.
To migrate the data into the new cluster:
a) Use the same version, DSE 4.5.1, in Luxembourg and bring up 1 node at a time. Check that the node has come up in the new datacenter.
b) Bring up new nodes into the new datacenter one at a time.
c) After all your new nodes are UP in Luxembourg, conduct a parallel 'nodetool repair'.
d) Check in OpsCenter that you have all your nodes showing up (new and old).
e) Start taking down your nodes in France, one at a time.
f) After all the nodes in France are down, conduct a parallel 'nodetool repair' again.
g) Upgrade the nodes in Luxembourg to DSE 4.6.1.
h) Conduct a parallel 'nodetool repair' again.
i) Upgrade to OpsCenter 5.1.
Best of luck, hope this helps. Jan/ On Wednesday, March 18, 2015 1:01 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Mar 18, 2015 at 9:05 AM, David CHARBONNIER david.charbonn...@rgsystem.com wrote: - New nodes in the other country have been installed like the French nodes except for the DataStax Enterprise version (4.5.1 in France and 4.6.1 in the other country, which means Cassandra version 2.0.8.39 in France and 2.0.12.200 in the other country) This is officially unsupported, and might cause problems during this process. =Rob
Re: Upgrade from 1.2.19 to 2.0.12 -- seeing lots of SliceQueryFilter messages in system.log
On Wed, Mar 18, 2015 at 10:14 AM, Caraballo, Rafael rafael.caraba...@twcable.com wrote: After upgrading a 3 node Cassandra cluster from 1.2.19 to 2.0.12, I have an event storm of "SliceQueryFilter" messages flooding the Cassandra system.log file. How can I stop this event storm? As the message says: (see tombstone_warn_threshold). The thing you are being warned about is that your write pattern results in a significant number of tombstones. In general this is a smell of badness in Cassandra, which is why the log message exists. To resolve: 1) Increase tombstone_warn_threshold, OR 2) Stop creating so many tombstones. =Rob http://twitter.com/rcolidba
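[Editor's note: option 1 is a cassandra.yaml change. The default values shown are the 2.0.x defaults to the best of my recollection; verify against your own yaml:

```yaml
# cassandra.yaml -- tombstone thresholds (2.0.x defaults, as I recall)
tombstone_warn_threshold: 1000       # log a warning when a slice reads this many tombstones
tombstone_failure_threshold: 100000  # abort the read beyond this many
```

Raising the warn threshold only silences the log; the tombstones are still scanned on every read, so option 2 (plus letting compaction purge them after gc_grace_seconds) addresses the actual cost.]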
Limit on number of columns
Hello, For the limit on the number of cells (http://wiki.apache.org/cassandra/CassandraLimitations) (columns * rows) per partition, I wonder what we mean by number of columns, since different rows may have different columns? Is the number of columns the number of columns of the biggest row, or the union of all columns across all rows? E.g. if I have two rows, one has ten columns and the other has ten different columns, would that be considered a total of ten or twenty columns? Thanks! Best, Oliver Oliver Ruebenacker | Solutions Architect Altisource™ 290 Congress St, 7th Floor | Boston, Massachusetts 02210 P: (617) 728-5582 | ext: 275585 oliver.ruebenac...@altisource.com | www.Altisource.com *** This email message and any attachments are intended solely for the use of the addressee. If you are not the intended recipient, you are prohibited from reading, disclosing, reproducing, distributing, disseminating or otherwise using this transmission. If you have received this message in error, please promptly notify the sender by reply email and immediately delete this message from your system. This message and any attachments may contain information that is confidential, privileged or exempt from disclosure. Delivery of this message to any person other than the intended recipient is not intended to waive any right or privilege. Message transmission is not guaranteed to be secure or free of software viruses. ***
Re: Limit on number of columns
On Wed, Mar 18, 2015 at 12:43 PM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: For the limit on the number of cells (http://wiki.apache.org/cassandra/CassandraLimitations) (columns * rows) per partition, I wonder what we mean by number of columns, since different rows may have different columns? Is the number of columns the number of columns of the biggest row, or the union of all columns across all rows? E.g. if I have two rows, one has ten columns and the other has ten different columns, would that be considered a total of ten or twenty columns? I tend to still think of this in terms of storage partitions and how many storage columns a given one may contain. It's possible that the apache doc has not been updated to reflect the new language of partitions and cells, etc. A given partition can contain 2Bn storage columns, regardless of how many other columns there are in other partitions. =Rob
Re: Problems after trying a migration
On Wed, Mar 18, 2015 at 12:58 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Mar 18, 2015 at 9:05 AM, David CHARBONNIER david.charbonn...@rgsystem.com wrote: - New nodes in the other country have been installed like the French nodes except for the DataStax Enterprise version (4.5.1 in France and 4.6.1 in the other country, which means Cassandra version 2.0.8.39 in France and 2.0.12.200 in the other country) This is officially unsupported, and might cause problems during this process. As regards your other situation, I suggest joining #cassandra and pointing people there towards your summary and interactively discussing it with them. Mailing list lag is not best for operational issues. :) =Rob
Re: Limit on number of columns
Generally a concern about limitations on the number of columns is a concern about storage for rows in a partition. Cassandra is a column-oriented database, but this really refers to its cell-oriented storage structure, with each column name and column value pair being a single cell (except collections, which may occupy multiple cells per column, one for each value in the collection). So, the issue is not the total number of column names used, but the total number of cells used in a partition. So, for your example, you have 20 cell values and... 20 column names. -- Jack Krupansky On Wed, Mar 18, 2015 at 3:43 PM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: [...]
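[Editor's note: Jack's cell-counting rule is just a product per partition, which can be stated as a tiny sketch; the class name is made up for illustration:

```java
// Sketch: cells-per-partition arithmetic for the ~2 billion cells-per-partition cap.
// Under the storage-engine accounting, each (row, populated non-key column) pair
// in a partition is one cell; how many *distinct* column names exist is irrelevant.
public class PartitionCellEstimate {
    public static long cells(long rowsInPartition, long populatedColumnsPerRow) {
        return rowsInPartition * populatedColumnsPerRow;
    }

    public static void main(String[] args) {
        // Oliver's example: two rows, ten populated columns each -> 20 cells,
        // even though 20 distinct column names are involved.
        System.out.println(cells(2, 10));
        // A million rows of 20 columns in one partition, well under ~2e9:
        System.out.println(cells(1_000_000, 20));
    }
}
```
]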