timezone time series data model

2012-04-29 Thread samal
Hello List,

I need a suggestion/recommendation on time series data.

I have a requirement where users belong to different timezones and can
subscribe to a global group.
When a user in one timezone sends an update to the group, it becomes
available to every user in the other timezones.

I am using a GroupSubscribedUsers CF where every update to a group is pushed
to each user's timeline; the row key is useruuid_date (one day of updates
across all groups) and the columns are the group updates.

GroupSubscribedUsers = {
  user1uuid_30042012: { // this user belongs to the same timezone
    timeuuid1: JSON[group1update1]
    timeuuid2: JSON[group2update2]
    timeuuid3: JSON[group1update2]
    timeuuid4: JSON[group4update1]
  },
  user2uuid_30042012: { // this user belongs to a different timezone where the
                        // date has already changed to 1 May, but 30 April is
                        // still receiving updates
    timeuuid1: JSON[group1update1]
    timeuuid2: JSON[group2update2]
    timeuuid3: JSON[group1update2]
    timeuuid4: JSON[group4update1]
    timeuuid5: JSON[groupNupdate1]
  },
}

I have noticed this approach works well for a single timezone, but it breaks
down when different timezones come into the picture.

I am thinking of something like: when a user pushes an update to a group ->
get the users subscribed to the group -> check each user's timezone -> push
the time series entry in the user's timezone. So for one user the update will
be on 30 April, whereas for others it may be on 29 April or 1 May; using
timestamps I can work out how many hours ago the update came.
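
A minimal sketch of that per-subscriber bucketing (Java; the helper name and
the ddMMyyyy bucket format are assumptions based on the keys above):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimelineKey {
    // Hypothetical helper: derive the per-user day bucket for the row key
    // from the update's UTC timestamp and the subscriber's timezone.
    static String rowKeyFor(String userUuid, String tzId, long updateMillis) {
        SimpleDateFormat day = new SimpleDateFormat("ddMMyyyy");
        day.setTimeZone(TimeZone.getTimeZone(tzId));
        return userUuid + "_" + day.format(new Date(updateMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // The same update can land in different day buckets per subscriber.
        System.out.println(rowKeyFor("user1uuid", "UTC", now));
        System.out.println(rowKeyFor("user2uuid", "Pacific/Auckland", now));
    }
}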

Is there any better approach?


Thanks,

>>>Samal


Re: Cassandra backup question regarding commitlogs

2012-04-29 Thread Roshan
Hi Aaron

Thanks for the comments. Yes, for durability I will keep them in a safe
place. But in such a crash situation, how can I restore the data (because it
is not yet in an SSTable and only in the commit log)?

Do I need to replay only that commit log when the server starts after the
crash? Will it overwrite the same keys with the values?

Appreciate your reply on this.

Kind Regards

/Roshan



incremental_backups

2012-04-29 Thread Tamar Fraenkel
Hi!
I wonder what the advantages of an incremental snapshot are over a
non-incremental one?
Are the snapshots smaller in size? Are there any other implications?
Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956

Re: Cassandra backup question regarding commitlogs

2012-04-29 Thread aaron morton
> 1. If I already have a Cassandra cluster running, would changing the  
> incremental_backups parameter in the cassandra.yaml of each node, and then 
> restarting it do the trick?
Yes it is a per node setting. 

> 2. Assuming I am creating a daily snapshot, what is the gain from setting 
> incremental backup to true?

Better point in time recovery on a node. 
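
For reference, the relevant cassandra.yaml setting looks like this (per node;
a restart picks it up):

incremental_backups: true

Hard-linked SSTables will then appear under a backups/ directory inside each
keyspace's data directory.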

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/04/2012, at 6:41 PM, Tamar Fraenkel wrote:

> I want to add a couple of questions regarding incremental backups:
> 1. If I already have a Cassandra cluster running, would changing the  
> incremental_backups parameter in the cassandra.yaml of each node, and then 
> restarting it do the trick?
> 2. Assuming I am creating a daily snapshot, what is the gain from setting 
> incremental backup to true?
> 
> Thanks,
> Tamar
> 
> Tamar Fraenkel 
> Senior Software Engineer, TOK Media 
> 
> 
> 
> ta...@tok-media.com
> Tel:   +972 2 6409736 
> Mob:  +972 54 8356490 
> Fax:   +972 2 5612956 
> 
> 
> 
> 
> 
> On Sat, Apr 28, 2012 at 4:04 PM, Roshan  wrote:
> Hi
> 
> Currently I am taking a daily snapshot of my keyspace in production and
> have already enabled incremental backups as well.
> 
> According to the documentation, the incremental backup option will create a
> hard-link in the backup folder when a new sstable is flushed. Snapshot will
> copy all the data/index/etc. files to a new folder.
> 
> *Question:*
> What will happen (with incremental backup enabled) when Cassandra crashes
> (due to any reason) before flushing the data as an SSTable (the inserted
> data is still in the commitlog)? In this case how can I backup/restore data?
> 
> Do I need to backup the commitlogs as well and replay them during the server
> start to restore the data in the commitlog files?
> 
> Thanks.
> 
> 
> 



Re: Cassandra backup question regarding commitlogs

2012-04-29 Thread aaron morton
Each mutation is applied to the commit log before being applied to the 
memtable. On server start the SSTables are read before replaying the commit 
logs. This is part of the crash-only software design and happens on every 
start.

AFAIK there is no facility to snapshot commit log files as they are closed. The 
best advice would be to keep them on a mirrored set for durability. 
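
If you also want copies off the box, one hedged approach (paths are
illustrative) is to sync the commit log directory on a schedule, e.g.:

rsync -a /var/lib/cassandra/commitlog/ backup-host:/backups/commitlog/

bearing in mind Cassandra recycles segments after flush, so such a copy is
only a point-in-time approximation.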

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/04/2012, at 1:04 AM, Roshan wrote:

> Hi
> 
> Currently I am taking a daily snapshot of my keyspace in production and
> have already enabled incremental backups as well.
> 
> According to the documentation, the incremental backup option will create a
> hard-link in the backup folder when a new sstable is flushed. Snapshot will
> copy all the data/index/etc. files to a new folder.
> 
> *Question:*
> What will happen (with incremental backup enabled) when Cassandra crashes
> (due to any reason) before flushing the data as an SSTable (the inserted
> data is still in the commitlog)? In this case how can I backup/restore data?
> 
> Do I need to backup the commitlogs as well and replay them during the server
> start to restore the data in the commitlog files?
> 
> Thanks.
> 
> 
> 



Re: nodetool repair cassandra 0.8.4 HELP!!!

2012-04-29 Thread aaron morton
When you start a node, does it log that it's opening SSTables?

After starting, what does nodetool cfstats say for the node?

Can you connect with cassandra-cli and do a get?
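
For example (host, keyspace and CF names are placeholders):

nodetool -h localhost cfstats

and in cassandra-cli:

use MyKeyspace;
get MyColumnFamily['some_row_key'];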

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/04/2012, at 10:45 PM, Raj N wrote:

> I tried it on 1 column family. I believe there is a bug in 0.8* where repair 
> ignores the cf. I tried this multiple times on different nodes. Every time 
> the disk util was going up to 80% on a 500 GB disk. I would eventually kill 
> the repair. I only have 60GB worth data. I see this JIRA -
> 
> https://issues.apache.org/jira/browse/CASSANDRA-2324 
> 
> But that says it was fixed in 0.8 beta. Is this still broken in 0.8.4?
> 
> I also don't understand why the data was inconsistent in the first place. I 
> read and write at LOCAL_QUORUM. 
> 
> Thanks
> -Raj
> 
> On Sun, Apr 29, 2012 at 2:06 AM, Watanabe Maki  
> wrote:
> You should run repair. If the disk space is the problem, try to cleanup and 
> major compact before repair.
> You can limit the streaming data by running repair for each column family 
> separately.
> 
> maki
> 
> On 2012/04/28, at 23:47, Raj N  wrote:
> 
> > I have a 6 node cassandra cluster DC1=3, DC2=3 with 60 GB data on each 
> > node. I was bulk loading data over the weekend. But we forgot to turn off 
> > the weekly nodetool repair job. As a result, repair was interfering when we 
> > were bulk loading data. I canceled repair by restarting the nodes. But 
> > unfortunately after the restart it looks like I don't have any data on those 
> > nodes when I use list on cassandra-cli. I ran repair on one of the affected 
> > nodes, but repair seems to be taking forever. Disk space has almost 
> > tripled. I stopped the repair again in fear of running out of disk space. 
> > After restart, the disk space is at 50% whereas the good nodes are at 25%. 
> > How should I proceed from here? When I run list on cassandra-cli I do see 
> > data on the affected node. But how can I be sure I have all the data? 
> > Should I run repair again? Should I clean up the disk by clearing snapshots? 
> > Or should I just drop the column families and bulk load the data again?
> >
> > Thanks
> > -Raj
> 



Re: Can column type be changed dynamically?

2012-04-29 Thread aaron morton
That sounds right to me. 

A
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/04/2012, at 5:00 AM, Paolo Bernardi wrote:

> Apparently IntegerType is based on Java's BigInteger.
> 
> http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=src/java/org/apache/cassandra/db/marshal/IntegerType.java;hb=HEAD
> 
> Given the message, I suspect that you got some values between -2^15 and 
> 2^15-1 (the range of a short int) that have been serialized with two bytes. 
> Any confirmation on this?
> 
> If this is true, changing the type like you tried to do might not be so 
> straightforward.
> 
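> As a quick illustration (hypothetical values; IntegerType serializes via
> BigInteger.toByteArray(), which uses the minimal number of bytes):
> 
> import java.math.BigInteger;
> 
> public class VarintWidth {
>     public static void main(String[] args) {
>         System.out.println(BigInteger.valueOf(77).toByteArray().length);  // 1
>         System.out.println(BigInteger.valueOf(300).toByteArray().length); // 2
>         // Int32Type validation wants exactly 4 bytes, hence the
>         // "A int is exactly 4 bytes: 2" error on a two-byte value.
>     }
> }
> 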
> Paolo
> 
> On Apr 27, 2012 6:55 PM, "马超"  wrote:
> After I update the column type:
> 
> update column family User with column_metadata = [{column_name : 77, 
> validation_class : Int32Type}];
> 
> I can't list the data in User column family:
> 
> list User;
> 
> RowKey: 1234
> A int is exactly 4 bytes: 2
> 
> Any ideas for this?
> 
> Thanks,
> 
> 2012/4/27 马超 
> Thanks a lot!
> I will go ahead~
> 
> 
> 2012/4/27 Sylvain Lebresne 
> On Fri, Apr 27, 2012 at 5:26 PM, 马超  wrote:
> > Hi all,
> >
> > I want to change one of my column type from IntegerType to Int32Type
> > dynamically. I'm sure all data in that column is int32 type indeed. So I
> > want to change the column type by:
> >
> > update column family XXX with column_metadata = [{column_name : 'xxx',
> > validation_class : Int32Type}];
> >
> > Is there any harm to do this?
> 
> There isn't (as long as you're right to be sure of course).
> 
> --
> Sylvain
> 
> 



Re: AssertionError: originally calculated column size ...

2012-04-29 Thread aaron morton
Looks a bit like https://issues.apache.org/jira/browse/CASSANDRA-3579 but that 
was fixed in 1.0.7

Is this still an issue? Are you able to reproduce the fault?

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/04/2012, at 6:56 PM, Patrik Modesto wrote:

> Hi,
> 
> I've a 4 node cluster of Cassandra 1.0.9. There is a rfTest3 keyspace
> with RF=3 and one CF with two secondary indexes. I'm importing data
> into this CF using a Hadoop MapReduce job; each row has fewer than 10
> columns. From JMX:
> MaxRowSize:  1597
> MeanRowSize: 369
> 
> And there are some tens of millions of rows.
> 
> It's write-heavy usage and there is big pressure on each node; there
> are quite a few dropped mutations on each node. After ~12 hours of
> inserting I see these assertion exceptions on 3 out of 4 nodes:
> 
> ERROR 06:25:40,124 Fatal exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.AssertionError: originally calculated column size of
> 629444349 but now it is 588008950
>at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:388)
>at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
>at 
> org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
>at 
> org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.AssertionError: originally calculated column size of
> 629444349 but now it is 588008950
>at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:384)
>... 7 more
> Caused by: java.lang.AssertionError: originally calculated column size
> of 629444349 but now it is 588008950
>at 
> org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:124)
>at 
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
>at 
> org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:161)
>at 
> org.apache.cassandra.db.compaction.CompactionManager$7.call(CompactionManager.java:380)
>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>... 3 more
> 
> 
> Few lines regarding Hints from the output.log:
> 
> INFO 06:21:26,202 Compacting large row
> system/HintsColumnFamily:7000 (1712834057
> bytes) incrementally
> INFO 06:22:52,610 Compacting large row
> system/HintsColumnFamily:1000 (2616073981
> bytes) incrementally
> INFO 06:22:59,111 flushing high-traffic column family
> CFS(Keyspace='system', ColumnFamily='HintsColumnFamily') (estimated
> 305147360 bytes)
> INFO 06:22:59,813 Enqueuing flush of
> Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live
> bytes, 7452 ops)
> INFO 06:22:59,814 Writing
> Memtable-HintsColumnFamily@833933926(3814342/305147360 serialized/live
> bytes, 7452 ops)



Re: Node join streaming stuck at 100%

2012-04-29 Thread aaron morton
Did you restart? All good?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/04/2012, at 9:49 AM, Bryce Godfrey wrote:

> This is the second node I’ve joined to my cluster in the last few days, and 
> so far both have become stuck at 100% on a large file according to netstats.  
> This is on 1.0.9, is there anything I can do to make it move on besides 
> restarting Cassandra?  I don’t see any errors or warns in logs for either 
> server, and there is plenty of disk space.
>  
> On the sender side I see this:
> Streaming to: /10.20.1.152
>/opt/cassandra/data/MonitoringData/PropertyTimeline-hc-80540-Data.db 
> sections=1 progress=82393861085/82393861085 - 100%
>  
> On the node joining I don’t see this file in netstats, and all pending 
> streams are sitting at 0%
>  
>  
>  



Re: Data model question, storing Queue Message

2012-04-29 Thread aaron morton
Message Queue is often not a great use case for Cassandra. For information on 
how to handle high delete workloads see 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

It's hard to create a model without some idea of the data load, but I would 
suggest you start with:

CF: UserMessages
Key: ReceiverID
Columns : column name = TimeUUID ; column value = message ID and Body

That will order the messages by time. 

Depending on load (and to support deleting a previous month's messages) you may 
want to partition the rows by month:

CF: UserMessagesMonth
Key: ReceiverID+MM
Columns : column name = TimeUUID ; column value = message ID and Body

Everything is the same as before, but now a user has a row for each month, 
which you can delete as a whole. This also helps avoid very big rows. 
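
A sketch of how the month-partitioned key could be built (Java; the yyyyMM
suffix format and the helper name are assumptions):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class MessageKey {
    // Hypothetical helper: row key = receiver id + month bucket, so a whole
    // month of messages can be dropped with a single row-level delete.
    static String monthKey(long receiverId, long messageMillis) {
        SimpleDateFormat month = new SimpleDateFormat("yyyyMM");
        month.setTimeZone(TimeZone.getTimeZone("UTC"));
        return receiverId + "_" + month.format(new Date(messageMillis));
    }

    public static void main(String[] args) {
        System.out.println(monthKey(42L, System.currentTimeMillis()));
    }
}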

> I really don't think that storage will be an issue, I have 2TB per nodes, 
> messages are 1KB limited.
I would suggest you keep the per node limit to 300 to 400 GB. It can take a 
long time to compact, repair and move the data when it gets above 400GB. 

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/04/2012, at 1:30 AM, Morgan Segalis wrote:

> Hi everyone !
> 
> I'm fairly new to cassandra and I'm not quite yet familiarized with column 
> oriented NoSQL model.
> I have worked on it a while, but I can't seem to find the best model for 
> what I'm looking for.
> 
> I have Erlang software that lets users connect and communicate with each 
> other; when a user (A) sends a message to a disconnected user (B), it stores 
> the message in the database and waits for user (B) to connect and retrieve 
> the message queue, then deletes it. 
> 
> Here's some key point : 
> - Users are identified by integer IDs
> - Each message are unique by combination of : Sender ID - Receiver ID - 
> Message ID - time
> 
> I have a queue Message, and here's the operations I would need to do as fast 
> as possible : 
> 
> - Store from 1 to X messages per registered user
> - Get the number of stored messages per user (can be an incremental variable 
> updated at each store // this is retrieved often)
> - retrieve all messages for a user at once.
> - delete all messages for a user at once.
> - delete all messages that are older than Y months (from all users).
> 
> I really don't think that storage will be an issue, I have 2TB per nodes, 
> messages are 1KB limited.
> I'm really looking for speed rather than storage optimization.
> 
> My configuration is 2 dedicated server which are both :
> - 4 x Intel i7 2.66 GHz
> - 64 bits
> - 24 GB
> - 2 TB
> 
> Thank you all.



Re: Maintain sort order on updatable property and pagination

2012-04-29 Thread aaron morton
> Is there a better way to solve this in real time?
Not really. If however you can send a row-level delete before the insert you 
don't need to read first. Of course that deletes all the other data :)

If you create a secondary index on a column value, the index will be updated 
when you change the value. Note that it has to do the same thing you do: read 
and delete the old value. 

> Also for pagination, we have to set range for columnNames. If we know the 
> last page's last columnName we can get the next page. What if we want to go 
> from page 2 to page 6, this seems impossible as of now. Any suggestion?
You will need to read the intermediate pages. 
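
Reading through sequentially looks like this (a self-contained Java sketch; a
TreeMap stands in for the row's columns, and the real slice call depends on
your client):

import java.util.SortedMap;
import java.util.TreeMap;

public class PageThrough {
    public static void main(String[] args) {
        // A row's columns, ordered by column name (stand-in for a column slice).
        TreeMap<String, String> columns = new TreeMap<String, String>();
        for (int i = 0; i < 10; i++) columns.put(String.format("col%02d", i), "v" + i);

        int pageSize = 3;
        String start = ""; // empty start means "beginning of row"
        while (true) {
            // Fetch the slice from 'start'; after the first page the start
            // column itself has already been consumed, so skip it.
            SortedMap<String, String> slice = columns.tailMap(start, start.isEmpty());
            String last = null;
            int seen = 0;
            for (String name : slice.keySet()) {
                if (seen == pageSize) break; // full page read
                System.out.println("read " + name);
                last = name;
                seen++;
            }
            if (slice.size() <= pageSize) break; // no further page exists
            start = last; // next page starts after the last column read
        }
    }
}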

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/04/2012, at 11:28 PM, Rajat Mathur wrote:

> Hi All, 
> 
> I am using the property of columns that they are stored in sorted order to 
> keep sort orders (I believe everyone else is doing the same).
> But if I want to maintain a sort order on a property whose value changes, I 
> would have to perform read and delete operations. Is there a better way to 
> solve this in real time?
> 
> Also for pagination, we have to set range for columnNames. If we know the 
> last page's last columnName we can get the next page. What if we want to go 
> from page 2 to page 6, this seems impossible as of now. Any suggestion?
> 
> Thank you.
> 
> 



Re: Question regarding major compaction.

2012-04-29 Thread aaron morton
Depends on your definition of significantly; there are a few things to 
consider. 

* Reading from SSTables for a request is a serial operation. Reading from 2 
SSTables will take twice as long as 1. 

* If the data in the One Big File™ has been overwritten, reading it is a waste 
of time. And it will continue to be read until the row is compacted away. 

* You will need to get min_compaction_threshold (a CF setting) SSTables that big 
before automatic compaction will pick up the big file. 

On the other side: Some people do report getting value from nightly major 
compactions. They also manage their cluster to reduce the impact of performing 
the compactions.
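
For reference, the thresholds are per-CF and adjustable from cassandra-cli
(the values shown are the defaults; the CF name is a placeholder):

update column family MyCF with min_compaction_threshold = 4 and max_compaction_threshold = 32;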

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/04/2012, at 9:37 PM, Fredrik wrote:

> Exactly, but why would reads be significantly slower over time when including 
> just one more, although sometimes large, SSTable in the read?
> 
> Ji Cheng skrev 2012-04-26 11:11:
>> 
>> I'm also quite interested in this question. Here's my understanding on this 
>> problem.
>> 
>> 1. If your workload is append-only, doing a major compaction shouldn't 
>> affect the read performance too much, because each row appears in one 
>> sstable anyway. 
>> 
>> 2. If your workload is mostly updating existing rows, then more and more 
>> columns will be obsoleted in that big sstable created by major compaction. 
>> And that super big sstable won't be compacted until you either have another 
>> 3 similar-sized sstables or start another major compaction. But I am not 
>> very sure whether this will be a major problem, because you only end up with 
>> reading one more sstable. Using size-tiered compaction against mostly-update 
>> workload itself may result in reading multiple sstables for a single row 
>> key. 
>> 
>> Please correct me if I am wrong.
>> 
>> Cheng
>> 
>> 
>> On Thu, Apr 26, 2012 at 3:50 PM, Fredrik  
>> wrote:
>> In the tuning documentation regarding Cassandra, it's recomended not to run 
>> major compactions.
>> I understand what a major compaction is all about but I'd like an in depth 
>> explanation as to why reads "will continually degrade until the next major 
>> compaction is manually invoked".
>> 
>> From the doc:
>> "So while read performance will be good immediately following a major 
>> compaction, it will continually degrade until the next major compaction is 
>> manually invoked. For this reason, major compaction is NOT recommended by 
>> DataStax."
>> 
>> Regards
>> /Fredrik
>> 
> 



Re: Cassandra and harddrives

2012-04-29 Thread aaron morton
Also I would avoid using HAProxy if possible. The best judge of a node's 
availability is the client, and it can vary per row key. 

The exception is when you are using a web server that does not support state, 
such as PHP. The solution is not to use PHP.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/04/2012, at 4:41 PM, Maki Watanabe wrote:

> If your shared disk is super fast enough to handle IO requests from
> multiple cassandra node, you can do it in theory. And the disk will be
> the single point of failure in your system.
> For optimal performance, each node should have at least 2 hdd, one for
> commitlog and one for data.
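> 
> For example, in cassandra.yaml (paths are illustrative):
> 
> commitlog_directory: /disk1/cassandra/commitlog
> data_file_directories:
>     - /disk2/cassandra/data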
> 
> maki
> 
> 2012/4/26 samal :
>> Each node needs its own HDD for the multiple copies; it can't be shared with
>> other nodes.
>> 
>> 
>> On Thu, Apr 26, 2012 at 8:52 AM, Benny Rönnhager
>>  wrote:
>>> 
>>> Hi!
>>> 
>>> I am building a database with several hundred thousand images. I
>>> have just learned that HAProxy is a very good frontend to a couple of
>>> Cassandra nodes. I understand how that works but...
>>> 
>>> Must every single node (mac mini) have its own external hard drive with
>>> the same data (images) or can I just use one hard drive that can be
>>> accessed by all nodes?
>>> 
>>> What is the recommended way to do this?
>>> 
>>> Thanks in advance.
>>> 
>>> Benny
>> 
>> 



Re: EC2 Best Practices

2012-04-29 Thread aaron morton
> node that fail had the token id of 0 (this is the seed node - right?).
Seed nodes are listed in the seeds: section of the cassandra.yaml file. 

Using 0 as a token for a node is normal.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/04/2012, at 10:18 AM, Dave Brosius wrote:

> 0 is a perfectly valid id.
> 
> node - 1 is modulo the maximum token value; that token range is 0 - 2**127
> 
> so node - 1 in this case is 2**127
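> 
> For concreteness (Java; assumes the 0 - 2**127 range described above):
> 
> import java.math.BigInteger;
> 
> public class TokenMath {
>     public static void main(String[] args) {
>         BigInteger max = BigInteger.valueOf(2).pow(127); // top of the range
>         BigInteger failed = BigInteger.ZERO;             // failed node's token
>         // "failed token - 1", wrapping to the top when it goes negative
>         BigInteger replacement = failed.equals(BigInteger.ZERO)
>                 ? max : failed.subtract(BigInteger.ONE);
>         System.out.println(replacement); // 170141183460469231731687303715884105728
>     }
> }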
> 
> 
> - Original Message -
> From: "Deno Vichas"  
> Sent: Wed, April 25, 2012 18:11
> Subject: Re: EC2 Best Practices
> 
> 
>  
> has anybody written up anything related to recovery from failures in EC2?
> 
> this morning i woke up to find 1 (of 4) nodes marked as unreachable. i used 
> the datastax (1.0.7) ami to set up my cluster and the node that failed had 
> the token id of 0 (this is the seed node - right?). the docs say to replace 
> a failed node by setting the token to failed node - 1, which i don't think 
> would work. luckily rebooting the machine brought it back.
> 
> but i'm left with the question of what should i do if the node with the token 
> ID of 0 fails, and should i even have a node with that ID.
> 
> 
> thanks,
> deno
> 
> 
> 
> On 2/23/2012 11:21 AM, aaron morton wrote:
>> General EC2 setup.
>>  
>> http://www.datastax.com/docs/1.0/install/install_ami
>> http://wiki.apache.org/cassandra/CloudConfig
>>  
>> Cassandra with a VPN on EC2. From memory it talks about using the VPN within 
>> EC2. 
>> http://blog.vcider.com/2011/09/running-cassandra-on-a-virtual-network-in-ec2/
>>  
>> Clients need a single port (9160) to talk to the cluster.
>>  
>> Hope that helps. 
>> 
>>  
>> 
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 24/02/2012, at 3:46 AM, Philip Shon wrote:
>> 
>>> Are there any good resources for best practices when running Cassandra 
>>> within EC2? I'm particularly interested in the security issues, when the 
>>> servers communicating w/ Cassandra are outside of EC2.
>>>  
>>> Thanks,
>>>  
>>> -Phil
> 
>  



Re: Bad Request: No indexed columns present in by-columns clause with "equals" operator

2012-04-29 Thread aaron morton
Check there is a single schema version on the cluster; in cassandra-cli use:

describe cluster;


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/04/2012, at 3:33 AM, mdione@orange.com wrote:

> From: mdione@orange.com [mailto:mdione@orange.com]
>>  Should I understand that when the indexes are finished being built a)
>> the «Built indexes» list should be empty and b) there should be no pending
>> compactions? Because that's exactly what I have now but I still can't
>> use the column HBX_FIL_STATUS in the where clause...
> 
>  Ok, answering myself a little: it's more strange than that. When indexes are 
> finished being built, they appear in the list, and there should be no 
> compactions 
> or anything else in «nodetool compactionstats», or at least none related to 
> indexes.
> 
>  Also useful: you'll see this type of message in the system log (if the log 
> level is low enough):
> 
> INFO [Creating index: hbx_file_test2.hbx_file_test2_hbx_fil_date_idx] 
> 2012-04-25 16:04:48,666 SecondaryIndex.java (line 191) Index build of 
> hbx_file_test2.hbx_file_test2_hbx_fil_date_idx complete
> 
>  Now, back to my problem. We have a 6 node cluster in 2 datacenters, 
> the KS has an RF of 4 (2 in each DC). I created the index on node 1, but
> node 4 did all the work (we have log level down to debug, so we can see 
> these kind of things, and also the creation appears in the output of 
> «nodetool compactionstats» as noted above). We have seen similar messages 
> in 4 nodes, but not all 6. And now we can make the query on any of the 4 
> nodes with success.
> 
> But, and of course there's a but, we can't on the other 2; we still 
> get the error in the subject. And of course we see a difference in 
> «describe avatars»: we get «Built indexes: 
> [HBX_FILE.HBX_FILE_HBX_FIL_STATUS_idx]»
> for the 4 nodes where the query works, and «Built indexes: []» for the other 2. 
> 
>  No surprises there, except that it doesn't make any sense to us. 
> How to continue?
> 
> --
> Marcos Dione
> SysAdmin
> Astek Sud-Est
> pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo
> 04 97 12 62 45 - mdione@orange.com
> 
> 



Re: Crash by truncate with cassandra 1.1

2012-04-29 Thread aaron morton
Did you get a solution for this one?

It looks like you ran out of memory on the machine…

> Caused by: java.lang.OutOfMemoryError: Map failed
> at sun.nio.ch.FileChannelImpl.map0(Native Method)
> ... 7 more
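
"Map failed" from FileChannelImpl.map0 usually means the process ran out of
address space for mmap rather than Java heap; the startup log below shows a
32-bit HotSpot Client VM with a ~1GB heap, which leaves little of the 2GB
process address space for mapped files such as commit log segments. A hedged
first step is to move to a 64-bit JVM, or to shrink the heap to leave room for
mappings, e.g.:

-Xms512M -Xmx512M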

cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/04/2012, at 7:36 PM, Pierre Chalamet wrote:

> Hi,
>  
> With Cassandra 1.1, I have the following crash on a fresh new single node 
> cluster running on Windows 7.
>  
> On client side:
> create keyspace toto;
> create column family titi;
> truncate titi;
>  
> The crash on the server side (the server is dead afterwards):
>  
> D:\Cassandra\apache-cassandra-1.1.0\bin>cassandra.bat
> Starting Cassandra Server
> INFO 09:30:29,768 Logging initialized
> INFO 09:30:29,889 JVM vendor/version: Java HotSpot(TM) Client VM/1.6.0_31
> INFO 09:30:29,889 Heap size: 1070399488/1070399488
> INFO 09:30:29,890 Classpath: 
> D:\Cassandra\apache-cassandra-1.1.0\conf;D:\Cassandra\apache-cassandra-1.1.0\lib\antlr-3.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\apache-cassandra-1.1.0.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\apache-cassandra-clientutil-1.1.0.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\apache-cassandra-thrift-1.1.0.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\avro-1.4.0-fixes.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\avro-1.4.0-sources-fixes.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\commons-cli-1.1.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\commons-codec-1.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\commons-lang-2.4.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\compress-lzf-0.8.4.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\concurrentlinkedhashmap-lru-1.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\guava-r08.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\high-scale-lib-1.1.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\jackson-core-asl-1.9.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\jackson-mapper-asl-1.9.2.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\jamm-0.2.5.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\jline-0.9.94.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\json-simple-1.1.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\libthrift-0.7.0.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\log4j-1.2.16.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\metrics-core-2.0.3.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\servlet-api-2.5-20081211.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\slf4j-api-1.6.1.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\slf4j-log4j12-1.6.1.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\snakeyaml-1.6.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\snappy-java-1.0.4.1.jar;D:\Cassandra\apache-cassandra-1.1.0\lib\snaptree-0.1.jar;D:\Cassandra\apache-cassandra-1.1.0\build\classes\main;D:\Cassandra\apache-cassandra-1.1.0\build\classes\thrift;D:\Cassandra\apache-cassandra-1.1.0\lib\jamm-0.2.5.jar
> INFO 09:30:29,901 JNA not found. Native methods will be disabled.
> INFO 09:30:29,955 Loading settings from 
> file:/D:/Cassandra/apache-cassandra-1.1.0/conf/cassandra.yaml
> INFO 09:30:30,243 DiskAccessMode 'auto' determined to be standard, 
> indexAccessMode is standard
> INFO 09:30:30,708 Global memtable threshold is enabled at 340MB
> INFO 09:30:31,478 Initializing key cache with capacity of 51 MBs.
> INFO 09:30:31,573 Scheduling key cache save to each 14400 seconds (going to 
> save all keys).
> INFO 09:30:31,576 Initializing row cache with capacity of 0 MBs and provider 
> org.apache.cassandra.cache.SerializingCacheProvider
> INFO 09:30:31,614 Scheduling row cache save to each 0 seconds (going to save 
> all keys).
> INFO 09:30:32,101 Couldn't detect any schema definitions in local storage.
> INFO 09:30:32,101 Found table data in data directories. Consider using the 
> CLI to define your schema.
> INFO 09:30:32,174 No commitlog files found; skipping replay
> INFO 09:30:32,235 Cassandra version: 1.1.0
> INFO 09:30:32,237 Thrift API version: 19.30.0
> INFO 09:30:32,252 CQL supported versions: 2.0.0,3.0.0-beta1 (default: 2.0.0)
> INFO 09:30:32,589 Loading persisted ring state
> INFO 09:30:32,592 Starting up server gossip
> INFO 09:30:32,634 Enqueuing flush of Memtable-LocationInfo@24914065(126/157 
> serialized/live bytes, 3 ops)
> INFO 09:30:32,635 Writing Memtable-LocationInfo@24914065(126/157 
> serialized/live bytes, 3 ops)
> INFO 09:30:33,193 Completed flushing 
> D:\Cassandra\apache-cassandra-1.1.0\storage\data\system\LocationInfo\system-LocationInfo-hc-1-Data.db
>  (234 bytes)
> INFO 09:30:33,254 Starting Messaging Service on port 7000
> INFO 09:30:33,262 This node will not auto bootstrap because it is configured 
> to be a seed node.
> INFO 09:30:33,271 Saved token not found. Using 0 from configuration
> INFO 09:30:33,274 Enqueuing flush of Memtable-LocationInfo@5474676(38/47 
> serialized/live bytes, 2 ops)
> INFO 09:30:33,275 Writing Memtable-LocationInfo@5474676(38/47 serialized/live 
> bytes, 2 ops)
> INFO 09:30:33,507 Completed flushing 
> D:\Cassandra\apache-cassandra-1.1.0\storage\data\system\

Re: Cassandra search performance

2012-04-29 Thread Maxim Potekhin
Jason,

I'm using plenty of secondary indexes with no problem at all.

Looking at your example, as I think you understand, you forgo indexes by
combining two conditions in one query, thinking along the lines of what is
often done in an RDBMS. A scan is expected in this case, and there is no
magic to avoid it.

However, if this query is important, you can easily index on two conditions
using a composite type (look it up), or string concatenation for a quick and
easy solution. That is, you _create an additional column_ which contains a
combination of the two you want to use in a query. Then index on it.
Problem solved.
The composite solution is more elegant, but what I describe works in
simple cases.
It works for me.
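
As a concrete (hedged) illustration with the CF from this thread: write an
extra column, say status_partition, whose value is status + ":" + partition,
and index that single column instead (the column name and the value format
are assumptions for the example):

update column family queue with column_metadata = [{column_name : 'status_partition', validation_class : AsciiType, index_type : KEYS}];

Then the query becomes a single indexed equality, e.g. in cassandra-cli:

get queue where 'status_partition' = 'Pending:P1';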

Maxim


On 4/25/2012 10:45 AM, Jason Tang wrote:
> 1.0.8
>
> On April 25, 2012 at 10:38 PM, Philip Shon wrote:
>
> what version of cassandra are you using? I found a big performance
> hit when querying on the secondary index.
>
> I came across this bug in versions prior to 1.1
>
> https://issues.apache.org/jira/browse/CASSANDRA-3545
>
> Hope that helps.
>
> 2012/4/25 Jason Tang  >
>
> And I found that if I only have the search condition "status", it
> only scans 200 records.
>
> But if I combine it with another condition, "partition", then it scans
> all records, because the "partition" condition matches all records.
>
> But combined with another condition such as "userName", even though all
> "userName" values are the same across the 1,000,000 records, it only
> scans 200 records.
>
> So it is impacted by the scan execution plan. If we have several
> search conditions, how does it work? Do we have a similar
> execution plan in Cassandra?
>
>
> On April 25, 2012 at 9:18 PM, Jason Tang wrote:
>
> Hi
>
> We have the following CF, and use a secondary index to search on a
> simple column "status"; among 1,000,000 rows, we
> have 200 records with the status we want.
>
> But when we start to search, the performance is very poor,
> and checking with the command "./bin/nodetool -h localhost -p
> 8199 cfstats", Cassandra read 1,000,000 records, and
> "Read Latency" is 0.2 ms, so in total it took 200 seconds.
>
> It uses lots of CPU, and checking the stack, all threads in
> Cassandra are reading from sockets.
>
> So I wonder how to really use the index to find the 200
> records instead of scanning all rows. (Super Column?)
>
> ColumnFamily: queue
> Key Validation Class: org.apache.cassandra.db.marshal.BytesType
> Default column value validator: org.apache.cassandra.db.marshal.BytesType
> Columns sorted by: org.apache.cassandra.db.marshal.BytesType
> Row cache size / save period in seconds / keys to save : 0.0/0/all
> Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
> Key cache size / save period in seconds: 0.0/0
> GC grace seconds: 0
> Compaction min/max thresholds: 4/32
> Read repair chance: 0.0
> Replicate on write: false
> Bloom Filter FP chance: default
> Built indexes: [queue.idxStatus]
> Column Metadata:
> Column Name: status (737461747573)
> Validation Class: org.apache.cassandra.db.marshal.AsciiType
> Index Name: idxStatus
> Index Type: KEYS
>
> BRs
> /Jason
>
>
>
>



Cassandra backup question regarding commitlogs

2012-04-29 Thread Roshan
Hi

Currently I am taking a daily snapshot of my keyspace in production and
have already enabled incremental backups as well.

According to the documentation, the incremental backup option will create a
hard-link in the backup folder when a new sstable is flushed. Snapshot will
copy all the data/index/etc. files to a new folder.

Question:
What will happen (with incremental backup enabled) when Cassandra crashes (due
to any reason) before flushing the data as an SSTable (the inserted
data is still in the commitlog)? In this case how can I backup/restore data?

Do I need to backup the commitlogs as well and replay them during the server
start to restore the data in the commitlog files?

Thanks. 



Re: Cassandra backup question regarding commitlogs

2012-04-29 Thread Roshan
Tamar

Please don't jump into other users' discussions. If you want to ask about an
issue, please create a new thread.

Thanks. 




Re: Server Side Logic/Script - Triggers / StoreProc

2012-04-29 Thread Maxim Potekhin
About a year ago I started getting a strange feeling that
the noSQL community is busy re-creating RDBMS in minute detail.

Why did we bother in the first place?

Maxim



On 4/27/2012 6:49 PM, Data Craftsman wrote:
> Howdy,
>
> Some Polyglot Persistence (NoSQL) products have started supporting server-side
> scripting, similar to RDBMS stored procedures.
> E.g. Redis Lua scripting.
>
> I hope it is Python when Cassandra gets the server-side scripting feature.
>
> FYI,
>
> http://antirez.com/post/250
>
> http://nosql.mypopescu.com/post/19949274021/alchemydb-an-integrated-graphdb-rdbms-kv-store
>
> "server side scripting support is an extremely powerful tool. Having
> processing close to data (i.e. data locality) is a well known
> advantage, ..., it can open the doors to completely new features."
>
> Thanks,
>
> Charlie (@mujiang) 一个 木匠
> ===
> Data Architect Developer
> http://mujiang.blogspot.com
>
> On Sun, Apr 22, 2012 at 9:35 AM, Brian O'Neill  wrote:
>> Praveen,
>>
>> We are certainly interested. To get things moving we implemented an add-on
>> for Cassandra to demonstrate the viability (using AOP):
>> https://github.com/hmsonline/cassandra-triggers
>>
>> Right now the implementation executes triggers asynchronously, allowing you
>> to implement a java interface and plugin your own java class that will get
>> called for every insert.
>>
>> Per the discussion on 1311, we intend to extend our proof of concept to be
>> able to invoke scripts as well.  (minimally we'll enable javascript, but
>> we'll probably allow for ruby and groovy as well)
>>
>> -brian
>>
>> On Apr 22, 2012, at 12:23 PM, Praveen Baratam wrote:
>>
>> I found that Triggers are coming in Cassandra 1.2
>> (https://issues.apache.org/jira/browse/CASSANDRA-1311) but no mention of any
>> StoreProc like pattern.
>>
>> I know this has been discussed so many times but never met with
>> any initiative. Even Groovy was staged out of the trunk.
>>
>> Cassandra is great for logging and as such will be infinitely more useful if
>> some logic can be pushed into the Cassandra cluster, nearer to the location
>> of the data, to generate a materialized view useful for applications.
>>
>> Server Side Scripts/Routines in Distributed Databases could soon prove to be
>> the differentiating factor.
>>
>> Let me reiterate things with a use case.
>>
>> In our application we store time series data in wide rows with TTL set on
>> each point to prevent data from growing beyond acceptable limits. Still the
>> data size can be a limiting factor to move all of it from the cluster node
>> to the querying node and then to the application via thrift for processing
>> and presentation.
>>
>> Ideally we should process the data on the residing node and pass only the
>> materialized view of the data upstream. This should be trivial if Cassandra
>> implements some sort of server side scripting and CQL semantics to call it.
>>
>> Is anybody else interested in a similar feature? Is it being worked on? Are
>> there any alternative strategies to this problem?
>>
>> Praveen
>>
>>
>>
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile:215.588.6024
>> blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/
>>
>
>



Re: Building SSTables with SSTableSimpleUnsortedWriter

2012-04-29 Thread Benoit Perroud
A big buffer size will use more heap memory during creation of the tables.
Not sure about the impact on the server side, but it shouldn't be a big
difference. I personally use 512MB.
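
For reference, a minimal sketch of where that buffer size goes (based on the
1.0-era API; treat the exact signature as an assumption):

import java.io.File;
import java.nio.ByteBuffer;
import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.RandomPartitioner;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

public class BulkWrite {
    public static void main(String[] args) throws Exception {
        // The last argument is the buffer size in MB: the writer starts a new
        // SSTable each time roughly this much data has been buffered, so a
        // bigger buffer means fewer, larger SSTables (and more heap used).
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/bulk/Keyspace1/Standard1"), new RandomPartitioner(),
                "Keyspace1", "Standard1", AsciiType.instance, null, 512);

        long ts = System.currentTimeMillis() * 1000; // microsecond timestamps
        writer.newRow(ByteBuffer.wrap("key1".getBytes()));
        writer.addColumn(ByteBuffer.wrap("col".getBytes()),
                ByteBuffer.wrap("val".getBytes()), ts);
        writer.close();
    }
}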





2012/4/28 sj.climber :
> Can anyone comment on best practices for setting the buffer size used by
> SSTableSimpleUnsortedWriter?  I'm presently using 100MB, but for some of the
> larger column families I'm working with, this can result in hundreds of
> SSTables.  After streaming to Cassandra via sstableloader, there's a fair
> amount of compaction work to do.
>
> What are the benefits and consequences of going with a higher buffer size?
>
> Thanks!
>



-- 
sent from my Nokia 3210


Re: nodetool repair cassandra 0.8.4 HELP!!!

2012-04-29 Thread Raj N
I tried it on 1 column family. I believe there is a bug in 0.8* where
repair ignores the cf. I tried this multiple times on different nodes.
Every time the disk util was going up to 80% on a 500 GB disk. I would
eventually kill the repair. I only have 60GB worth data. I see this JIRA -

https://issues.apache.org/jira/browse/CASSANDRA-2324

But that says it was fixed in 0.8 beta. Is this still broken in 0.8.4?

I also don't understand why the data was inconsistent in the first place. I
read and write at LOCAL_QUORUM.

Thanks
-Raj

On Sun, Apr 29, 2012 at 2:06 AM, Watanabe Maki wrote:

> You should run repair. If the disk space is the problem, try to cleanup
> and major compact before repair.
> You can limit the streaming data by running repair for each column family
> separately.
>
> maki
>
> On 2012/04/28, at 23:47, Raj N  wrote:
>
> > I have a 6 node cassandra cluster DC1=3, DC2=3 with 60 GB data on each
> node. I was bulk loading data over the weekend. But we forgot to turn off
> the weekly nodetool repair job. As a result, repair was interfering when we
> were bulk loading data. I canceled repair by restarting the nodes. But
> unfortunately after the restart it looks like I don't have any data on those
> nodes when I use list on cassandra-cli. I ran repair on one of the affected
> nodes, but repair seems to be taking forever. Disk space has almost
> tripled. I stopped the repair again in fear of running out of disk space.
> After restart, the disk space is at 50% whereas the good nodes are at 25%.
> How should I proceed from here? When I run list on cassandra-cli I do see
> data on the affected node. But how can I be sure I have all the data?
> Should I run repair again? Should I clean up the disk by clearing snapshots?
> Or should I just drop the column families and bulk load the data again?
> >
> > Thanks
> > -Raj
>