Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
I will give adding the logging a shot. I've tried some experiments and I have no clue what could be happening anymore: I tried setting all nodes but one to a streamthroughput of 1, to see if somehow it was getting overloaded by too many streams coming in at once; nope. I went through the source

Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
I've been trying to go through the logs but I can't say I understand the details very well: INFO [SlabPoolCleaner] 2014-10-30 19:20:18,446 ColumnFamilyStore.java:856 - Enqueuing flush of loc: 7977119 (1%) on-heap, 0 (0%) off-heap DEBUG [SharedPool-Worker-22] 2014-10-30 19:20:18,446

Re: OOM at Bootstrap Time

2014-10-30 Thread Maxime
Well, the answer was secondary indexes. I am guessing they were corrupted somehow. I dropped all of them, ran cleanup, and now nodes are bootstrapping fine. On Thu, Oct 30, 2014 at 3:50 PM, Maxime maxim...@gmail.com wrote: I've been trying to go through the logs but I can't say I understand very

Re: OOM at Bootstrap Time

2014-10-29 Thread DuyHai Doan
Some ideas: 1) Put on DEBUG log on the joining node to see what is going on in details with the stream with 1500 files 2) Check the stream ID to see whether it's a new stream or an old one pending On Wed, Oct 29, 2014 at 2:21 AM, Maxime maxim...@gmail.com wrote: Doan, thanks for the tip, I

Re: OOM at Bootstrap Time

2014-10-28 Thread Maxime
Doan, thanks for the tip, I just read about it this morning, just waiting for the new version to pop up on the debian datastax repo. Michael, I do believe you are correct in the general running of the cluster and I've reset everything. So it took me a while to reply, I finally got the SSTables

Re: OOM at Bootstrap Time

2014-10-27 Thread Laing, Michael
Again, from our experience w 2.0.x: Revert to the defaults - you are manually setting heap way too high IMHO. On our small nodes we tried LCS - way too much compaction - switch all CFs to STCS. We do a major rolling compaction on our small nodes weekly during less busy hours - works great. Be

Re: OOM at Bootstrap Time

2014-10-27 Thread DuyHai Doan
Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily. -- You can try the new DateTiered compaction strategy (https://issues.apache.org/jira/browse/CASSANDRA-6602) released in 2.1.1 if you have a time series data model to eliminate

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Hello Maxime, can you put the complete logs and config somewhere? It would be interesting to know what is the cause of the OOM. On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote: Thanks a lot that is comforting. We are also small at the moment so I definitely can relate with

Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
I've emailed you a raw log file of an instance of this happening. I've been monitoring more closely the timing of events in tpstats and the logs and I believe this is what is happening: - For some reason, C* decides to provoke a flush storm (I say some reason, I'm sure there is one but I have

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Hello Maxime Increasing the flush writers won't help if your disk I/O is not keeping up. I've had a look into the log file, below are some remarks: 1) There are a lot of SSTables on disk for some tables (events for example, but not only). I've seen that some compactions are taking up to 32

Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
Thank you very much for your reply. This is a deeper interpretation of the logs than I can do at the moment. Regarding 2) it's a good assumption on your part but in this case, non-obviously, the loc table's primary key is actually not id; the schema changed historically which has led to this odd

Re: OOM at Bootstrap Time

2014-10-26 Thread Jonathan Haddad
If the issue is related to I/O, you're going to want to determine if you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is. On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan
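
For anyone who wants to watch those two numbers over time instead of eyeballing the terminal, here is a minimal sketch of pulling avgqu-sz and svctm out of one `iostat -dmx` report. It is not from the thread: the function name `parse_iostat` and the `SAMPLE` text (including every value in it) are made up for illustration, and the column layout assumes an older sysstat release; newer releases may rename or drop these columns, in which case the function simply returns whatever it can find.

```python
def parse_iostat(text, columns=("avgqu-sz", "svctm")):
    """Return {device: {column: value}} for one `iostat -dmx` report.

    Hypothetical helper: columns are looked up by header name, so the
    exact column order of the installed sysstat version does not matter.
    """
    header = None
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The header row starts with "Device:" (or "Device" on some versions).
        if fields[0].rstrip(":") == "Device":
            header = fields
            continue
        # Skip the banner line and anything that does not line up with the header.
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header[1:], fields[1:]))
        try:
            result[fields[0]] = {c: float(row[c]) for c in columns if c in row}
        except ValueError:
            continue  # non-numeric row, ignore it
    return result


# Illustrative input only: these numbers are invented, not taken from any real node.
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    12.00   85.00  140.00     4.10     9.80   126.50    38.20  170.30  150.10  182.60   4.40  99.00
"""

if __name__ == "__main__":
    for device, stats in parse_iostat(SAMPLE).items():
        print(device, stats)
```

Feeding it the text of successive reports and comparing the queue size during a flush storm against a quiet period would show whether the disk is the bottleneck.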

Re: OOM at Bootstrap Time

2014-10-26 Thread DuyHai Doan
Should doing a major compaction on those nodes lead to a restructuring of the SSTables? -- Beware of the major compaction on SizeTiered, it will create 2 giant SSTables and the expired/outdated/tombstone columns in this big file will never be cleaned since the SSTable will never get a chance to

Re: OOM at Bootstrap Time

2014-10-26 Thread Maxime
Hmm, thanks for the reading. I initially followed some (perhaps too old) maintenance scripts, which included weekly 'nodetool compact'. Is there a way for me to undo the damage? Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily.

OOM at Bootstrap Time

2014-10-25 Thread Maxime
Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now. I started by adding a node similar to my current configuration, 4 GB of RAM + 2 cores on DigitalOcean. However every time, I would end up getting OOM errors after many log entries of the type: INFO

Re: OOM at Bootstrap Time

2014-10-25 Thread Laing, Michael
Since no one else has stepped in... We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance storage. It works fine but you can see those little puppies struggle... And I ran into problems such as you

Re: OOM at Bootstrap Time

2014-10-25 Thread Maxime
Thanks a lot that is comforting. We are also small at the moment so I definitely can relate with the idea of keeping small and simple at a level where it just works. I see the new Apache version has a lot of fixes so I will try to upgrade before I look into downgrading. On Saturday, October 25,