Re: DELETE does not delete :)

2013-10-07 Thread Michał Michalski

On 07.10.2013 08:02, Alexander Shutyaev wrote:

* We have not modified any *consistency settings* in our app, so I assume
we have the *default QUORUM* (2 out of 3 in our case) consistency *for
reads and writes*.


cqlsh uses ONE by default, pycassa uses ONE by default too. I have no 
experience with DataStax's Java driver, but I'd assume it uses ONE by 
default too.


A quick grep of the source confirms this:

./driver-core/src/main/java/com/datastax/driver/core/QueryOptions.java:
 * The default consistency level for queries: {@code ConsistencyLevel.ONE}.
public static final ConsistencyLevel DEFAULT_CONSISTENCY_LEVEL = ConsistencyLevel.ONE;
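
If you do want QUORUM with the Java driver, you have to set it explicitly. A minimal sketch, assuming a 2.x driver that exposes QueryOptions; the contact point, keyspace, table and key are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class QuorumExample {
    public static void main(String[] args) {
        // Cluster-wide default: every query uses QUORUM unless overridden.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withQueryOptions(new QueryOptions()
                        .setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        Session session = cluster.connect("my_keyspace");

        // A per-statement override is also possible.
        Statement stmt = new SimpleStatement("SELECT * FROM users WHERE id = 42");
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(stmt);

        cluster.close();
    }
}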




M.



Re: Disappearing index data.

2013-10-07 Thread Michał Michalski
I had a similar issue (reported many times here; there's also a JIRA 
issue, but the people reporting this problem were unable to reproduce it).


What I can say is that for me the solution was to run a major compaction 
on the index CF via JMX. To be clear - we're not talking about 
compacting the CF that IS indexed (your CF), but the internal 
Cassandra one that is responsible for storing the index data.


The MBean you should look for looks like this:

org.apache.cassandra.db:type=IndexColumnFamilies,keyspace=KS,columnfamily=CF.IDX
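
If you'd rather script it than click through jconsole, something like this should work - a rough sketch, assuming Cassandra 1.2's ColumnFamilyStoreMBean (where forceMajorCompaction takes no arguments) and the default JMX port; KS, CF and IDX are placeholders from the MBean name above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactIndexCF {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // The internal index CF, not the indexed CF itself.
            ObjectName indexCf = new ObjectName(
                    "org.apache.cassandra.db:type=IndexColumnFamilies,keyspace=KS,columnfamily=CF.IDX");
            // Trigger a major compaction on that index CF only.
            mbs.invoke(indexCf, "forceMajorCompaction", new Object[0], new String[0]);
        } finally {
            jmxc.close();
        }
    }
}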

M.

On 07.10.2013 15:22, Tom van den Berge wrote:

On a 2-node cluster with replication factor 2, I have a column family with
an index on one of the columns.

Every now and then, I notice that a lookup of the record through the index
on node 1 produces the record, but the same lookup on node 2 does not! If I
do a lookup by row key, the record is found, and the indexed value is there.


So as far as I can tell, the index on one of the nodes loses values, and
is no longer in sync with the other node, even though the replication
factor requires it. I typically repair these issues by storing the indexed
column value again.

The indexed data is static data; it doesn't change.

I'm running cassandra 1.2.3. I'm running a nodetool repair on each node
every day (although this does not fix this problem).

This problem worries me a lot. I don't have a clue about the cause of it.
Any help would be greatly appreciated.



Tom





Re: Cassandra Heap Size for data more than 1 TB

2013-10-03 Thread Michał Michalski
I was experimenting with 128 vs. 512 some time ago and I was unable to 
see any difference in terms of performance. I'd probably check 1024 too, 
but we migrated to 1.2 and heap space was not an issue anymore.


M.

On 02.10.2013 16:32, srmore wrote:

I changed my index_interval from 128 to 512; does it
make sense to increase it more than this?


On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote:


Have a look at index_interval.

Cem.


On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote:


The version of Cassandra I am using is 1.0.11; we are migrating to 1.2.X,
though. We had tuned the bloom filters (0.1) and AFAIK making it lower than
this won't matter.

Thanks !


On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:


Which Cassandra version are you on? Essentially heap size is a function of
the number of keys/metadata. In Cassandra 1.2 a lot of the metadata, like bloom
filters, was moved off heap.


On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote:


Does anyone know what would roughly be the heap size for Cassandra with
1 TB of data? We started with about 200 G and now one of the nodes is
already at 1 TB. We were using 8 G of heap and that served us well up
until we reached 700 G, where we started seeing failures and nodes flipping.

With 1 TB of data the node refuses to come back due to lack of memory.
Needless to say, repairs and compactions take a lot of time. We upped the
heap from 8 G to 12 G and suddenly everything started moving rapidly, i.e.
the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we
started seeing the same symptoms as we were seeing with 8 G.

So my question is: how do I determine the optimal size of heap
for data around 1 TB?

Following are some of my JVM settings

-Xms8G
-Xmx8G
-Xmn800m
-XX:NewSize=1200M
-XX:MaxTenuringThreshold=2
-XX:SurvivorRatio=4

Thanks !














Re: Cassandra Heap Size for data more than 1 TB

2013-10-03 Thread Michał Michalski
Currently we have 480-520 GB of data per node, so it's not even close to 
1 TB, but I'd bet that reaching 700-800 GB shouldn't be a problem in terms 
of everyday performance - heap usage is quite low, no GC issues etc. 
(To give you a comparison: when running 1.1 with ~300-400 GB per 
node we had a huge problem with bloom filters and heap space, so we had 
to bump it to 12-16 GB; on 1.2 it's not an issue anymore.)


However, our main concern is the time we'd need to rebuild a broken 
node, so we are going to extend the cluster soon to avoid such problems 
and keep our nodes about 50% smaller.


M.


On 03.10.2013 15:02, srmore wrote:

Thanks Mohit and Michael,
That's what I thought. I have tried all the avenues, will give ParNew a
try. With the 1.0.xx I have issues when data sizes go up, hopefully that
will not be the case with 1.2.

Just curious, has anyone tried 1.2 with large data set, around 1 TB ?


Thanks !


On Thu, Oct 3, 2013 at 7:20 AM, Michał Michalski mich...@opera.com wrote:


I was experimenting with 128 vs. 512 some time ago and I was unable to see
any difference in terms of performance. I'd probably check 1024 too, but we
migrated to 1.2 and heap space was not an issue anymore.

M.















Re: Recommended hardware

2013-09-24 Thread Michał Michalski

Hi Tim,

Not sure if you've seen this, but I'd start from DataStax's documentation:

http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningAbout_c.html?pagename=docsversion=1.2file=cluster_architecture/cluster_planning

Taking a look at the mailing list archive might be useful too.

M.

On 23.09.2013 18:17, Tim Dunphy wrote:

Hello,

I am running Cassandra 2.0 on a VM with 2 GB of memory and a 10 GB HD in a virtual cloud 
environment. It's supporting a PHP application running on the same node.

Mostly this instance runs smoothly, but it runs low on memory. Depending on how 
much the site is used, the VM will swap, sometimes excessively.

I realize this setup may not be enough to support a cassandra instance.

I was wondering if there were any recommended hardware specs someone could 
point me to for both physical and virtual (cloud) type environments.

Thank you,
Tim
Sent from my iPhone





Re: Row size in cfstats vs cfhistograms

2013-09-19 Thread Michał Michalski
I believe the reason is that cfhistograms tells you about the sizes of 
the rows returned by a given node in response to read requests, while 
cfstats tracks the largest row stored on that node.


M.

On 19.09.2013 11:31, Rene Kochen wrote:

Hi all,

I use Cassandra 1.0.11

If I do cfstats for a particular column family, I see a Compacted row
maximum size of 43388628

However, when I do a cfhistograms I do not see such a big row in the Row
Size column. The biggest row there is 126934.

Can someone explain this?

Thanks!

Rene





Re: Why don't you start off with a “single small” Cassandra server as you usually do it with MySQL?

2013-09-18 Thread Michał Michalski

You might be interested in this:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3ccaeqobhpav25pcgjfwbkmd1rzxvrif94e6lpybpj3mu_bqn9...@mail.gmail.com%3E

M.

On 18.09.2013 15:34, Ertio Lew wrote:

For any website just starting out, the load is minimal and grows at a slow
pace initially. People usually start their MySQL-based sites with a single
server (and that too a VPS, not a dedicated server) running as both the app
server and the DB server, usually get quite far with this setup, and only as
they feel the need do they separate the DB from the app server, giving it a
separate VPS. This is what a startup expects things to be like when planning
resource procurement.

But so far, what I have seen with Cassandra is something very different.
People usually recommend starting out with at least a 3-node cluster (on
dedicated servers) with lots and lots of RAM. 4 GB or 8 GB RAM is what they
suggest to start with. So is it that Cassandra requires more hardware
resources in comparison to MySQL for a website to deliver similar
performance and serve a similar load/traffic and the same amount of data? I
understand the higher storage requirements of Cassandra due to
replication, but what about other hardware resources?

Can't we start off with Cassandra-based apps just like with MySQL, starting with
1 or 2 VPSes and adding more whenever there's a need?

I don't want to compare apples with oranges. I just want to know how much
more dangerous a situation I may be in when I start out with a single-node
VPS-based Cassandra installation vs. a single-node VPS-based MySQL
installation - the difference between these two situations. Are Cassandra
servers more prone to being unavailable than MySQL servers? What is bad about
putting Tomcat on the same box as Cassandra, the way people use a LAMP stack
on a single server?

-


*This question is also posted at StackOverflow
(http://stackoverflow.com/questions/18462530/why-dont-you-start-off-with-a-single-small-cassandra-server-as-you-usually)
and has an open bounty worth +50 rep.*





Re: cassandra disk access

2013-08-07 Thread Michał Michalski



2. When Cassandra looks up a key in an SSTable (assuming the bloom filter and
other stuff failed, and also assuming the key is located in this single
SSTable), Cassandra DOES NOT USE sequential I/O. It will probably read the
hash-table slot or a similar structure, then Cassandra will do another disk
seek in order to get the value (and probably the key). There will probably
also be another seek needed, and if there is a key collision there will be
additional seeks.


It will use the Index Sample (RAM) first, then it will use the full Index 
(disk), and finally it will read the data from the SSTable (disk). There's no 
such thing as a collision in this case.
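
To make the seek sequence concrete, here's a purely illustrative sketch of that lookup order - not Cassandra's actual code; the types are invented placeholders:

// Invented placeholder types, only to illustrate the lookup order described above.
interface IndexSample { long floorEntryOffset(byte[] key); }                   // held in RAM
interface IndexFile   { long scanBlockForKey(long blockOffset, byte[] key); }  // on disk, seek #1
interface DataFile    { byte[] readRowAt(long offset); }                       // on disk, seek #2

class SSTableLookup {
    static byte[] locateRow(byte[] key, IndexSample sample, IndexFile index, DataFile data) {
        long blockOffset = sample.floorEntryOffset(key);            // binary search, no disk I/O
        long dataOffset = index.scanBlockForKey(blockOffset, key);  // scan one small index block
        return data.readRowAt(dataOffset);                          // sequential read of the row
    }
}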



3. Once the data (e.g. the row) is located, a sequential read of the entire
row will occur. (Once again I assume there is a single, well-compacted
SSTable.) Also, if the disk is not fragmented, the data will be placed on disk
sectors one after the other.


Yes, this is how I understand it too.

M.



Re: cassandra disk access

2013-08-07 Thread Michał Michalski
I'm not sure how accurate it is (it's from 2011, one of its sources is 
from 2010), but I'm pretty sure it's more or less OK:


http://blog.csdn.net/firecoder/article/details/7019435

M.

On 07.08.2013 10:34, Nikolay Mihaylov wrote:

thanks

It will use the Index Sample (RAM) first, then it will use full Index
(disk) and finally it will read data from SSTable (disk). There's no such
thing like collision in this case.

so it still has 2 seeks :)

where can I see the internal structure of the SSTable? I tried to find it
documented but was unable to find anything.












Re: memtable overhead

2013-07-23 Thread Michał Michalski
Not sure how up-to-date this info is, but from some discussions that 
happened here a long time ago I remember that a minimum of 1 MB per 
Memtable needs to be allocated.


The other constraint here is the memtable_total_space_in_mb setting in 
cassandra.yaml, which you might wish to tune when you have a lot of CFs.
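
If you want to see how much data a given CF's memtable is actually holding, you can read it over JMX. A rough sketch, assuming the 1.x ColumnFamilyStoreMBean exposes a MemtableDataSize attribute and the default JMX port; KS and CF are placeholders:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MemtableSize {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // KS and CF are placeholders for your keyspace and column family names.
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=KS,columnfamily=CF");
            Long liveBytes = (Long) mbs.getAttribute(cf, "MemtableDataSize");
            System.out.println("Memtable data size: " + liveBytes + " bytes");
        } finally {
            jmxc.close();
        }
    }
}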


M.

On 23.07.2013 07:12, Darren Smythe wrote:

The way we've gone about our data models has resulted in lots of column
families, and I'm just looking for guidelines on how much space each column
family adds.

TIA


On Sun, Jul 21, 2013 at 11:19 PM, Darren Smythe darren1...@gmail.com wrote:


Hi,

How much overhead (in heap MB) does an empty memtable use? If I have many
column families that aren't written to often, how much memory do these take
up?

TIA

-- Darren







Re: Cassandra 2 vs Java 1.6

2013-07-22 Thread Michał Michalski
I believe it won't run on 1.6. Java 1.7 is required to compile C* 2.0+, 
and once it's built you cannot run it using Java 1.6 (this is what the 
Unsupported major.minor version error tells you; class file version 50 
is Java 1.6 and 51 is 1.7).
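
If you want to double-check which class-file version a build targets, the first bytes of any .class file tell you - a small, generic sketch (pass it the path to a .class file extracted from the jar):

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ClassVersion {
    public static void main(String[] args) throws Exception {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            int magic = in.readInt();            // always 0xCAFEBABE
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort();  // 50 = Java 6, 51 = Java 7
            System.out.printf("magic=%08x major=%d minor=%d%n", magic, major, minor);
        } finally {
            in.close();
        }
    }
}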


M.

On 22.07.2013 10:06, Andrew Cobley wrote:

I know it was decided to drop the requirement for Java 1.6 for Cassandra some time 
ago, but my question is: should 2.0.0-beta1 
(http://www.apache.org/dyn/closer.cgi?path=/cassandra/2.0.0/apache-cassandra-2.0.0-beta1-bin.tar.gz)
 run under Java 1.6 at all? I tried and got the following error:


macaroon:bin administrator$ Exception in thread main 
java.lang.UnsupportedClassVersionError: org/apache/cassandra/service/CassandraDaemon : 
Unsupported major.minor version 51.0
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

macaroon:bin administrator$ java -version
java version 1.6.0_43
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-10M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
macaroon:bin administrator$

It's fine by me if that's the case!

Andy


The University of Dundee is a registered Scottish Charity, No: SC015096





Re: is there a key to sstable index file?

2013-07-18 Thread Michał Michalski
SSTables are immutable - once they're written to disk, they cannot be 
changed.


On read C* checks *all* SSTables [1], but to make it faster it uses 
Bloom Filters, which can tell you if a row is *not* in a specific 
SSTable, so you don't have to read it at all. However, *if* you do have 
to read it, you don't read the whole SSTable - there's an in-memory 
Index Sample that is used for a binary search, returning only a 
(relatively) small block of the real (full, on-disk) index, which you 
then scan to find the place in the SSTable to retrieve the data from. 
Additionally, you have a KeyCache to make reads faster - it points to 
the location of the data in the SSTable, so you don't have to touch the 
Index Sample and Index at all.


Once C* retrieves all data parts (including the Memtable part), 
timestamps are used to find the most recent version of data.
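
Purely as an illustration of that last step (invented types, not Cassandra's code), the reconciliation rule boils down to something like:

// Hypothetical sketch of the last-write-wins reconciliation described above.
final class ColumnFragment {
    final byte[] value;
    final long timestampMicros; // write timestamp supplied with the mutation

    ColumnFragment(byte[] value, long timestampMicros) {
        this.value = value;
        this.timestampMicros = timestampMicros;
    }

    // When the same column is found in several SSTables/Memtables,
    // the fragment with the highest timestamp wins.
    static ColumnFragment reconcile(ColumnFragment a, ColumnFragment b) {
        return a.timestampMicros >= b.timestampMicros ? a : b;
    }
}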


[1] I believe that it's not true for all cases, as I saw a piece of code 
somewhere in the source, that starts checking SSTables in order from the 
newest to the oldest one (in terms of data timestamps - AFAIR SSTable 
MetaData stores info about smallest and largest timestamp in SSTable), 
and once the newest data for all columns are retrieved (assuming that 
schema is defined), retrieving data stops and older SSTables are not 
checked. If someone could confirm that it works this way and it's not 
something that I saw in my dream and now believe it's real, I'd be glad ;-)


On 17.07.2013 22:58, S Ahmed wrote:

Since SSTables are mutable, and they are ordered, does this mean that there
is an index of key ranges that each SSTable holds, and the value could be in 1
or more sstables that have to be scanned and then the latest one is chosen?

e.g. Say I write a value abc to CF1.  This gets stored in a sstable.

Then I write def to CF1, this gets stored in another sstable eventually.

So when I go to fetch the value, it has to scan 2 sstables and then figure
out which is the latest entry, correct?

So is there an index of keys to sstables, and can there be 1 or more
sstables per key?

(This is assuming compaction hasn't occurred yet).





Re: is there a key to sstable index file?

2013-07-18 Thread Michał Michalski

Thanks! :-)

M.

On 18.07.2013 08:42, Jean-Armel Luce wrote:

@Michal: look at this for the improvement of read performance:
https://issues.apache.org/jira/browse/CASSANDRA-2498

Best regards.
Jean Armel











Re: manually removing sstable

2013-07-17 Thread Michał Michalski

Hi Aaron,

 * Tombstones will only be purged if all fragments of a row are in the 
SStable(s) being compacted.


To my knowledge, that's not necessarily true. In a specific case this 
patch comes into play:


https://issues.apache.org/jira/browse/CASSANDRA-4671

We could, however, purge tombstones if we know that the non-compacted 
sstables don't have any info that is older than the tombstones we're 
about to purge (since then we know that the tombstones we'll consider 
can't delete data in the non-compacted sstables).
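
As a hedged sketch of that idea (invented names, not the actual CASSANDRA-4671 code): a tombstone from the SSTables being compacted can only be dropped if gc_grace has passed and no overlapping SSTable outside the compaction might still hold data older than it.

import java.util.List;

class TombstonePurgeCheck {
    // tombstoneTimestamp:       write timestamp of the tombstone (microseconds).
    // localDeletionTime:        when the delete happened, seconds since the epoch.
    // gcBefore:                 "now" minus gc_grace_seconds, seconds since the epoch.
    // overlappingMinTimestamps: min data timestamp of each SSTable that overlaps
    //                           the row but is NOT part of this compaction.
    static boolean canPurge(long tombstoneTimestamp, int localDeletionTime, int gcBefore,
                            List<Long> overlappingMinTimestamps) {
        if (localDeletionTime >= gcBefore) {
            return false; // gc_grace has not elapsed yet
        }
        for (long minTimestamp : overlappingMinTimestamps) {
            if (minTimestamp <= tombstoneTimestamp) {
                return false; // older data may still exist outside this compaction
            }
        }
        return true;
    }
}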


M.

On 12.07.2013 10:25, aaron morton wrote:

That sounds sane to me. Couple of caveats:

* Remember that Expiring Columns turn into Tombstones and can only be purged 
after TTL and gc_grace.
* Tombstones will only be purged if all fragments of a row are in the 
SStable(s) being compacted.

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/07/2013, at 10:17 PM, Theo Hultberg t...@iconara.net wrote:


a colleague of mine came up with an alternative solution that also seems to 
work, and I'd just like your opinion on if it's sound.

we run find to list all old sstables, and then use cmdline-jmxclient to run the 
forceUserDefinedCompaction function on each of them, this is roughly what we do 
(but with find and xargs to orchestrate it)

   java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 
org.apache.cassandra.db:type=CompactionManager 
forceUserDefinedCompaction=the_keyspace,db_file_name

the downside is that c* needs to read the file and do disk io, but the upside 
is that it doesn't require a restart. c* does a little more work, but we can 
schedule that during off-peak hours. another upside is that it feels like we're 
pretty safe from screwups, we won't accidentally remove an sstable with live 
data, the worst case is that we ask c* to compact an sstable with live data and 
end up with an identical sstable.

if anyone else wants to do the same thing, this is the full cron command:

0 4 * * * find /path/to/cassandra/data/the_keyspace_name -maxdepth 1 -type f -name 
'*-Data.db' -mtime +8 -printf 
forceUserDefinedCompaction=the_keyspace_name,\%P\n | xargs -t 
--no-run-if-empty java -jar /usr/local/share/java/cmdline-jmxclient-0.10.3.jar - 
localhost:7199 org.apache.cassandra.db:type=CompactionManager

just change the keyspace name and the path to the data directory.

T#


On Thu, Jul 11, 2013 at 7:09 AM, Theo Hultberg t...@iconara.net wrote:
thanks a lot. I can confirm that it solved our problem too.

looks like the C* 2.0 feature is perfect for us.

T#


On Wed, Jul 10, 2013 at 7:28 PM, Marcus Eriksson krum...@gmail.com wrote:
yep that works, you need to remove all components of the sstable though, not 
just -Data.db

and, in 2.0 there is this:
https://issues.apache.org/jira/browse/CASSANDRA-5228

/Marcus


On Wed, Jul 10, 2013 at 2:09 PM, Theo Hultberg t...@iconara.net wrote:
Hi,

I think I remember reading that if you have sstables that you know contain only 
data whose ttl has expired, it's safe to remove them manually by stopping 
c*, removing the *-Data.db files and then starting up c* again. is this correct?

we have a cluster where everything is written with a ttl, and sometimes c* 
needs to compact over 100 gb of sstables where we know everything has expired, and 
we'd rather just manually get rid of those.

T#










Re: Deletion use more space.

2013-07-16 Thread Michał Michalski
Deletion does not really remove data; it adds tombstones (deletion 
markers). They'll later be merged with the existing data during 
compaction and - in the end (see: gc_grace_seconds) - removed, but 
until then they'll take up some space.


http://wiki.apache.org/cassandra/DistributedDeletes

M.

On 16.07.2013 11:46, 杨辉强 wrote:

Hi, all:
   I use Cassandra 1.2.4, I have a 4-node ring, and I use the byte-ordered partitioner.
   I inserted about 200G of data into the ring over the previous days.

   Today I wrote a program to scan the ring and, at the same time, delete 
the items that are scanned.
   To my surprise, Cassandra is using more disk space.

Can anybody tell me why? Thanks.





Re: too many open files

2013-07-15 Thread Michał Michalski
A file name ending in ic-### doesn't tell you anything, except 
pointing out the SSTable version it uses (ic in this case).


Files related to a secondary index contain something like this in the 
filename: KS-CF.IDX-NAME, while filenames of regular CFs do not contain 
any dots except the one just before the file extension.
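
In other words (a trivial, hypothetical helper reflecting that naming convention, not Cassandra code):

class SSTableNames {
    // Secondary-index SSTables carry a dot in KS-CF.IDX-..., regular CFs only
    // have the dot of the ".db" extension.
    static boolean isSecondaryIndexSSTable(String fileName) {
        String withoutExtension = fileName.substring(0, fileName.lastIndexOf('.'));
        return withoutExtension.contains(".");
    }
}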


M.

On 15.07.2013 09:38, Paul Ingalls wrote:

Also, looking through the log, it appears a lot of the files end with ic- 
which I assume is associated with a secondary index I have on the table.  Are 
secondary indexes really expensive from a file descriptor standpoint?  That 
particular table uses the default compaction scheme...

On Jul 15, 2013, at 12:00 AM, Paul Ingalls paulinga...@gmail.com wrote:


I have one table that is using leveled.  It was set to 10MB, I will try 
changing it to 256MB.  Is there a good way to merge the existing sstables?

On Jul 14, 2013, at 5:32 PM, Jonathan Haddad j...@jonhaddad.com wrote:


Are you using leveled compaction?  If so, what do you have the file size set 
at?  If you're using the defaults, you'll have a ton of really small files.  I 
believe Albert Tobey recommended using 256MB for the table sstable_size_in_mb 
to avoid this problem.


On Sun, Jul 14, 2013 at 5:10 PM, Paul Ingalls paulinga...@gmail.com wrote:
I'm running into a problem where instances of my cluster are hitting over 450K 
open files.  Is this normal for a 4 node 1.2.6 cluster with replication factor 
of 3 and about 50GB of data on each node?  I can push the file descriptor limit 
up, but I plan on having a much larger load so I'm wondering if I should be 
looking at something else….

Let me know if you need more info…

Paul





--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade









Re: Restart node = hinted handoff flood

2013-07-05 Thread Michał Michalski

My blind guess is: https://issues.apache.org/jira/browse/CASSANDRA-5179

In our case the only sensible solution was to pause hint delivery and 
disable storing them (both done with nodetool: pausehandoff and 
disablehandoff). Once they TTL'd (3 hours by default, I believe?) I 
turned HH on again and started to repair. However, the problem returned 
the next day, so I had to do a quick C* upgrade to a version with this 
patch applied (we use a self-built 1.2.1 with a few additional patches 
applied).


M.

On 04.07.2013 18:41, Alain RODRIGUEZ wrote:

The point is that there is no way, afaik, to limit the speed of these
Hinted Handoffs, since it's not a stream like repair or bootstrap. There is
also no way to keep the node out of the ring while it is receiving
hints, since hints and normal traffic both go through the gossip protocol on
port 7000.

How can we avoid this Hinted Handoff flood on returning nodes?

Alain


2013/7/4 Alain RODRIGUEZ arodr...@gmail.com


Hi,

Using C*1.2.2 12 EC2 xLarge cluster.

When I restart a node, if it spends a few minutes down, then when I bring it up
all the CPUs are blocked at 100%, even once compactions are disabled,
inducing a very big and intolerable latency in my app. I suspect Hinted
Handoff to be the cause of this. Disabling gossip fixes the problem; enabling
it again brings the latency back (with a lot of GC, dropped messages...).

Is there a way to disable HH ? Are they responsible for this issue ?

I currently have this node down, any fast insight would be appreciated.

Alain







Re: CorruptBlockException - recover?

2013-07-05 Thread Michał Michalski
I think I'd try removing the broken SSTables (while the node is down) and 
then running repair.


M.

On 05.07.2013 09:10, Jan Kesten wrote:

Hi,

I tried to scrub the keyspace - but with no success either; the process
threw an exception when hitting the corrupt block and then stopped. I
will re-bootstrap the node :-)

Thanks anyways,
Jan

On 03.07.2013 19:10, Glenn Thompson wrote:

For what it's worth, I did this when I had this problem. It didn't
work out for me. Perhaps I did something wrong.


On Wed, Jul 3, 2013 at 11:06 AM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Jul 3, 2013 at 7:04 AM, ifjke j.kes...@enercast.de wrote:

I found that one of my cassandra nodes died recently (the machine
hangs). I restarted the node and ran a nodetool repair; while
running, it threw an
org.apache.cassandra.io.compress.CorruptBlockException.
Is there any way to recover from this? Or would it be best to
delete the node's contents and bootstrap it again?


If you scrub this SSTable (either with the online or offline
version of scrub) it will remove the corrupt data and re-write
the rest of the SSTable which isn't corrupt into a new SSTable.
That is probably safer for your data than deleting the entire set
of data on this replica. When that's done, restart the repair.

=Rob









Re: going down from RF=3 to RF=2, repair constantly falls over with JVM OOM

2013-07-04 Thread Michał Michalski
I don't think you need to run repair if you decrease RF. At least I 
wouldn't do it.


In the case of *decreasing* RF you have 3 nodes containing some data, but only 2 
of them should store it from now on, so you should run cleanup rather than 
repair, to get rid of the data on the 3rd replica. And I guess it 
should work (in terms of disk space and memory) if you've been able to 
perform compaction.


Repair makes sense if you *increase* RF, so that the data is streamed to the 
new replicas.


M.


On 04.07.2013 12:20, Evan Dandrea wrote:

Hi,

We've made the mistake of letting our nodes get too large, now holding
about 3TB each. We ran out of enough free space to have a successful
compaction, and because we're on 1.0.7, enabling compression to get
out of the mess wasn't feasible. We tried adding another node, but we
think this may have put too much pressure on the existing ones it was
replicating from, so we backed out.

So we decided to drop RF down to 2 from 3 to relieve the disk pressure
and started building a secondary cluster with lots of 1 TB nodes. We
ran repair -pr on each node, but it’s failing with a JVM OOM on one
node while another node is streaming from it for the final repair.

Does anyone know what we can tune to get the cluster stable enough to
put it in a multi-dc setup with the secondary cluster? Do we actually
need to wait for these RF3-RF2 repairs to stabilize, or could we
point it at the secondary cluster without worry of data loss?

We’ve set the heap on these two problematic nodes to 20GB, up from the
equally too high 12GB, but we’re still hitting OOM. I had seen in
other threads that tuning down compaction might help, so we’re trying
the following:

in_memory_compaction_limit_in_mb 32 (down from 64)
compaction_throughput_mb_per_sec 8 (down from 16)
concurrent_compactors 2 (the nodes have 24 cores)
flush_largest_memtables_at 0.45 (down from 0.50)
stream_throughput_outbound_megabits_per_sec 300 (down from 400)
reduce_cache_sizes_at 0.5 (down from 0.6)
reduce_cache_capacity_to 0.35 (down from 0.4)

-XX:CMSInitiatingOccupancyFraction=30

Here’s the log from the most recent repair failure:

http://paste.ubuntu.com/5843017/

The OOM starts at line 13401.

Thanks for whatever insight you can provide.