deleting counters, we'd need to
revert those client and column family changes.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
settings
(typically trading pollution of page cache vs. number of I/O:s).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
of a different story and if you want to
test behavior when nodes go down I suggest including that. See
CASSANDRA-2540 and CASSANDRA-3927.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
FWIW, we're using openjdk7 on most of our clusters. For those where we
are still on openjdk6, it's not because of an issue - just haven't
gotten to rolling out the upgrade yet.
We haven't had any issues that I recall with upgrading the JDK.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote:
does nodetool cleanup perform a major compaction in the process of
removing unwanted data?
No.
of cassandra?
It's in the 1.1 branch; I don't remember if it went into a release
yet. If not, it'll be in the next 1.1.x release.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
a single sstable bigger
than what would normally happen, and it takes more total disk space
before it will be part of a compaction again.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
with slightly changed
workloads? It's very hard to blackbox-test GC settings, which is
probably why GC tuning can be perceived as a useless game of
whack-a-mole.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
the top of my head) for the resulting value to be correct is if the
later increment (N2 in this case) somehow includes N1 as well (e.g.,
because it was generated by first reading the current counter value).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
be safely retried. Cassandra counters are generally not
useful if *strict* correctness is desired, for this reason.
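To make the failure mode concrete, here is a hypothetical pycassa
sketch (the keyspace, CF and column names are all made up):

  import pycassa

  pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])
  counters = pycassa.ColumnFamily(pool, 'counters')  # a counter CF
  try:
      counters.add('page_views', 'total', 1)  # increment by 1
  except Exception:
      # The increment may or may not have been applied on some
      # replica; blindly retrying here can count the event twice.
      pass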
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
the causes of unpredictable behavior w.r.t.
GC by being careful about its memory allocation and *retention*
profile. For the specific case of avoiding *ever* seeing a full gc, it
gets even more complex.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sstable will effectively cover almost the entire range
(since you're effectively spraying random tokens at it, unless clients
are writing data in md5 order).
(Maybe it's different for ordered partitioning though.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
question is how often. But given the lack of handling of such failure
modes, the effect on clients is huge. Recommend data reads by default
to mitigate this and a slew of other sources of problems (and for
counter increments, we're rolling out least-active-request routing).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
it is
in action.
FWIW, J9's balanced collector is very similar to G1 in its design.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Our full gc:s are typically not very frequent. A few days or even weeks
in between, depending on the cluster.
*PER NODE* that is. On a cluster of hundreds of nodes, that's pretty
often (and all it takes is a single node).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
by inter-region pointers). If you can avoid that, one might
hope to avoid full gc:s altogether.
The jury is still out on my side; but like I said, I've seen promising
indications.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Has anyone tried running 1.1.1 on Java 7?
We have been running JDK 1.7 on several clusters on 1.1 for a while now.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
This problem is not new to 1.1.
On Sep 6, 2012 5:51 AM, Radim Kolar h...@filez.com wrote:
i would migrate to 1.0 because 1.1 is highly unstable.
is
not as compact as PostgreSQL. For example column names are duplicated
in each row, and the row key is duplicated twice (once in index, once
in data).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
I think that was clear from your post. I don't see a problem with your
process. Setting gc grace to 0 and forcing compaction should indeed
return you to the smallest possible on-disk size.
(But may be unsafe as documented; can cause deleted data to pop back up, etc.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
over all rows:
for row_id, row in your_column_family.get_range():
https://github.com/pycassa/pycassa
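A slightly fuller sketch, assuming the standard pycassa setup (keyspace
and CF names are placeholders):

  import pycassa

  pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])
  cf = pycassa.ColumnFamily(pool, 'mycf')
  # get_range() transparently pages through the entire CF in token order
  for row_id, row in cf.get_range():
      print row_id, row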
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
amounts of data? Large or many columns (or
both), etc. Essentially all working data that your request touches
is allocated on the heap and contributes to allocation rate and ParNew
frequency.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
I can recommend Jolokia highly for providing an HTTP/JSON interface to
JMX (it can be trivially run in agent mode by just altering JVM args):
http://www.jolokia.org/
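E.g., reading the JVM heap usage over its REST interface is a few lines
of Python (a hypothetical sketch; 8778 is the agent's usual default port):

  import json, urllib2

  url = ('http://localhost:8778/jolokia/read/'
         'java.lang:type=Memory/HeapMemoryUsage')
  print json.load(urllib2.urlopen(url))['value']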
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to disablegossip and make other nodes not send
requests to it. Disabling thrift would also be advised, or even
firewalling it prior to restart.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
You're almost certainly using a client that doesn't set TCP_NODELAY on
the thrift TCP socket. The Nagle algorithm is then enabled, leading to
200 ms of latency for each request, and thus 5 requests/second.
http://en.wikipedia.org/wiki/Nagle's_algorithm
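If you manage the socket yourself the fix is a one-liner; in a Python
client it would look something like this (a sketch; host/port are examples):

  import socket

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  # Disable Nagle so small thrift frames go out immediately
  s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
  s.connect(('localhost', 9160))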
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
remains
unknown as far as I can tell (I had previously hoped the root cause was
thread-unsafe shard merging, or one of the other counter-related
issues fixed during the 0.8 run).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
false positive rates:
https://issues.apache.org/jira/browse/CASSANDRA-3497
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
multithreaded_compaction: false
Set to true.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
retaining sstables and delaying
their unload/removal (index sampling/bloom filters filling your heap).
If it really only happens for leveled/snappy however, I don't know
what that might be caused by.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
sampling will be insignificant.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
other than what is driven by traffic, and I could imagine this
happening if read repair were turned completely off.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
is. iostat, top, nodetool
tpstats, nodetool netstats, nodetool compactionstats.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
bad latencies for a while
where you sometimes see consistently good latencies, that sounds
different but would hopefully be observable somehow.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
in the data
file, and not the actual data. The key cache will only ever eliminate
the I/O that would have been required to lookup the index entry; it
doesn't help to eliminate seeking to get the data (but as usual, it
may still be in the operating system page cache).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
, and pending racking up consistently. If
you're just close, you'll likely see spikes sometimes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
What is your total data size (nodetool info/nodetool ring) per node,
your heap size, and the amount of memory on the system?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
and have an effect on
latency.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
backing up constantly. If on the other hand
these are batch jobs where throughput is the concern, it's not
relevant.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
detects that you seem to be running low on heap.
Watch cassandra.log for messages to that effect (don't remember the
exact message right now).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
) :) ), and
depending on your replication factor might have no data (if replication were
1).
It's also incorrect for rack awareness if your topology is such that
the rack awareness changes ownership (see
https://issues.apache.org/jira/browse/CASSANDRA-3810).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
It is expected that the schema is replicated everywhere, but *data*
won't be in the DC with 0 replicas.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
for the keyspace?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Again, the *schema* gets propagated and the keyspace will exist
everywhere. You should just have exactly zero data for the
keyspace in the DC w/o replicas.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
with data loss prior to
the data making it elsewhere.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
. A good rule of thumb is that an individual node should be
able to handle your traffic while doing compaction; you don't want to
be in the position where you're only barely keeping up with the
traffic, and a node that starts compacting can no longer handle it.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
-write invariant.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
if you're almost *entirely* full and are trying to delete data
in a panic.
How far away are you from entirely full? Are you just worried about
the future or are you about to run out of disk space right now?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
compaction than size tiered compaction, but still
important.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
), it's not necessarily the best trade-off.
Consider the time it takes to do repairs, streaming, node start-up,
etc.
If it's only about CPU resources then bigger nodes probably make more
sense if the h/w is cost effective.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
into old-gen).
Experiment on a single node, making sure you're not causing too much
disk I/O by stealing memory otherwise used by page cache. Once you
have something that works you might try slowly going back down.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
if it's only within
a DC).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
even bother
spending time on this until you're on 1.0, unless this is about a
production cluster that you cannot upgrade for some reason.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
useful to try to select hardware such that you
have a greater number of smaller nodes than it is to focus on node
count (although once you start reaching the few-hundreds level
you're entering territory with less actual real-life production
testing).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
there are hints on the node, for other
nodes, that contain writes to a deleted column family.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
) and the strategy (assuming you are
using NetworkTopologyStrategy) will log selected endpoints, and
confirm that it's indeed picking endpoints that you think it should
based on getendpoints.
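For reference, that check looks something like this (keyspace, column
family and key are placeholders):

  nodetool -h localhost getendpoints <keyspace> <columnfamily> <key>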
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
in the replica set of your row has to be up.
Will the read repair happen automatically even if I read and write using the
consistency level ONE?
Yes, assuming it's turned on.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
settings to control the
rate of allocation generated during a compaction, even if you don't
need it for disk I/O purposes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
://wiki.apache.org/cassandra/LargeDataSetConsiderations - that bloom
filters + index sampling will be responsible for most memory used by a
node. Caching itself has minimal use on a large data set used for OLAP.
I added some information at the end.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Compaction should delete empty rows once gc_grace_seconds is passed, right?
Yes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
benchmarks ever published on the internet...
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
If I change endpoint_snitch from SimpleSnitch to PropertyFileSnitch,
does it require restart of cassandra on that node ?
Yes.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
will be completely unacceptable in some
circumstances. In others, perfectly acceptable due to the decrease in
memory use and fewer reads.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to be configurable IMO.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
than short-term
avoiding a fallback to full GC.
I suggest you describe exactly what the problem is you have and why
you think stopping compaction/repair is the appropriate solution.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
during compaction.
But really, in short: The easiest fix is probably to increase the heap
size. I know this e-mail doesn't begin to explain details but it's
such a long story.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
than the number of reads to the CF (unless you
happen to have exactly one sstable or no rows ever span sstables).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
One other thing to consider: are you creating a few very large rows? You
can check the min, max and average row size using nodetool cfstats.
Normally I agree, but assuming the two-node cluster has RF 2 it would
actually not matter ;)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
(assuming a non-saturated system) and would definitely
invite investigation IMO.
If you're willing to start iterating with the source code I'd start
bisecting down the call stack and see where it's happening.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to be in on the given node (I am assuming you're
not using leveled compaction). That is in addition to any imbalance
that might result from your population of data in the cluster.
Running repair can affect the live size, but *lack* of repair won't
cause a live size divergence.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
).
Your graph looks very unusual for CMS. It's possible that
everything is otherwise as it should be and CMS is just kicking in too
late, but I am kind of skeptical towards that given the extremely
smooth look of your graph.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to a
lower percentage, making the concurrent mark phase start earlier.
* Increase heap size significantly (probably not necessary based on
your graph, but for good measure).
If it then goes away, report back and we can perhaps figure out
details. There are other things that can be done.
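(For reference, the flags I mean are along the lines of
-XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly;
the value 60 is just an example starting point.)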
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
that in terms of old-gen.
Start with this in any case: Run Cassandra with -XX:+PrintGC
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
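In conf/cassandra-env.sh that would look something like this (the log
path is just an example):

  JVM_OPTS="$JVM_OPTS -XX:+PrintGC -XX:+PrintGCDetails"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"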
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
the JDK
provides no way to properly monitor for GC events within the Java
application. The GC inspector can miss a GC.
Also, the GC inspector only tells you time + type of GC; a GC log will
provide all sorts of details.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
catch up.
Whether or not non-zero is a problem depends on the Cassandra version,
how many concurrent compactors you are running, and your column
families/data sizes/flushing speeds etc. (Sorry, kind of a long story)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
specifically? I mean I can say that I've
run Cassandra on Debian Squeeze in production, but that doesn't really
help you ;)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
, you will see a
magnification by a factor of RF on the local statistics (in aggregate)
relative to the StorageProxy stats.
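(E.g., with RF=3, 1000 reads/s measured at the StorageProxy level would
show up as roughly 3000 reads/s summed across the per-CF read counts on
the replicas.)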
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
traffic to the cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
not be possible).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
is 5 nodes, BOP, RF=3, AMD Opteron 4174 CPU (6 x 2.3 GHz cores),
Gigabit Ethernet, RAID-0 SATA2 disks
For starters, what *is* the throughput? How many counter mutations are
you submitting per second?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
and not something I'm gonna be working
on in my spare time, so I don't know whether or when I would actually
work on that ticket (depends on priorities). I have the ideas but I
can't promise to fix it :)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
/O, it's not surprising at
all that repairing ~ 100 gigs of data takes more than 24 hours.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to fall back to full GC; usually the opposite is true.
However, a larger heap can make some of the non-full GC pauses longer,
depending. In either case, running with the above GC options will
give you specific information on GC pauses and should allow you to
rule that out (or not).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
the
maintenance like the repair.
OK. So what I'm getting at then is that there may be real, legitimate
connectivity problems that you aren't noticing in any other way since
you don't have active traffic to the cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
work for 0.8.6 I suspect (but no
guarantees). This also assumes you have no reads running against the
cluster.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
you're seeing should be
indicative that it really was considered Down by the node. You might
grep the log for references to the node in question (UP or DOWN) to
confirm. The question is why though. I would check if the node has
maybe automatically restarted, or went into full GC, etc.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Filed https://issues.apache.org/jira/browse/CASSANDRA-3569 to fix it
so that streams don't die due to conviction.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
by decommissioning node N.
(Unless I'm tripping myself up somewhere now...)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
show up as a
huge amount of pending ReadStage. iostat -x -k 1 should confirm it.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
I'm missing?
I don't know, but I am guessing this is just a poor UI issue. The CLI
is probably getting an Unavailable exception and just giving you
nothing instead of reporting a problem, but that is speculation on my
part.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
that is not part
of the replica set (and is not expected to have any data for the row)
cannot usefully participate in the read.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
No it's not just the cli tool, our app has the same issue coming back with
read issues.
You are supposed to not be able to read it. But you should be getting
a proper error, not an empty result.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
you can access cassandra data at any node and keys can be accessed at
random.
Including individual columns in a row.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
to enable safe use of
allocated memory without it being possible to use it to break out of
the JVM sandbox.
JNA or misc.unsafe allows explicit freeing (at the cost of application
bugs maybe segfaulting the JVM or causing other side-effects; i.e.,
breaking out of the managed runtime sandbox).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
of deleted entries until the last vacuum - improved in
modern versions)).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)