you're not bottlenecking there?
(I have no idea how fast phpcassa is.)
--
/ Peter Schuller
not be
causing the problems you are seeing, in general. Especially not with a
5 gb heap size. I think it is highly likely that there is some
little detail/mistake going on here rather than a fundamental issue.
But regardless, it would be nice to discover what.
--
/ Peter Schuller
a
future start-up to never trigger bootstrapping.)
--
/ Peter Schuller
up a cluster, you would
not normally want to have a node join the ring without auto_bootstrap
enabled since it will serve requests yet be void of any data.
--
/ Peter Schuller
would probably be required
under real workloads of different kinds to determine whether G1 seems
suitable in its current condition.
--
/ Peter Schuller
a large key cache and row
cache I fully expect there to be issues. A large LRU is exactly the
type of workload where I seem to consistently break it...
--
/ Peter Schuller
not realistic.
--
/ Peter Schuller
Also:
* What is the frequency of the pauses? Are we talking every few
seconds, minutes, hours, or days?
* If you, say, decrease the load to 25%: are you seeing the same
effect but at 1/4th the frequency, or does it remain unchanged, or
does the problem go away completely?
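To make the first question concrete, here is a sketch of extracting long-pause frequency from a JVM GC log (the log lines, timestamps, and pause durations below are invented for illustration; enable GC logging in the JVM options to get the real thing):

```python
import re
from datetime import datetime

# Hypothetical sample of JVM GC log lines (all values made up for illustration).
sample_log = """\
2011-01-10T12:00:01.000: [Full GC 123456K->45678K(524288K), 31.234 secs]
2011-01-10T12:05:02.000: [Full GC 234567K->45678K(524288K), 29.876 secs]
2011-01-10T12:10:03.000: [Full GC 234567K->45678K(524288K), 30.511 secs]
"""

def long_pauses(log_text, threshold_secs=10.0):
    """Return (timestamp, pause_seconds) pairs for pauses above the threshold."""
    pauses = []
    for line in log_text.splitlines():
        m = re.match(r"(\S+): \[Full GC .*?, ([\d.]+) secs\]", line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            secs = float(m.group(2))
            if secs >= threshold_secs:
                pauses.append((ts, secs))
    return pauses

pauses = long_pauses(sample_log)
# Average interval between long pauses, in seconds:
interval = (pauses[-1][0] - pauses[0][0]).total_seconds() / (len(pauses) - 1)
print(len(pauses), interval)
```

Comparing that interval before and after halving or quartering the load is exactly the experiment suggested above.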
--
/ Peter Schuller
.
--
/ Peter Schuller
Cassandra nodes will always
freeze for 30 seconds every now and then is also helping no one, other
than not being true.)
--
/ Peter Schuller
memory to
slabs for different object sizes, when the object size distribution
changes.
--
/ Peter Schuller
by jumping through some
extra hoops (essentially using an 'extra' node so you can do
insertions followed by removals instead of moves), but it's more of a
hassle.
--
/ Peter Schuller
cause data to be
replicated. But until that's done, the node should be returning
inconsistent results.
So, turning off auto_bootstrap probably just hid/changed the symptom
of the problem you're seeing rather than fixing it.
--
/ Peter Schuller
because it's suddenly unclear to me how
this is supposed to work with respect to nodes being down (supposing
it's truly down, forever, and needs to be replaced).
Anyone?
--
/ Peter Schuller
Do you have row cache enabled? Disable it. If it fixes it and you want
it, re-enable but consider row sizes and the cap on the cache size.
--
/ Peter Schuller
who knows...
But yes, it's certainly a very valid point that concurrent streaming
I/O can be an issue. But as you say, given that there is a reasonably
small amount of concurrent writes that (except for the commit log)
write in large chunks, it should not be a major concern.
--
/ Peter Schuller
is the right approach to the problem
since that's more about using the Cassandra data model appropriately.
--
/ Peter Schuller
expect
fragmentation is if you run Cassandra on a file system that also does
something else that fragments the hell out of it.
--
/ Peter Schuller
of bandwidth.
--
/ Peter Schuller
.
--
/ Peter Schuller
correct me if I'm painting
a too gloomy picture here.)
--
/ Peter Schuller
there. Sorry.)
--
/ Peter Schuller
), this
essentially implies co-ordination on every read, at all times.)
--
/ Peter Schuller
() has a tendency to cause swapping out of
the JVM heap. But this is not because the process actually uses more
memory as such.
I didn't read it just now, but from scrolling through it the Wikipedia
article seems like a pretty good intro:
http://en.wikipedia.org/wiki/Virtual_memory
--
/ Peter Schuller
management technique used
by the JVM whether this is practical. (It just occurred to me that
Azul should get this for 'free' in their GC. Wonder if that's true.)
--
/ Peter Schuller
memtables thresholds far too aggressively for your heap
size, resulting in an out-of-memory error which, given Cassandra's
default JVM options, results in the node dying.
Bottom line: Check /var/log/cassandra/system.log to begin with and see
if it's reporting anything or being restarted.
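As a rough sketch, here's the kind of scan to do over system.log (the log excerpt below is invented for illustration; real lines have a similar general shape):

```python
# Hypothetical excerpt of a Cassandra system.log; all lines are invented
# for illustration, but the markers scanned for are the tell-tale ones.
sample_log = """\
 INFO [FlushWriter:1] 2011-01-10 12:00:00,000 Memtable.java Writing Memtable-Standard1@123(5120 bytes)
ERROR [MutationStage:12] 2011-01-10 12:01:00,000 AbstractCassandraDaemon.java Fatal exception
java.lang.OutOfMemoryError: Java heap space
 INFO [main] 2011-01-10 12:02:00,000 AbstractCassandraDaemon.java Cassandra version: 0.7.0
"""

def suspicious_lines(log_text):
    """Pick out lines suggesting an OOM death or a restart
    (a startup banner mid-log means the node was restarted)."""
    markers = ("OutOfMemoryError", "Fatal exception", "Cassandra version")
    return [l for l in log_text.splitlines() if any(m in l for m in markers)]

for line in suspicious_lines(sample_log):
    print(line.strip())
```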
--
/ Peter
limits :)
--
/ Peter Schuller
it
will not be sorted on key (it will be sorted on ring token).
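A minimal sketch of why ring order differs from key order, assuming RandomPartitioner-style MD5-based tokens (the details of Cassandra's exact token derivation differ; this just shows the shape of the idea):

```python
import hashlib

def token(key):
    """RandomPartitioner-style token: the key's MD5 digest as an integer.
    (Sketch only; Cassandra's actual token derivation differs in details.)"""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = ["apple", "banana", "cherry"]
by_token = sorted(keys, key=token)
print(by_token)  # ring order, generally not alphabetical order
```

Range scans under RandomPartitioner return rows in this token order, which is effectively random with respect to the keys themselves.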
--
/ Peter Schuller
.)
--
/ Peter Schuller
So far so good, but it regularly happens, that data from one application
ends up in columnfamilies reserved for the other application as well as the
intended columnfamily.
Maybe https://issues.apache.org/jira/browse/CASSANDRA-1992
--
/ Peter Schuller
complete single handbook style documentation is:
http://www.datastax.com/docs/0.7/index
Other than that, the wiki is probably the most structured place.
--
/ Peter Schuller
with default settings).
If you've got a 3 gig heap size and the other nodes stay at 500 mb,
the question is why *don't* they increase in heap usage. Unless your
500 mb is the report of the actual live data set as evidenced by
post-CMS heap usage.
--
/ Peter Schuller
(If you're looking at e.g. jconsole graphs a screenshot of the graph
would not hurt.)
--
/ Peter Schuller
repair nor read repair is primarily intended to address
arbitrary data corruption but rather to reach eventual consistency in
the cluster (after writes were dropped, a node went down, etc).
--
/ Peter Schuller
arbitrary
corruption (anyone?).
--
/ Peter Schuller
is a very strong and blanket requirement
for making the decision to mix small rows with larger ones.
--
/ Peter Schuller
/tentative opinion is that the clean fix is for Cassandra to
support strong checksumming at the sstable level.
Deploying on e.g. ZFS would help a lot with this, but that's a problem
for deployment on Linux (which is the recommended platform for
Cassandra).
--
/ Peter Schuller
. Checksumming at the sstable level would allow
detection of corruption, and regular repair and read-repair would
provide self-healing.
--
/ Peter Schuller
.
--
/ Peter Schuller
against remote machines, if your test is a sequential workload.
Adding machines increases aggregate throughput across multiple
clients; it won't make individual requests faster (except indirectly
of course by avoiding overloaded conditions).
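A back-of-the-envelope sketch of that distinction (the latency figure is an illustrative assumption, not a measurement):

```python
def aggregate_throughput(clients, per_request_latency_s):
    """Each sequential client completes 1/latency requests per second,
    so total throughput scales with the number of concurrent clients,
    while each individual request still takes the same time."""
    return clients / per_request_latency_s

latency = 0.010  # assume 10 ms per request, unchanged by cluster size
print(aggregate_throughput(1, latency))  # one sequential client
print(aggregate_throughput(8, latency))  # eight concurrent clients
```

A single sequential benchmark client measures only the second factor (latency), which is why adding nodes does not speed it up.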
--
/ Peter Schuller
time columns are massively useful/important; without them
one is much more blind as to whether or not storage devices are
actually saturated.
--
/ Peter Schuller
completed).
--
/ Peter Schuller
). It is unrelated to counting
rows or columns, unless the application happens to use them for that.
--
/ Peter Schuller
with that.)
--
/ Peter Schuller
.
--
/ Peter Schuller
to say too
much here. Anyone else? (anti-entropy granularity, compaction
in-memory thresholds and GC tweaking, etc)
--
/ Peter Schuller
What port number do I actually need?
8080 (well, up until just very recently in trunk). Look at the VM
options Cassandra is running with and you'll see the JMX agent port:
-Dcom.sun.management.jmxremote.port=8080
--
/ Peter Schuller
. GCGraceSeconds
with respect to tombstones is different, and refers to garbage
collecting tombstones by removing them during compaction.
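A minimal sketch of that rule (864000 seconds, i.e. 10 days, is the shipped default for GCGraceSeconds; the timestamps are made up):

```python
GC_GRACE_SECONDS = 864000  # shipped default: 10 days

def tombstone_purgeable(tombstone_ts, now, gc_grace=GC_GRACE_SECONDS):
    """A tombstone may be dropped during compaction once it has existed
    for longer than GCGraceSeconds; before that it must be kept so the
    delete can still propagate to nodes that missed it."""
    return now - tombstone_ts > gc_grace

now = 1_300_000_000  # arbitrary unix timestamp for illustration
print(tombstone_purgeable(now - 5 * 86400, now))   # 5 days old
print(tombstone_purgeable(now - 11 * 86400, now))  # 11 days old
```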
--
/ Peter Schuller
you still have concerns
that the data has not propagated correctly).
--
/ Peter Schuller
to be working around
https://issues.apache.org/jira/browse/CASSANDRA-1955
--
/ Peter Schuller
to acquire a new one. Any
initial token in the configuration is ignored; it is only the
*initial* token, quite literally. Changing the token would require a
'nodetool move' command.
--
/ Peter Schuller
better overall control over how flushing happens and when writes start
blocking, rather than necessarily implying that there can't be more
than one memtable (the ticket Stu posted seems to address one such
means of control).
--
/ Peter Schuller
you're
doing, the critical point is that you should run 'nodetool repair'
often enough relative to GCGraceSeconds:
http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
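The rule of thumb from that wiki page, as a trivial sketch (assuming the default GCGraceSeconds of 864000 seconds, i.e. 10 days):

```python
def repair_schedule_is_safe(repair_interval_s, gc_grace_s=864000):
    """Every node must complete 'nodetool repair' at least once per
    GCGraceSeconds, or expired tombstones can let deleted data
    resurface after a missed delete."""
    return repair_interval_s < gc_grace_s

print(repair_schedule_is_safe(7 * 86400))   # weekly repair, 10-day grace
print(repair_schedule_is_safe(14 * 86400))  # biweekly repair: too slow
```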
--
/ Peter Schuller
stated it (some CFs that got
written at a slow pace and just happened to flush at the same time),
that should be enough.
(If you have some CFs being written to faster than they are flushed,
there would still be potential for one CF to hog the flush writers
unfairly.)
--
/ Peter Schuller
@Peter Isn't clean up a special case of compaction? IE it works as a
major compaction + removes data not belonging to the node?
Yes, sorry. Brain lapse. Ignore me.
--
/ Peter Schuller
sure you're not
swapping a bit?
Also, what do you mean by node instability - does it *completely* stop
responding during these periods or does it flap in and out of the
cluster but is still responding?
Are your nodes disk bound or CPU bound during compaction?
--
/ Peter Schuller
Adding CL.TWO would be easy enough. :)
True, but the obvious generalization is to be able to select an
arbitrary replica count and that seemed like a bigger change to the
API. But if CL.TWO would be considered clean enough... I may submit a
jira/patch.
--
/ Peter Schuller
, decreasing in_memory_compaction_limit_in_mb from its default of 64
might help here, I suppose.
--
/ Peter Schuller
all)? Given
sufficient amounts of overwrites/deletions, variations in compacting
timing could account for differences.
--
/ Peter Schuller
add them. If you're looking to add e.g. 25% more
nodes that requires moving nodes around. Be aware that moving a node
implies decommission + bootstrap, so the node will temporarily not be
contributing to cluster capacity during the move.
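A quick sketch of why moves are needed: with RandomPartitioner, balanced initial tokens are evenly spaced points in the 2**127 token space, and almost every ideal token changes when the node count changes (the token math assumes RandomPartitioner; node counts are illustrative):

```python
RING = 2 ** 127  # RandomPartitioner token space

def balanced_tokens(node_count):
    """Evenly spaced initial tokens for a perfectly balanced ring."""
    return [i * RING // node_count for i in range(node_count)]

four = balanced_tokens(4)
five = balanced_tokens(5)
# Going from 4 to 5 nodes: only token 0 stays put, so every other
# existing node would need a 'nodetool move' to rebalance.
print(set(four) & set(five))
```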
--
/ Peter Schuller
to the version that was installed by the package;
there is probably a mistake in there somewhere. (Yes, the error
message could of course be more informative...)
--
/ Peter Schuller
this method ?
Not in production, but I've done testing with values on the order of
a few megs. Expect compaction to be entirely disk bound rather than
CPU bound. Make sure latency is acceptable even when data sizes grow
beyond memory size.
--
/ Peter Schuller
.
--
/ Peter Schuller
line, to answer that question for a particular key.
--
/ Peter Schuller
the data?
If *all* nodes in the replica set for a particular row are down, then
you won't be able to read that row, no.
--
/ Peter Schuller
for a column, that is older than the tombstone that indicates
its removal. Hence, the expiry of the tombstones is safe.
--
/ Peter Schuller
expecting write load to be high such that performance of
writes (and compaction) is a concern, or is it mostly about slowly
building up huge amounts of data that you want to be compact on disk?
--
/ Peter Schuller
because the constraints of the
cluster were violated (i.e., expired tombstones prior to nodetool
repair having been run).
I hope that clarifies.
--
/ Peter Schuller
, you do have to consider the expected start-up time when
sizing your row cache.
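A back-of-the-envelope sketch of that sizing consideration (all numbers are illustrative assumptions, not measurements):

```python
def warmup_estimate_secs(row_cache_rows, avg_row_bytes, read_mb_per_sec):
    """Rough startup-time estimate for pre-loading a saved row cache:
    total bytes to read divided by sequential read throughput.
    Every parameter value passed in here is an assumption."""
    total_bytes = row_cache_rows * avg_row_bytes
    return total_bytes / (read_mb_per_sec * 1024 * 1024)

# e.g. 2 million cached rows of ~2 KB each at 100 MB/s:
print(warmup_estimate_secs(2_000_000, 2048, 100))
```

Even under these friendly assumptions the warm-up is tens of seconds; much larger caches or slower disks push it into minutes.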
--
/ Peter Schuller
for the schema
that you actually had in 0.6) the data will be available again. I.e.,
the moment the schema is created it is essentially created
pre-populated with your sstables instead of empty.
--
/ Peter Schuller
?
--
/ Peter Schuller
the write is officially ACKed.
Cassandra would now be violating the consistency guarantees it
pretended to have.
(That is ignoring any potential issues directly resulting from a node
having an internally inconsistent state w.r.t. what's actually stored
on the node.)
--
/ Peter Schuller
, and if all your discrepancies are in the form of
forgotten deletes, then GCGraceSeconds/repair seems like a likely
candidate cause.
--
/ Peter Schuller
/browse/CASSANDRA-1316.
Subtle, indeed.
I've attempted to document this here:
http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
--
/ Peter
do non-rate-limited writes to the cluster,
results are probably negative.
I'll file a bug about this to (1) elicit feedback if I'm wrong, and
(2) to fix it.
--
/ Peter Schuller
Filed: https://issues.apache.org/jira/browse/CASSANDRA-1955
--
/ Peter Schuller
in the size of rows (or very
few rows). I was hoping someone else would spot something but the
thread seems dead still :)
--
/ Peter Schuller
of very large rows that are screwing
up the statistics?
--
/ Peter Schuller
, with
multiple queries.
--
/ Peter Schuller
will take lots of iops.
--
/ Peter Schuller
I know that there's a limit, and I just assumed that the CLI set it to 100,
until I saw more than 100 results.
Ooh, sorry. Didn't read carefully enough. Not sure why you see that
behavior. Sounds strange; should not be supported at the thrift level
AFAIK.
--
/ Peter Schuller
is a background operation. It is never
the case that reads or writes somehow wait for compaction to
complete.
--
/ Peter Schuller
, that should be a minor issue because it should
be easy to avoid the possibility of getting to the point of not
fitting a single (size limited) sstable.
--
/ Peter Schuller
-existing silver bullet in a current
release; you probably have to live with the need for
greater-than-theoretically-optimal memory requirements to keep the
working set in memory.
--
/ Peter Schuller
mutation,
concurrent readers may see a partially applied batch mutation for a
given row even though it is only being written to a single node.
--
/ Peter Schuller
instead of byte[]. This would affect the thrift API and some internal
stuff in Cassandra. Likely whatever code is passing a byte[] needs to
be updated.
(I haven't used the hadoop support though so I'm not sure what the
culprit is specifically. Perhaps someone can fill in.)
--
/ Peter Schuller
), such a feature could be
turned off. But it would make it less easy to accidentally write
yourself into a corner.
--
/ Peter Schuller
will probably be 256 mb. And with out-of-the-box
cassandra.yaml settings I would not be surprised if you're dying with
an OutOfMemory error and possibly with high amounts of GC activity
before actually dying.
--
/ Peter Schuller
is being sent (anyone?).
In any case: monitoring disk space is very, very important.
BTW what precisely does the Owns column mean?
The percentage of the token space owned by the node.
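As a sketch (assuming a RandomPartitioner-style 2**127 ring; the example tokens are made up), each node owns the arc from its predecessor's token, wrapping around the ring, up to its own token:

```python
RING = 2 ** 127  # RandomPartitioner token space

def ownership(tokens):
    """Fraction of the token space owned by each node: the arc from the
    predecessor's token (wrapping around the ring) up to its own."""
    ts = sorted(tokens)
    owns = {}
    for i, t in enumerate(ts):
        prev = ts[i - 1]  # for i == 0 this wraps to the last token
        owns[t] = ((t - prev) % RING) / RING
    return owns

# A slightly unbalanced 3-node ring:
print(ownership([0, RING // 2, 3 * RING // 4]))
```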
--
/ Peter Schuller
system to offer
specific claims as to what problem you triggered.
--
/ Peter Schuller
munin/zabbix/etc. It would be pretty nice to have that
out of the box with Cassandra, though I expect that would be
considered bloat. :)
--
/ Peter Schuller
. This means that while latency is a great indicator to look
at to judge what the current user perceived behavior is, it is *not* a
good thing to look at to extrapolate resource demands or figure out
how far you are from saturation / need for more hardware.
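A toy M/M/1 queueing sketch of why (the service time is an illustrative assumption; real systems are messier, but the shape of the curve is the point):

```python
def mm1_latency(service_time_s, utilization):
    """M/M/1 mean time in system: latency stays near the bare service
    time until utilization approaches 1, then explodes. So a currently
    fine latency says little about how close you are to saturation."""
    return service_time_s / (1.0 - utilization)

service = 0.005  # assume 5 ms mean service time
for rho in (0.5, 0.9, 0.99):
    print(rho, round(mm1_latency(service, rho) * 1000, 1), "ms")
```

At 50% utilization latency is only 2x the service time; the last few percent of headroom is where it blows up, which is why you need utilization-style metrics rather than latency to extrapolate.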
--
/ Peter Schuller
for extended periods once larger
multi-hundreds-of-gig sstables are being compacted. However, that
said, if you are just continually increasing your sstable count
(rather than there just being spikes) that indicates compaction is not
keeping up with write traffic.
--
/ Peter Schuller
.)
We should be careful not to mislead people. Talking about 16TB XFS
setup, or 100TB/node without any difficulties, seems very, very far
from the common use case.
I completely agree. I didn't mean to imply that and I hope no one was misled.
--
/ Peter Schuller
to object if any information
is misleading.
--
/ Peter Schuller
is going to be whether caching is effective at
all, and how much additional caching would help.
In any case, it would be interesting to know whether you are seeing
more disk seeks per read than you should.
--
/ Peter Schuller
really does seem indicative of a
problem, unless the bottleneck is legitimately reads from disk
from multiple sstables resulting from rows being spread over said
sstables.
--
/ Peter Schuller
never worked properly for me in my browsers... too broken-ajaxy.)
--
/ Peter Schuller
Sorry for spam again. :-)
No, thanks a lot for tracking that down and reporting details!
Presumably a significant amount of users are on that version of Ubuntu
running with openjdk.
--
/ Peter Schuller
for the ln fix.)
--
/ Peter Schuller