a memtable is flushed can the
commit log, which contains the data being flushed, be removed by
Cassandra.
--
/ Peter Schuller
It would be great if Cassandra put this on its roadmap. There are a
lot of durability benefits to incorporating DC awareness into the
write consistency equation.
You may be interested in the discussion here:
https://issues.apache.org/jira/browse/CASSANDRA-2338
--
/ Peter Schuller
changes so
I'm not sure off hand what'll happen to the auto-system.gc code in
cassandra that attempts to free space.
CASSANDRA-2521 is IMO the real solution.
--
/ Peter Schuller
at the same time.
--
/ Peter Schuller
adding a lot of overhead in terms of disk I/O
unless your data set fits comfortably in memory.
--
/ Peter Schuller
kernels) barriers at the OS level - and
the list goes on.
--
/ Peter Schuller
will legitimately get very low
values without it indicating anything is wrong.)
--
/ Peter Schuller
and cassandra service
5). read the key list generated in step 2) with consistency level ONE
How sure are you that the system is honoring fsync() properly,
including flushing any caches on underlying drives? Or is this with
battery backed caching RAID controllers?
--
/ Peter Schuller
the more I think of it ;)
--
/ Peter Schuller
Hmmm, I'm starting to like this idea more and more the more I think of it ;)
Filed: https://issues.apache.org/jira/browse/CASSANDRA-2699
--
/ Peter Schuller
. Particularly in a
situation with lots of dropped messages.
I'm getting the 2^15 from AntiEntropyService.Validator.Validator()
which passes a maxsize of 2^15 to the MerkleTree constructor.
--
/ Peter Schuller
/browse/CASSANDRA-2554 which may be
relevant, but simple overload is also a possible reason.
--
/ Peter Schuller
/) for
that.
--
/ Peter Schuller
Ok, so I think I found one major cause contributing to the increasing
Nice job tracking this down! That is useful to know, even outside of
Cassandra use cases. Frankly it's disappointing to learn what nio is
doing.
--
/ Peter Schuller
is useful since it allows consistency
semantics similar to ALL but allows you to survive nodes being down,
at the cost of a higher RF (3 at least))
--
/ Peter Schuller
? Including overwrites. If not, I'm
not sure what's going on. Since you said it took about a day of
traffic it feels fishy.
--
/ Peter Schuller
to be, due to the LRU:ishness of caches, the less frequently accessed
data that tends to make it difficult to judge by numbers that include
all I/O.
--
/ Peter Schuller
setting in question is the memtable_flush_after
setting. Do you have that set to something very high on one of your
column families?
You can use describe keyspace name_of_keyspace in cassandra-cli to
check current settings.
--
/ Peter Schuller
What is the best way to find keys of such big rows?
One, if not necessarily the best, way is to check system.log for large
row warnings that trigger for rows large enough to be compacted
lazily. Grep for 'azy' (or 'lazy', case-insensitively) and you should find it.
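
For anyone who would rather script the search than grep by hand, a minimal
Python sketch of the same case-insensitive match could look like this. The
function name and the sample lines are made up for illustration; the exact
wording of Cassandra's warning is not reproduced here.

```python
from typing import Iterable, List


def find_lazy_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention 'lazy' in any letter case."""
    return [line.rstrip("\n") for line in lines if "lazy" in line.lower()]


# Placeholder lines, NOT real Cassandra log output.
sample = [
    "placeholder: Lazy handling of a big row\n",
    "placeholder: nothing of interest\n",
]
matches = find_lazy_lines(sample)
```

To scan a real log, pass an open file object for system.log instead of the
sample list.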
--
/ Peter Schuller
store?
So the only thing I can do is test it and see how it goes. To make the
change effective, should I do anything beyond changing the value in
cassandra.yaml and restarting the node? I'll try first with 256 and see
what happens.
That should be it.
--
/ Peter Schuller
to judge and likely depends a lot on i/o
scheduling and other details.
--
/ Peter Schuller
lead to a delay
of sstable removal which will vary with whatever else is happening
(the more busy the node, the more often a concurrent mark/sweep gc
phase is triggered, and the more frequently obsolete sstables are
deleted).
--
/ Peter Schuller
.
--
/ Peter Schuller
to the effects of the temporary spike in data size
and cache coldness. Sounds like it makes good sense in your situation
though.
--
/ Peter Schuller
that the need for major compactions is significantly lessened
or even eliminated. However, running major compactions won't cause
tombstones *not* to be removed; it's just not required *in order* for
them to be removed.
--
/ Peter Schuller
the former row key gets
restored to a point in time prior to that of the latter row key may
cause the latter write to become visible even though the former write
is lost.
--
/ Peter Schuller
: Think before typing. :)
--
/ Peter Schuller
Thanks Peter. I am using java version of the stress testing tool from the
contrib folder. Is there any issue that I should be aware of? Do you recommend
using pystress?
I just saw Brandon file this:
https://issues.apache.org/jira/browse/CASSANDRA-2578
Maybe that's it.
--
/ Peter Schuller
haywire as you don't
service as many I/O requests as are coming in.
There is a grey area in between where latency will be very sensitive
to smallish changes in I/O load while aggregate throughput remains
below what can be sustained.
--
/ Peter Schuller
.
--
/ Peter Schuller
usage during periods of timeouts. If the huge allocations fail due to
fragmentation and fall back to a full GC, that might be an expected
result. Otherwise, try -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps.
--
/ Peter Schuller
issues. *Maybe* after two full gc:s tops if the
first happens while there's a mix still active in memtables.
--
/ Peter Schuller
. the common case of wanting to listen on 127.0.0.1 but no public
interfaces...
--
/ Peter Schuller
to skip rows
that are obviously bad, but true integrity checking is not supported
at this time.
--
/ Peter Schuller
;)
--
/ Peter Schuller
.
+1. I think this is a great take-away w.r.t. schema changes.
--
/ Peter Schuller
*believe* this is true, but I don't dare
promise it.
--
/ Peter Schuller
if the total database size fits
reasonably well in page cache.
Pre-heating sstables on a per-cf basis on start-up would be a nice
feature to have.
--
/ Peter Schuller
weekly/whatever repair) is the most efficient way to achieve
consistency again.
--
/ Peter Schuller
on latency than trying to serve real
live traffic without a feedback mechanism with a system which is
processing fewer requests than incoming.
--
/ Peter Schuller
http://wiki.apache.org/cassandra/FAQ#unsubscribe
--
/ Peter Schuller
.
Something like Cassandra/postgresql/etc behaves similarly. If you're
running a benchmark at a fixed concurrency of N, you won't have more
than N total requests outstanding at any given moment.
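
As a rough illustration of that point, here is a small Python sketch in which
a pool of N worker threads issues fake requests and records the peak number
outstanding. Everything in it (the function name, the sleep standing in for a
server round trip) is an assumption made for the example, not part of any
real benchmark tool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_fixed_concurrency(n_workers: int, n_requests: int) -> int:
    """Run n_requests fake requests on n_workers threads.

    Returns the highest number of requests outstanding at the same time.
    """
    lock = threading.Lock()
    in_flight = 0
    max_in_flight = 0

    def fake_request(_: int) -> None:
        nonlocal in_flight, max_in_flight
        with lock:
            in_flight += 1
            max_in_flight = max(max_in_flight, in_flight)
        time.sleep(0.001)  # stand-in for the round trip to the server
        with lock:
            in_flight -= 1

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(fake_request, range(n_requests)))
    return max_in_flight


peak = run_fixed_concurrency(n_workers=4, n_requests=40)
```

However slow the fake server gets, the peak never exceeds the number of
workers, which is why a fixed-concurrency benchmark cannot build up an
unbounded backlog the way live traffic can.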
--
/ Peter Schuller
reported by
the Cassandra client (and which client is it)?
I'm not sure what you mean w.r.t. bootstrap/master etc. All nodes
should be entirely equal, with the exception of nodes that are marked
as seed nodes. But seed nodes going down should not cause reads and
writes to fail.
--
/ Peter Schuller
to saturate your nodes.
--
/ Peter Schuller
that I need to adjust any thresholds? And how to calculate
correct value?
Is this on the same cluster/nodes that you're doing your 250k column
stresses (the other thread)?
In any case, for typical cases there is:
http://www.datastax.com/docs/0.7/operations/tuning
--
/ Peter Schuller
to a write sees the written data,
that doesn't sound very compatible with automatically falling back to
CL.ONE...
Anyways, those are my off-the-cuff thoughts - maybe it doesn't apply
in the situation in question.
--
/ Peter Schuller
decreasing heap size is the
appropriate course of action.
--
/ Peter Schuller
in Cassandra than it is in
stress.py; so suddenly a single stress.py can easily saturate lots of
nodes simply because you can so trivially be writing data at very high
throughput by upping the column sizes
--
/ Peter Schuller
significant); but if this is not the case, then this
particular problem should be a matter of adjusting heap size according
to your memtable thresholds. I.e., increase heap size and/or decrease
memtable flush thresholds.
--
/ Peter Schuller
will then go to a
node which will have potentially zero cache hit rate on that data
since all reads up to that point were taken by the node that just went
down.
So it's not an obvious win depending.
--
/ Peter Schuller
populated?
--
/ Peter Schuller
question about queue size in iostat I see it ranging
from 114-300.
Saturated.
--
/ Peter Schuller
about seed nodes not
re-bootstrapping, but that's because seeds are special-cased in the
start-up process to never initiate bootstrapping and is also unrelated
really to the bug.)
--
/ Peter Schuller
Please unsubscribe my email address from the user list.
http://wiki.apache.org/cassandra/FAQ#unsubscribe
--
/ Peter Schuller
balancing - depending on performance characteristics of nodes.
--
/ Peter Schuller
than you would otherwise, and the result of that is
additional compaction work (meaning, less sustainable write
throughput). But you won't be forced to have 120 gig nodes (a 120 gig
heap would be problematic for other reasons anyway).
--
/ Peter Schuller
Can someone comment on this? Or is the question too vague?
Honestly yeah I couldn't figure out what you were asking ;) What
specifically about the diagram are you trying to convey?
--
/ Peter Schuller
swapped out, you don't want stacks swapped
out, etc. As soon as you start swapping you very quickly run into poor
and unreliable performance. Particularly during GC.
--
/ Peter Schuller
note the virtual mem used 20.6 GB ?! and Shared 8.4 GB ?!
http://wiki.apache.org/cassandra/FAQ#mmap
--
/ Peter Schuller
. But still.
--
/ Peter Schuller
, assuming firewall allows.
Unless "does not have Cassandra installed" means that you can't use
nodetool either?
--
/ Peter Schuller
is put into trying to make Cassandra
very useful for very very small heap sizes.
--
/ Peter Schuller
; there may be gotchas involved.
--
/ Peter Schuller
caching, you will have to adjust
accordingly as well.
--
/ Peter Schuller
if max heap is 7.5 gig.
--
/ Peter Schuller
Java also enjoys using all the memory you allocate and the Garbage
collection does not give it back unless it needs to.
This only explains why it never shrinks in top, not increased heap
usage (which is presumably the memtables/key/row caches already
mentioned).
--
/ Peter Schuller
heap size of
5.5GB? Or are you using VM options such that Xms != Xmx?
If 5.5 is the maximum heap size and you're at 4 gig after the
completion of a mark-and-sweep, I'd say it would be a good idea to
pre-emptively up the heap size a bit (or decrease memtables/key/row
caches).
--
/ Peter Schuller
My question is a lot simpler: does a conflict only happen at column
level? Or is it at row level?
Individual column level only.
--
/ Peter Schuller
My question is a lot simpler: does a conflict only happen at column
level? Or is it at row level?
Individual column level only.
Well, with the special exception of whole-row deletes.
--
/ Peter Schuller
? Try using stress.py from the Cassandra
tree as a comparison.
If you're sending one request at a time, there is no expectation at
all of a performance improvement - just a decrease in performance.
--
/ Peter Schuller
is often
conflated with fsck/repair in the sense of trying to fix something
unexpected/buggy/broken
Maybe nodetool aes or nodetool antientropy?
--
/ Peter Schuller
), and
(3) no node ever happens to be down, flap, or drop a request when you
touch the data in question.
Basically, unless you are really sure what you're doing - run repair.
--
/ Peter Schuller
Are you *sure* you want it? :)
--
/ Peter Schuller
for extended
periods of time. It doesn't take a lot at all for a *few* writes not
getting replicated (for example, just restarting a Cassandra node will
cause some writes to be dropped - hinted handoff is not a guarantee,
only an optimization).
--
/ Peter Schuller
as important :) But
it relies on 'nodetool repair' reliably exiting with non-zero exit
status on failures.
if nodetool returns an error this might work:
nodetool -h localhost repair && touch /path/to/flagfile.tmp
That's the equivalent, due to 'set -e'.
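
The same idea, writing the flag file only when the command exits with status
zero, can be sketched in Python. The helper name is hypothetical, and the demo
uses the Python interpreter as a stand-in command so that it runs without a
Cassandra installation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_then_flag(command: Sequence[str], flag_path: Path) -> bool:
    """Run command; create flag_path only if it exits with status 0."""
    result = subprocess.run(command)
    if result.returncode != 0:
        return False
    flag_path.touch()
    return True


# Demo with stand-in commands that simply exit 0 or 1.
with tempfile.TemporaryDirectory() as tmp:
    ok_flag = Path(tmp) / "ok.flag"
    bad_flag = Path(tmp) / "bad.flag"
    ok = run_then_flag([sys.executable, "-c", "raise SystemExit(0)"], ok_flag)
    bad = run_then_flag([sys.executable, "-c", "raise SystemExit(1)"], bad_flag)
    ok_flag_exists = ok_flag.exists()
    bad_flag_exists = bad_flag.exists()
```

For an actual repair the command would be something like
["nodetool", "-h", "localhost", "repair"]. As noted above, this is only as
trustworthy as nodetool's exit status.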
--
/ Peter Schuller
I concur. JIRA time?
https://issues.apache.org/jira/browse/CASSANDRA-2405
--
/ Peter Schuller
. And not even all
data that is read at that (e.g. range slices don't imply repair).
--
/ Peter Schuller
sizing, but I can't find it right now.
--
/ Peter Schuller
There's a wiki page somewhere that describes the overall rule of thumb
for heap sizing, but I can't find it right now.
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
--
/ Peter Schuller
://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
(What *would* be useful perhaps is to be able to ask a node for the
time of its most recently started repair, to facilitate easier
comparison with GCGraceSeconds for monitoring purposes.)
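
If such a timestamp were available, the monitoring check itself would be
simple. The sketch below assumes the caller has already obtained the start
time of the last repair by some other means (for example, recorded by the
script that runs the repairs); the function name and the 864000 second value
are only illustrative.

```python
def repair_is_overdue(last_repair_start: float,
                      gc_grace_seconds: int,
                      now: float) -> bool:
    """True if more than gc_grace_seconds have passed since the last repair began.

    All times are in seconds since the epoch.
    """
    return (now - last_repair_start) > gc_grace_seconds


# Illustrative value only: 864000 seconds is ten days.
example_gc_grace = 864000
overdue = repair_is_overdue(last_repair_start=0.0,
                            gc_grace_seconds=example_gc_grace,
                            now=864001.0)
```

In practice one would alert well before the limit is reached rather than at
the limit itself.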
--
/ Peter Schuller
-case and
requirements, may benefit from more frequent repair. But the important
part is the minimum repair frequency mandated by Cassandra - and
determined by GCGraceSeconds.
--
/ Peter Schuller
be helpful I think, but
doesn't currently exist (as far as I know), such that the script (or
whatever) doing the repairs would have to solve that problem.
--
/ Peter Schuller
Can someone list some of the current international language
implementations of Cassandra?
What is an international language implementation of Cassandra?
--
/ Peter Schuller
would be hinted hand-off when writing at CL.ANY
and hinted hand-off, but I'll leave that beyond the scope of this.)
--
/ Peter Schuller
is large and the workload is such that young-gen gc:s
promote a high percentage of its data, I suspect that can lead to CMS
triggering too late. (So this paragraph is speculation.)
--
/ Peter Schuller
. See cassandra-env.sh.
Well, in_memory_compaction_limit_in_mb can be relevant. But this
should only matter if you have large rows in your column family.
--
/ Peter Schuller
was not responsible anymore for that data.
But this doesn't explain why he was able to read the stale data? Or
did I miss something about actually having removed the second node
from the ring after it was shut off?
--
/ Peter Schuller
violate
the requirement of quorum.
For example, if you're at RF=3, at least two nodes (in the replica set
for a given key) must be responding to your requests in order for them
to succeed at QUORUM.
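
The arithmetic behind that example is a majority of the replicas, i.e.
floor(RF / 2) + 1. A small Python sketch (the function names are made up for
illustration):

```python
def quorum_size(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM request."""
    return replication_factor // 2 + 1


def quorum_can_succeed(replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas for a key are up to satisfy QUORUM."""
    return live_replicas >= quorum_size(replication_factor)


rf3_quorum = quorum_size(3)                 # two of three replicas
one_down_ok = quorum_can_succeed(3, 2)      # one replica down: still fine
two_down_ok = quorum_can_succeed(3, 1)      # two replicas down: fails
```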
--
/ Peter Schuller
with multiple disks.
--
/ Peter Schuller
device. This in turn, in the case of a RAID0, means
multiplying the target maximum queue depth with the number of drives.
--
/ Peter Schuller
instead of cfq.)
--
/ Peter Schuller
columns in a row. Maybe if you describe
your use-case.
--
/ Peter Schuller
of the
sstable for specific keys that have recently been accessed. (Normally
the index portion is seeked into, a bit of data is streamed from disk,
de-serialized, and used to find the offset for that key).
--
/ Peter Schuller
)
Anyways, I sympathize with your issues and the fact that you don't
have time to start attaching with profilers etc. Unfortunately I don't
know what to suggest that is simpler than that.
--
/ Peter Schuller
the general recommendation is to set the token manually to
avoid surprises.
--
/ Peter Schuller
I am going to try that.
Also, you may want to augment your VM options with:
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
That way there should hopefully be some corroborating evidence as to
the nature of the heap growth over time.
--
/ Peter Schuller
that involved being stuck in
system cpu spin... but I may be mis-remembering.
--
/ Peter Schuller
with the data
structures involved, as well as longer-term effects like fragmentation
in old-space in the GC (similarly to malloc/free).
--
/ Peter Schuller
instead of 10 gb. I haven't investigated
properly. Maybe just depending on mmap flags.
--
/ Peter Schuller
a Nagle + delayed ACK problem. The PHP client probably isn't
setting the TCP_NODELAY flag (or the equivalent in Windows).
Google for nagle delayed ack for details.
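
For reference, this is roughly what setting the flag looks like at the socket
level. The sketch is in Python rather than PHP and does not show how any
particular Cassandra client library exposes the option; the helper name is
made up.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 sends small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The option has to be set on the client's own connection, so whether it can be
changed depends on what the client library in use allows.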
--
/ Peter Schuller