om
partitioner, while having reasonable efficiency on range slices by
making each row contain a pretty large range such that any additional
overhead in jumping across nodes is negligible in comparison to the
other work done.
--
/ Peter Schuller
CPU bound rather than being close
to saturate disk.
In either case, please do report back as it's interesting to figure
out what kind of performance issues people are seeing.
--
/ Peter Schuller
e future.
--
/ Peter Schuller
n operation. How that translates to "used"
memory from the perspective of the operating system will largely
depend on JVM settings.
--
/ Peter Schuller
value based conflict
resolution if you are expecting a set of column updates to either
apply or not apply as a group (with respect to some other group of
updates). That's a bit subtle.
Also on the topic of granularity, entire super columns and entire rows
may be deleted without individually referring to all columns. In those
cases, deletes span entire rows or supercolumns rather than individual
columns.
--
/ Peter Schuller
el
clients?)
But of course, there is no such guarantee in the distributed sense either way.
--
/ Peter Schuller
ink this should be documented, because engineers will hit that 'local'
> undeterministic issue for sure if two instances of their applications
> perform 'completed writes' in the same column family. Completed does not
> mean successful, even with quorum (or ALL). They ought to know it.
I think it does. I believe the results you are describing as
unexpected are fully expected fundamentally, and there is no real
difference implied in receiving a timestamp ACK flag back. I'm totally
open to being wrong or having misunderstood something (or both), but
right now I don't see it. If on the other hand I'm not wrong then
perhaps we can figure out how to document or present the functionality
of Cassandra better :)
--
/ Peter Schuller
output etc. But how to interpret
them will be very dependent on what you're doing.
How's that for a non-answer? :)
--
/ Peter Schuller
anyway just after you did your read).
Perhaps I'm misunderstanding what you're trying to do?
--
/ Peter Schuller
set, or it was given the right a bit later in the background, or it
happened during anti-entropy etc.
--
/ Peter Schuller
r need to remove a value and then
re-write it, you have to up the timestamp on your inserts in order for
it to be re-inserted. Same thing if you sometimes need to actually
overwrite said value in some different context than the concurrent
writer case you're trying to solve.
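To illustrate the timestamp point, here is a minimal sketch of last-write-wins resolution for a single column. It is a toy model, not Cassandra's code: the Column class and its methods are hypothetical names, and resolving equal timestamps in favour of the existing entry is an assumption made only to keep the sketch short.

```python
class Column:
    """Toy model of one column resolved purely by timestamp (hypothetical)."""

    def __init__(self):
        self.value = None
        self.timestamp = None
        self.deleted = False

    def _accepts(self, timestamp):
        # Assumption of this sketch: only a strictly newer timestamp wins.
        return self.timestamp is None or timestamp > self.timestamp

    def insert(self, value, timestamp):
        if self._accepts(timestamp):
            self.value, self.timestamp, self.deleted = value, timestamp, False

    def delete(self, timestamp):
        # A delete is recorded as a marker carrying its own timestamp.
        if self._accepts(timestamp):
            self.value, self.timestamp, self.deleted = None, timestamp, True

    def read(self):
        return None if self.deleted else self.value


if __name__ == "__main__":
    col = Column()
    col.insert("v1", timestamp=10)
    col.delete(timestamp=20)
    col.insert("v2", timestamp=10)   # ignored: older than the delete
    print(col.read())
    col.insert("v2", timestamp=30)   # accepted: newer than the delete
    print(col.read())
```

In this model a re-insert that reuses the old timestamp silently loses to the delete, which is why the timestamp has to be raised.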
--
/ Peter Schuller
e more disk bound.
Unless you have a problem with compaction speed not staying caught up
with write speeds, this is not necessarily bad as it limits the impact
of compaction on disk I/O (i.e., less negative effects for real
traffic assuming real traffic goes down to disk and isn't entirely
in-memory).
--
/ Peter Schuller
cumstances, the key cache should
likely not be huge in comparison to available memory.
--
/ Peter Schuller
hat it is trying to do that you want this behavior
for?
--
/ Peter Schuller
And I
> apologize in advance if I've missed something obvious, I have all of about
> 1.5 days experience and I'm trying to learn fast ;-)
Do both A and B agree about the ring state (i.e., they both know
about both nodes) prior to you shutting one of them down?
--
/ Peter Schuller
means that XXX out of YYY bytes in the heap is live).
(I am making *some* simplifications here because the concurrent nature
of CMS means that you never get a snapshot view; but for practical
purposes you can consider the above to be true with Cassandra.)
Similar lines for "ParNew" you can mostly ignore for the purpose of
monitoring heap usage unless you specifically know what you're looking
for in those. The ConcurrentMarkSweep ones are what tell you what the
actual amount of live data in the heap is.
--
/ Peter Schuller
o make the assumption that
actively used data fits in RAM will severely affect the hardware
requirements for serving your load.
--
/ Peter Schuller
disable the memtable flush part of startup?
While I don't argue against such a change, one argument for *having*
the flush is that subsequent restarts become faster and it might help
if you're in some kind of situation where nodes continually die
shortly after having been started.
(But I total
full GC for the purpose of shrinkage.
--
/ Peter Schuller
ficient with larger heap sizes and in order to avoid bad policy
decisions of the GC causing excessive GC activity rather than just
growing the heap which you're fine with anyway.
--
/ Peter Schuller
L_FUTURE (Cassandra does the former).
--
/ Peter Schuller
omments).
Can anyone confirm/deny that this is an intended guarantee?
--
/ Peter Schuller
at much less tweaking of heap sizes was
necessary, but this is not the case.)
--
/ Peter Schuller
ntly useful (i.e., lots of truly unused
stuff that is good to keep swapped out).
--
/ Peter Schuller
here are "Y" keyspaces, Is
> it recommended to allocate "XY" as the max Heap size ? Please let me know.
Yes. Each column family will have a memtable subject to the configured
memory constraints; whether or not they are in different keyspaces
does not matter.
--
/ Peter Schuller
of the data to inspect when doing read repair.
> Will this result in better read performance?
Sorry, I did the impolite thing and began responding before having
read your entire E-Mail ;)
So yes, a low RF would increase read performance, but assuming you
care about data redundancy the better way to achieve that effect is
probably to decrease or disable read repair.
--
/ Peter Schuller
> PS. Are other ppl interested in this functionality ?
> I could file it to JIRA as well...
I was about to post that such a thing was useful for point-in-time
recovery before reading your post, so yes :)
--
/ Peter Schuller
erts.
(That said, deletes are a bit special in other ways due to
GCGraceSeconds, but that is another story.)
--
/ Peter Schuller
I forget and I didn't find it by brief sifting through thread history;
were you running on small EC2 instances or larger ones?
--
/ Peter Schuller
the circumstances of
compaction problems that people do have.
--
/ Peter Schuller
assandra/LiveSchemaUpdates
> How could I change, for example, the rows_cached or name
> fields below on the fly without losing data? Is this possible?
Check out the system_* methods at the bottom of
interface/cassandra.thrift. I believe these are in working order for
0.7.
--
/ Peter Schuller
s that the .deb you're running (and now don't
have) was built without other modification from the 0.6.1 tag. IIRC
debuild is in 'build-essential' in Debian.
--
/ Peter Schuller
ta; generally the larger the individual values the
more disk bound you'd tend to be.)
Just trying to zero in on what the likely root cause is in this case.
--
/ Peter Schuller
tion to spreading writes
across all nodes in the cluster, you are submitting writes with
sufficient concurrency to allow Cassandra to scale to use available
CPU across all cores.
--
/ Peter Schuller
fic does
not make much of a difference. But as a general recommendation,
distributing the load across machines is a good default choice of
behavior.
(Of course not doing so should not cause the stack overflow you're
seeing, so this is a separate issue.)
--
/ Peter Schuller
ut this could be entirely wrong; don't take my word
for it - anyone else?).
On the other hand, if the only concern is I/O bandwidth rather than
CPU use maybe the situation is different. (Does anyone have numbers on
variation in available disk bandwidth over time for EC2 instances?)
--
/ Peter Schuller
real fix (on the premise that there is no legitimate significant stack
depth).
I'm not sure what the best course of action is here short of going
down the path of trying to investigate the problem at the JVM level.
I'm hoping someone will come along and point to a simple explanation
that we're missing :)
--
/ Peter Schuller
he default stack size on Windows is so small as to make
this an expected outcome given the stack depth in Cassandra.
I wonder if there is memory corruption going on that causes the
overflow. Or am I missing something simple?
--
/ Peter Schuller
and of itself, no. That's more 1214 (above).
--
/ Peter Schuller
e. Do you have a URL / bug id or
anything that one can read up on about this?
--
/ Peter Schuller
er some circumstances,
contribute to decreasing the impact of compaction once it happens, but
that would be highly dependent on I/O scheduling and other
circumstances. I wouldn't touch it unless I knew specifically there
was a reason to.
--
/ Peter Schuller
e due to regular traffic when compaction is not happening?
--
/ Peter Schuller
is possible that EC2 instances are over-committed on
memory such that swapping is happening behind the scenes on the
host... but I have always assumed memory was not over-committed on
EC2.
Can you run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps?
--
/ Peter Schuller
> Can anyone help shed any light on why this might be happening? We've tried a
> variety of JVM settings to alleviate this; currently with no luck.
Extremely long ParNew (young generation) pause times are almost
always due to swapping. Are you swapping?
--
/ Peter Schuller
ower outages affecting multiple writes or even entire data
centers.
--
/ Peter Schuller
ble [0x]
"Surrogate Locker Thread (CMS)" daemon prio=5 tid=0x000801cec800
nid=0x801c3d500 waiting on condition [0x]
"Finalizer" daemon prio=8 tid=0x000801ced800 nid=0x801c3ddc0 in
Object.wait() [0x7f3f6000]
"Reference Handler" daemon prio=10 tid=0x000801cef000
nid=0x801c3e680 in Object.wait() [0x7f4f7000]
"main" prio=5 tid=0x000801cef800 nid=0x800c0ae40 runnable
[0x7fbfe000]
"VM Thread" prio=9 tid=0x000801d22000 nid=0x801c3ef40 runnable
"Gang worker#0 (Parallel GC Threads)" prio=9 tid=0x000801d26000
nid=0x801c41cc0 runnable
"Gang worker#1 (Parallel GC Threads)" prio=9 tid=0x000801d25000
nid=0x801c41400 runnable
"Gang worker#2 (Parallel GC Threads)" prio=9 tid=0x000801d24800
nid=0x801c40b40 runnable
"Gang worker#3 (Parallel GC Threads)" prio=9 tid=0x000801d24000
nid=0x801c40280 runnable
"Concurrent Mark-Sweep GC Thread" prio=9 tid=0x000801d23800
nid=0x801c3f800 runnable
"VM Periodic Task Thread" prio=10 tid=0x000801d20800
nid=0x908d139c0 waiting on condition
--
/ Peter Schuller
o draw conclusions about performance. And you don't want those
exceptions in your logs.
--
/ Peter Schuller
or the real production use case.
This is not to complain that you are not providing enough information,
but to make the point that the performance of these things can be
subtly affected by many things and it is difficult to predict expected
performance to a high degree of precision - particularly when multiple
different forms of caching are involved in the comparison.
--
/ Peter Schuller
am I missing something?
--
/ Peter Schuller
Cassandra is not reflective of reality. This is
especially true for large data sets (and by that I mean "amount of
data on a single machine").
--
/ Peter Schuller
problem. I'm not sure whether or not this is a reasonable hypothesis
with C# + Thrift on Windows. Sorry, I don't know what the easy fix is
either but perhaps it might help dig further.
[1] http://en.wikipedia.org/wiki/Nagle's_algorithm
--
/ Peter Schuller
ad for each row is probably the least flattering case for
Cassandra when disk bound.
Without knowing more details it's probably difficult to offer specific
explanations for this particular case.
--
/ Peter Schuller
nce in
question will be collected - *ever* - since G1 always picks the "best"
regions first (best in terms of "bang for the buck" - the most memory
reclaimed at the lowest cost).
--
/ Peter Schuller
how
incremental mode is implemented, but I doubt they've avoided this), is
that the total time needed for the mark/sweep once it does run is
higher, such that you retain more floating garbage that might
otherwise have been collected.
--
/ Peter Schuller
default it is
simply a matter of tweaking.
So, I guess I should re-phrase: In terms of just turning on
incremental mode without at least application specific tweaking (if
not deployment specific testing), I would suggest caution.
--
/ Peter Schuller
on't know much about how the incremental duty cycles are scheduled and it
may be the case that Cassandra is not even remotely close to having a
problem with incremental mode. But I would suggest caution and testing
prior to making it the default.
--
/ Peter Schuller
omplete a CMS sweep in time and you hit the maximum heap size, but
unless that happens, CMS will run concurrently (though there are
stop-the-world pauses involved, that are typically very short, the
mark/sweep phase is concurrent).
As jbellis pointed out, you're almost certainly swapping.
ld not be there.
Have other people done high-throughput writes (to the point of CPU
saturation) over extended periods of time while consistently seeing
low latencies (consistently meaning never exceeding hundreds of ms
over several days)?
--
/ Peter Schuller
on
and thus by definition not eligible for garbage collection).
--
/ Peter Schuller
apping with the following single error in
> the cassandra.log
> Error: Exception thrown by the agent : java.lang.NullPointerException
>
> I assume this is an unrelated problem?
Do you have a full stack trace?
--
/ Peter Schuller
> I am trying to run Cassandra 0.7 and I am getting different errors: First it
> was while calling client.insert and now while calling set_keyspace (see
> below).
Are you perhaps not using a framed transport with thrift? Cassandra
0.7 uses framed by default; 0.6 did not.
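For what it's worth, the reason a mismatch breaks things: a framed transport prefixes each message with a 4-byte big-endian length, so a server expecting frames will read the first four bytes of an unframed request as a length. A rough stdlib-only sketch of that framing follows; frame() and read_frame() are hypothetical helper names and this is not the Thrift library's API.

```python
import io
import struct


def frame(payload):
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!i", len(payload)) + payload


def read_frame(stream):
    """Read one length-prefixed message from a file-like byte stream."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended inside the length prefix")
    (length,) = struct.unpack("!i", header)
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("stream ended inside the message body")
    return body


if __name__ == "__main__":
    wire = frame(b"hello")
    print(repr(wire))
    print(read_frame(io.BytesIO(wire)))
```

With a real client you would not write this yourself; you would select the framed transport that your Thrift language binding provides.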
--
/ Peter Schuller
n
production environments with very large heap sizes.
--
/ Peter Schuller
ly large
individual values will be relevant to transient high memory use when
they are read/written.
In general, lacking large row caches and such things, you should be
able to have hundreds of millions of entries on an 8 GB heap, assuming
reasonably sized keys.
--
/ Peter Schuller
correlate well in time with each
other and/or something else?
--
/ Peter Schuller
> What is possibly causing this?
What is your GC grace seconds set to? Is it lower than 8-9 days, and
is it possible one or more nodes were disconnected from the remainder
of the cluster for a period longer than the GC grace seconds?
See: http://wiki.apache.org/cassandra/DistributedDeletes
--
/ Peter Schuller
ue on my
desktop which runs FreeBSD/ZFS) - but I am hoping that is specific to
the FreeBSD port
--
/ Peter Schuller
939
8.46099853516
8.44287872314
8.43000411987
8.455991745
--
/ Peter Schuller
#!/usr/bin/env python
# ugly hack, reader beware
import os
import sys
import time
COUNT = 25
BLOCK = ''.join('x' for _ in xrange(8*1024))
f = file(sys.argv[1], 'w')
for _ in xrange(COUNT)
" files?
Those are bloom filters:
http://wiki.apache.org/cassandra/ArchitectureOverview
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
--
/ Peter Schuller
"roughly consistent" is not very precise, and the original
performance on the RAID:ed device is probably also roughly consistent
with this ;)
--
/ Peter Schuller
reads or
distinct get_slice() calls; I am not expecting ordering guarantees
w.r.t. visibility within a single get_slice(). (Additionally I am
assuming QUORUM or RF=1, else it would not be useful to rely on
anyway.)
--
/ Peter Schuller
API) are sorted. Maybe there was a
client in between that did not preserve the sorting (I forget which
thread that was).
(I'm pretty sure my unit tests would have blown up by now if they're
not, but you never know...)
--
/ Peter Schuller
lexicographically previous to b? Is it aaa? Is it
aa? Whatever you pick there will always be a
column with one additional a.
In a less general case where you impose a length limit on the column
name, you're fine. But not in the general case.
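A small sketch of the argument, assuming column names compare as raw byte strings (which is how Python compares bytes objects). closer_candidate() is a hypothetical helper: whatever name you propose as "the one just before b", appending another "a" gives a name that sorts after your proposal and still before b.

```python
UPPER = b"b"


def closer_candidate(candidate):
    """Return a name that sorts after `candidate` but still before b"b"."""
    if not candidate < UPPER:
        raise ValueError("candidate must sort before b'b'")
    return candidate + b"a"


if __name__ == "__main__":
    name = b"a"
    for _ in range(4):
        closer = closer_candidate(name)
        # Each step lands strictly between the previous guess and b"b".
        print(name, "<", closer, "<", UPPER, ":", name < closer < UPPER)
        name = closer
```

The loop never runs out of closer names, which is the whole problem: there is no last one.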
--
/ Peter Schuller
en compacted, regardless of GC and whether they still remain on
disk.
--
/ Peter Schuller
of byte strings, because there is no defined maximum
possible length so the lexicographically "previous" column name might
be infinitely long).
--
/ Peter Schuller
a 'nodetool compact' was not clearing up
diskspace after bulk deletion even with GCGraceSeconds set to 0.)
--
/ Peter Schuller
, my perhaps misuse of the term "obsolete" only refers
to sstables that have been successfully compacted together with others
into a new sstable which replaces the old ones (hence making them
"obsolete"). I do not mean to imply that they contain obsolete
columns.
--
/ Peter Schuller
think, represent on-disk size. So nevermind on this bit.
--
/ Peter Schuller
're reaching the 5-7 GB per node level after a cleanup has
completed fully and all obsolete sstables have been removed, does not
necessarily help you since each future cleanup/compaction will
typically double your diskspace anyway (even if temporarily).
--
/ Peter Schuller
eep, thus triggering the deletion of old sstables,
presumably the cleanup itself would produce new sstables that would
then have to wait anyway. Unless there is some code path to avoid
doing that if nothing at all changes in the sstables as a result of
the cleanup... I don't know.)
--
/ Peter Schuller
ch is in order that would log the incident with an
explanation rather than cause exception leakage that gives the
impression of something being wrong. Thoughts?
--
/ Peter Schuller
ed the data directory in between but not the
commit logs?
--
/ Peter Schuller
re how that happened since stress.py has a proper
shebang, including in cassandra 0.6.3.
Are you running it with 'python'? Is this an old cassandra that maybe
didn't have a shebang in its py_stress (I don't know, grasping at
straws)?
--
/ Peter Schuller
d for your platform if you're on *nix. Debian is distinctly
lacking though if you're running stable, IIRC.
--
/ Peter Schuller
> Can somebody please give steps to run cassandra stress.py program.
Assuming you have the thrift compiler installed, something like (from
memory, not tested):
cd contrib/py_stress
thrift --gen py:new ../../interface/cassandra.thrift
export PYTHONPATH=$(pwd)/gen-py
python stress.py
--
/ Peter Schuller
n several things
specific to your situation, including CPU setup, storage system, data
size, locality of access, working set size, etc.
I suggest benchmarking your particular use-case.
(Not that I wouldn't find it useful to have benchmarks for certain
archetypical cases from which you can then better infer what to expect
in a particular situation.)
--
/ Peter Schuller
JVM mismatch seems likely.
--
/ Peter Schuller
confirm
> it yet)
Are you running out of memory (java heap)? If you're running cassandra
with default options, it will be running with
-XX:+HeapDumpOnOutOfMemoryError
Have you checked the cassandra system.log for garbage collection
messages? What is in the last minute or two of logs?
--
/ Peter Schuller
oking at how much memory is used after a
concurrent mark/sweep has just finished is a good way of doing this)
and then set -Xmx accordingly instead of at 12 GB.
--
/ Peter Schuller
I have about a 25-40% failure rate when the hang occurs.
Are the 1% failures you normally experience (pre-freeze) timeouts?
--
/ Peter Schuller
art - did
it start right after a schema change (keyspace addition)?
(I'm just grasping at straws based on a cursory examination; I may be
barking up the wrong tree completely.)
--
/ Peter Schuller
Index: src/java/org/apache/cassan
blocking on, before so much
time has passed that you have thousands of threads waiting.
If it *is* a matter of parts of files being unreadable it may be
easier to spot using standard I/O rather than mmap():ed I/O since you
should clearly see it blocking on a read in that case.
--
/ Peter Schuller
to control (at least weakly) the independence of resources. This
includes at least machine instances and things like EBS volumes. This
would be useful not only for scaling purposes but also redundancy
purposes. I wonder which cloud provider will be the first (or is there
one already?) to provide something like that.
--
/ Peter Schuller
speed of writes.
--
/ Peter Schuller
rse that depends on the nature of
the writes and how expensive they are relative to RTT and/or RPC.
FWIW, whenever I have needed a hard "maximum of X per second" rate
limit I have implemented or re-used a rate limiter (e.g. a token
bucket) for the language in question and used it in my client code.
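A minimal sketch of such a token bucket in Python, for illustration only. The TokenBucket class, its parameter names and the injectable clock/sleep arguments are my own choices for this sketch and do not come from any Cassandra client library.

```python
import time


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one token if available; never blocks."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def acquire(self):
        """Block until one token is available, then take it."""
        while not self.try_acquire():
            # Sleep about as long as the missing fraction takes to refill.
            self.sleep((1.0 - self.tokens) / self.rate)
```

Usage would be to create one bucket per client, e.g. TokenBucket(rate=500, capacity=50), and call acquire() right before each write the client issues.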
--
/ Peter Schuller
tten values. If you're overwriting with
larger values, it will no longer be a "doubling" relative to the
actual live data set.
Julie, did you do over-writes or were your disk space measurements
based on the state of the cluster after an initial set of writes of
unique values?
--
/ Peter Schuller
y briefly. Until, for example, a TCP connection stalls and your
entire event loop hangs due to a blocking read.
Apologies if I'm misunderstanding what you're trying to do.
--
/ Peter Schuller
ng in cache
and not having to go down to disk). If this is the case, increasing
read concurrency should at least make the actual problem more obvious
(i.e., achieving CPU saturation), though it probably won't increase
throughput much unless Cassandra is very friendly to
hyperthreading....
--
/ Peter Schuller
sed the fact that ROW-READ-STAGE
had 4096 pending and 8 active... oh well.)
--
/ Peter Schuller
> @Todd, I noticed some new ops in your cassandra.in.sh. Is there any
> documentation on what these ops are, and what they do?
>
> For instance AggressiveOpts, etc.
A fairly complete list is here:
http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp
--
/ Peter Schuller
ministic in when a
particular object may be collected. Has anyone deployed Cassandra with
G1 on very large heaps under real load?)
--
/ Peter Schuller