the oldest might be)
so that you can bound your index lookup to that value (and avoid the
tombstones)?
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
it up
(4) (quickly) use nodetool to decrease the cache sizes *on that particular node*
Once that's done everywhere and everything is stable, a permanent and
cluster-wide change of cache sizes can be done by a schema change as
usual.
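For example (from memory; check nodetool's help output on your version
for the exact argument order):

    # transient, on the affected node only (not persisted across restart):
    nodetool -h <node> setcachecapacity <keyspace> <cf> <key-cache-size> <row-cache-size>

    # permanent, cluster-wide, via a schema change in cassandra-cli:
    update column family <cf> with keys_cached = 200000 and rows_cached = 0;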
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
heap.
I did a quick calculation, and 2 billion columns with RF=2 should
result in bloom filter sizes of ~3.7 GB per node... I am probably
overlooking something though, since you say you have 4 GB of RAM;
unless you have your heap size set to almost 4 GB?
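(Back-of-envelope, assuming something like 15 bits per row key, which
is the right ballpark for the fixed false-positive-rate filters of
that era:

    2e9 keys/node * 15 bits / 8 = ~3.75e9 bytes = ~3.75 GB of bloom filter

)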
--
/ Peter Schuller (@scode, http
if it matters). That might be consistent with what you're
seeing; BF:s taking almost all of available heap size, so pretty easy
to cause OOM:s by throwing traffic on it.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
(You might be helped by
http://wiki.apache.org/cassandra/LargeDataSetConsiderations btw - it's
not entirely up to date by now... I'll try to remember to update
it.)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
start having to
scale to lots and lots of data (or writes).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
maintaining the same overall throughput or is there a
feedback mechanism such that when queries have high latency the
request rate decreases?
* Which data points are you using to consider something degraded?
What's matching in the QUORUM and LOCAL_QUORUM w/o dynsnitch cases?
--
/ Peter Schuller
If I kill the Java process and the log flush time is set to 10
seconds, will the last 10 seconds be lost?
Up to that, plus whatever delay results from scheduling jitter/overload/etc.
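(That 10 seconds would be the periodic commit log sync in
cassandra.yaml; a sketch of the relevant settings:

    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000

)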
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
lost, you might.
But on the other hand if you want a guarantee you shouldn't be using
ONE anyway since a node can crash at any time.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
nodetool drain flushes memtables before returning. It's probably safe
to kill it after drain without waiting for the commit log flush.
Oops, sorry about that. I overlooked the drain. Sorry for the misinformation!
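So the sequence would be something like (a sketch; pid handling is up
to you):

    nodetool -h localhost drain    # flushes memtables; node stops accepting writes
    kill <cassandra-pid>           # safe once drain has returned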
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
normally does.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
is for a CDN; and in your
case you imply that it's really *not* a CDN and it's just an analogy
;)
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
. This fundamentally and necessarily means that the data on
the node you brought down will be unavailable.
If you want to survive node failures, use an RF above 1. And then make
sure to use an appropriate consistency level.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
used QUORUM or ALL, any read or write would fail (QUORUM of 2 is 2).
Probably a common setup is to use RF=3 because it allows you to
survive a node going down, while also allowing you to use QUORUM. But,
whether that matters will be up to your use-case.
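For reference, the arithmetic:

    quorum(RF) = floor(RF/2) + 1
    RF=2: quorum = 2  -> no replica may be down
    RF=3: quorum = 2  -> one replica may be down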
--
/ Peter Schuller (@scode, http
writes on QUORUM at
least, even if not reads, you also avoid the problem of an
acknowledged write disappearing in case of node failures.
Are you in a position where the nodes in the cluster are wide apart
(e.g. different DC:s), for the writes to be a significant problem for
latency?
--
/ Peter
by double checking exactly which row key(s) are being
written to/read from, and whether they are truly not on the node(s)
that are down.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
of nodes, seems like kind of a
high price to pay.
Obviously I don't fully know your situation, so take that with a grain
of salt. Just my knee-jerk reaction from what I've read so far.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
I'm missing something; I've yet to actually
measure it myself ;)).
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
on situation you may not want to be using the CFQ I/O
scheduler in the kernel to begin with (required for ionice to work);
and for another you may want to throttle the process lower than
available idle I/O bandwidth because of the effects on page cache
eviction.
--
/ Peter Schuller (@scode
So how about disk I/O? Is there any way to use ionice to control it?
I have tried to adjust the priority with ionice -c3 -p [cassandra
pid], but it doesn't seem to work...
Compaction throttling (and in 1.0 internode streaming throttling) both
address disk I/O.
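The relevant cassandra.yaml knobs (values here are the usual defaults,
IIRC; double-check against the yaml shipped with your version):

    compaction_throughput_mb_per_sec: 16
    # 1.0+ only:
    stream_throughput_outbound_megabits_per_sec: 400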
--
/ Peter Schuller (@scode on twitter)
.
Google/check wiki/read docs about NetworkTopologyStrategy and
PropertyFileSnitch. I don't have a good link to multi-dc off hand
(anyone got a good link to suggest that goes through this?).
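A rough sketch (cli syntax varies a bit between versions, so verify
against the docs for yours):

    # conf/cassandra-topology.properties (PropertyFileSnitch):
    192.168.1.10=DC1:RAC1
    192.168.2.10=DC2:RAC1
    default=DC1:RAC1

    # keyspace definition in cassandra-cli:
    create keyspace myks
      with placement_strategy = 'NetworkTopologyStrategy'
      and strategy_options = [{DC1:2, DC2:2}];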
--
/ Peter Schuller (@scode on twitter)
heard that recommendation before but it makes sense
to me).
--
/ Peter Schuller (@scode on twitter)
compaction is taking place?
It would. I don't think anyone intended to say otherwise; but the
setting itself does not directly change the behavior of flushes or
reads. I think that's what jbellis meant. I.e., there is no throttling
of flushes or reads.
--
/ Peter Schuller (@scode on twitter)
compaction that is part of repair is subject to
compaction throttling.
The streaming of sstables afterwards is not however. In 1.0 there is
throttling of streaming:
https://issues.apache.org/jira/browse/CASSANDRA-3080
--
/ Peter Schuller (@scode on twitter)
change the cache sizes transiently on a single node by
using nodetool, but there is no need to do that if all you're after is
to avoid reading in row cache on startup. Simply removing the saved
caches is enough.
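E.g. (default path; adjust to your saved_caches_directory setting):

    # with the node stopped:
    rm /var/lib/cassandra/saved_caches/*RowCache*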
--
/ Peter Schuller (@scode on twitter)
I just discovered that using host names for seed nodes in
cassandra.yaml does not work. Is this done on purpose?
I believe so, yes: to avoid relying on DNS to map correctly, given that
everything else is based on IP address. (IIRC; someone chime in if
there is a different reason.)
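In other words, list IPs in cassandra.yaml, e.g. (0.8-style syntax;
addresses are illustrative):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "192.168.1.10,192.168.1.11"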
--
/ Peter
like
that. Also, if you're e.g. doing QUORUM at RF=3, if a node is down for
legitimate reasons, another node having a 6 second pause will by
necessity cause high latency for requests during that period.
[1] http://answerpot.com/showthread.php?1558705-CMS+initial+mark+pauses
--
/ Peter Schuller
to change the JVM settings
in cassandra-env.sh to use a smaller young-gen, but that will probably
cause overall GC cost to go up as more data is promoted into old gen.
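(In cassandra-env.sh; HEAP_NEWSIZE becomes -Xmn. Values here are only
illustrative:

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="400M"

)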
--
/ Peter Schuller (@scode on twitter)
-2699
And yes to the one who asked: 'cleanup' only removes data that is not
supposed to be on the node; repair transfers data that *should* be on
the node, so only a compaction will cut down the size after a
repair-induced spike in load (data size).
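That is (all standard nodetool subcommands):

    nodetool -h <node> repair    # pulls in data that *should* be on the node
    nodetool -h <node> cleanup   # removes data that should *not* be on the node
    nodetool -h <node> compact   # major compaction; reclaims the post-repair space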
--
/ Peter Schuller (@scode on twitter)
might wait for that,
or apply the fix and build from source.
--
/ Peter Schuller (@scode on twitter)
: Applying that fix should fix your problem.
--
/ Peter Schuller (@scode on twitter)
0.8.6.)
--
/ Peter Schuller (@scode on twitter)
OutOfMemory exceptions. Cassandra by default
runs with -XX:+HeapDumpOnOutOfMemoryError. If this is the case, you
probably need to increase your heap size or adjust Cassandra settings.
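E.g. (a sketch; the flag is set in the stock cassandra-env.sh IIRC,
and the heap value below is hypothetical - size it to your machine):

    grep -i OutOfMemory /var/log/cassandra/system.log   # confirm the OOM
    # in cassandra-env.sh:
    MAX_HEAP_SIZE="6G"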
--
/ Peter Schuller (@scode on twitter)
.
--
/ Peter Schuller (@scode on twitter)
the reference.
--
/ Peter Schuller (@scode on twitter)
is recommended. However, if you're
on 0.7.4, beware of
https://issues.apache.org/jira/browse/CASSANDRA-3166
--
/ Peter Schuller (@scode on twitter)
I am using 0.7.4. So is it always okay to do routine repair on a
per-column-family basis? Thanks!
It's okay, but won't do what you want; due to a bug you'll see
streaming of data for column families other than the one you're trying
to repair. This will be fixed in 1.0.
--
/ Peter Schuller
on the other node (was it restarted at the
time for example, potentially with a bad config?)
(3) are these two single-instance cassandras that have never
participated in another cluster?
--
/ Peter Schuller (@scode on twitter)
with lots of memory and a lot of disk space. But... even so.
I would not recommend huge 48 TB nodes... unless you really know what
you're doing.
In reality, more information about your use-case would be required to
offer terribly useful advice.
--
/ Peter Schuller (@scode on twitter)
with RF 1? If the latter, you need to use QUORUM consistency
level to ensure that a read sees your write.
If it's a single node and not a pycassa / client issue, I don't know off hand.
--
/ Peter Schuller (@scode on twitter)
/LargeDataSetConsiderations
(some details slightly out of date by now, I should try to update it)
Honestly though, I don't recommend going for nodes that big unless you
want to spend time dealing with issues you'll hit.
--
/ Peter Schuller (@scode on twitter)
with it. Resident pages will be counted towards the
resident size of the process, so that you can see high resident
amounts even without memory leaks.
(E.g. observe results with https://github.com/scode/alloctest)
--
/ Peter Schuller (@scode on twitter)
snippet obsolete, if there is already a
value inserted with a timestamp in the future.
--
/ Peter Schuller (@scode on twitter)
will be missing data without knowing it's
missing data. So for example (but not limited to) a read at CL.ONE
that goes to that node will fail to return data, or maybe return old
data if the missing data files contained newer versions of data that
exists elsewhere in sstables on the node.
--
/ Peter
as we've replaced
dead nodes.
Is the hadoop job iterating over keys in the cluster in token order
perhaps, and you're generating writes to those keys? That would
explain a moving hotspot along the cluster.
--
/ Peter Schuller (@scode on twitter)
(e.g. out of
disk)?
The only way I can think of other nodes directly affecting the commit
log size on your node would be e.g. hinted handoff resulting in a
burst of writes.
--
/ Peter Schuller (@scode on twitter)
.
I don't know, without investigation, why the instructions from the
wiki don't work though. You did the procedure of restarting the node
with the migrations/schema removed, right?
--
/ Peter Schuller (@scode on twitter)
Can you post the complete Cassandra log starting with the initial
start-up of the node after having removed schema/migrations?
--
/ Peter Schuller (@scode on twitter)
to confirm the write made it. Or maybe just
adding more explicit logging to your script (even if it causes some
log flooding) to prove that a write truly happened.
--
/ Peter Schuller (@scode on twitter)
with RF=3, right? Still sounds like a
lot assuming repairs are not running concurrently (and compactions are
able to run after a repair before the next repair of a neighbor
starts).
--
/ Peter Schuller (@scode on twitter)
The compaction settings do not affect repair. (Thinking out loud, or
do they? Validation compactions and sstable builds.)
They do.
--
/ Peter Schuller (@scode on twitter)
filters and indexes. Can be CPU or I/O bound (or throttled) -
nodetool compactionstats, htop, iostat -x -k 1
--
/ Peter Schuller (@scode on twitter)
pasted; periodic
repairs are part of regular cluster maintenance. See:
http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
--
/ Peter Schuller (@scode on twitter)
netstats. If there are stuck streams, they might be causing sstable
retention beyond what you'd expect.
--
/ Peter Schuller (@scode on twitter)
rebasing is necessary. You might try a
trunk from further back in time (around the time Stu submitted the
patch).
I'm not quite sure what your actual problem is though; if it's
source code access, then the easiest route is probably to check it out
from https://github.com/apache/cassandra
--
/ Peter
) that
indicates the items you specifically say are processed twice, were in
fact written twice to Cassandra?
--
/ Peter Schuller (@scode on twitter)
is if there is a very small (yet non-zero) amount of data, or a
significant amount.
--
/ Peter Schuller (@scode on twitter)
that explains the behavior
being described. Someone correct me if I'm wrong.
--
/ Peter Schuller (@scode on twitter)
take reads until it's done. This requires
that other nodes know that it is joining the ring; if you just pop it
back in they won't know that.
--
/ Peter Schuller (@scode on twitter)
cases, such as
when you're trying to temporarily off-load a node by disabling
gossip).
--
/ Peter Schuller (@scode on twitter)
discussions before:
https://issues.apache.org/jira/browse/CASSANDRA-2568
--
/ Peter Schuller (@scode on twitter)
of this.
--
/ Peter Schuller (@scode on twitter)
- hundreds of gig increase even if all
neighbors sent all their data.
Can you check system.log for references to these sstables to see when
and under what circumstances they got written?
--
/ Peter Schuller (@scode on twitter)
that ReadStage is usually full (@ your
limit)?
--
/ Peter Schuller (@scode on twitter)
with RF=3,
then for whatever ranges of the ring that node was partially
responsible for, only 2 of the 3 copies will be up/available. They do
not automatically migrate somewhere else.
--
/ Peter Schuller (@scode on twitter)
-ordination outside of Cassandra, there will be the
potential for multiple clients reading and writing without awareness
of each other. Whatever the behavior is, your data model must be such
that this is acceptable.
--
/ Peter Schuller (@scode on twitter)
that cost is not specific to
mmap():ed files (once a given page is in core that is).
(But again, I'm not arguing the point in Cassandra's case; just generally.)
--
/ Peter Schuller (@scode on twitter)
construct a benchmark where there's no difference, yet see a
significant difference in a real-world scenario when your benchmarked
I/O is intermixed with other I/O. Not to mention subtle differences in
behaviors of kernels, RAID controllers, disk drive controllers, etc...
--
/ Peter Schuller (@scode
nodes that run into this.
--
/ Peter Schuller (@scode on twitter)
effects on caches like streaming through all the bf and
index files, restarts are certainly detrimental to the page cache.
Also you may still see some eviction (even if it doesn't *necessarily*
happen) depending on circumstances (particularly if not running with
numactl set to interleave).
--
/ Peter Schuller
with respect to.)
Clocks should be synchronized, yes. But either your data model is such
that conflicting writes are okay, or you need external co-ordination.
There is no 'hoping for the best' by just keeping clocks better in synch.
--
/ Peter Schuller (@scode on twitter)
decrease read performance, as the average row can become more
spread out over multiple sstables. This is one potential driver for
compaction.
--
/ Peter Schuller (@scode on twitter)
Thanks! Then does it mean that, before compaction, if a read call
comes for that key, the sort is done at read time, since columns b, c
and a are in different SSTables?
Essentially yes; a merge-sort happens (since they are sorted locally
in each sstable).
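Illustratively:

    sstable1: X -> (b, c)      sstable2: X -> (a)
    read of X merge-sorts the runs -> (a, b, c), newest timestamp
    winning per column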
--
/ Peter Schuller (@scode on twitter)
to increase RF.
If you *really* know what you're doing and why you want RF to track
total node count, I'm sure there are *some* cases where this makes
sense. But nothing you've said so far really indicates you're in such
a position.
--
/ Peter Schuller (@scode on twitter)
really have permission
to do the mlockall()?
(Not that I disagree in any way that swap should be disabled, +1 on that.)
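E.g.:

    ulimit -l     # locked-memory limit for the user running cassandra
    # typically raised in /etc/security/limits.conf:
    cassandra  -  memlock  unlimited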
--
/ Peter Schuller (@scode on twitter)
.
--
/ Peter Schuller (@scode on twitter)
checking out:
https://issues.apache.org/jira/browse/CASSANDRA-1283
https://issues.apache.org/jira/browse/CASSANDRA-1969
If it can be done via that, the nice thing is that you don't lose consistency.
--
/ Peter Schuller (@scode on twitter)
of consistency level *never*
affects *which* nodes are responsible for a given row key, nor does it
affect which rows will eventually receive writes. It *only* affects
how many nodes must respond before the operation (read or write) is
considered successful.
Does that make it clearer?
--
/ Peter
will be
serving the requests.
--
/ Peter Schuller (@scode on twitter)
. It is not the case
that having RF be equal to the cluster size is in and of itself a
useful property.
--
/ Peter Schuller (@scode on twitter)
repair take?
So basically, leave significant margin.
--
/ Peter Schuller
dependent on some marker that isn't
written until commit log synch.)
--
/ Peter Schuller (@scode on twitter)
# wait for a bit until no one is sending it writes anymore
More accurately, until all other nodes have realized it's down
(nodetool ring on each respective host).
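E.g.:

    # on each remaining node, wait until the downed node shows as Down:
    nodetool -h <host> ring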
--
/ Peter Schuller (@scode on twitter)
?
--
/ Peter Schuller (@scode on twitter)
must be scheduled by the operator to run regularly.
The name 'repair' is a bit unfortunate; it is not meant to imply that
it only needs to be run when something is wrong.
--
/ Peter Schuller
new sstables) which is expensive for the
usual reasons with disk I/O; it's major since it covers all data.
The data read is in fact used to calculate a merkle tree for
comparison with neighbors, as claimed.
--
/ Peter Schuller
and your memory is enough to keep the hot set, and your disk I/O is
coming from the long tail, increasing the amount of cache to 200 GB
may not necessarily give you a huge improvement in terms of
percentages.
--
/ Peter Schuller
be 10 times that of the original cache.
I did a quick Google but didn't find a good piece describing it more
properly, but hopefully the above is helpful. Some related reading
might be http://en.wikipedia.org/wiki/Long_Tail
--
/ Peter Schuller
documented in the link I sent before, unless you have specific
reasons not to and know what you're doing.
--
/ Peter Schuller
--
/ Peter Schuller
/networking load and is not yet rate limited like
compaction.
In addition be aware that repair can cause disk space usage to
temporarily increase if there are significant differences to be
repaired.
--
/ Peter Schuller
to be.
--
/ Peter Schuller
choosing batch commit log sync instead of
periodic if single-node durability or post-quorum-write durability is
a concern.
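I.e. in cassandra.yaml (the batch window value is illustrative; it
trades latency for throughput):

    commitlog_sync: batch
    commitlog_sync_batch_window_in_ms: 50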
--
/ Peter Schuller
benefits from this. The only hard requirement is the
repair schedule relative to GC grace time, and that requirement does
not change - just be mindful of the timing of the EBS snapshots and
what that means to your repair schedule.
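E.g., with the default gc_grace_seconds of 864000 (10 days), a weekly
repair per node leaves margin (cron sketch; stagger across nodes):

    0 2 * * 0  nodetool -h localhost repair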
--
/ Peter Schuller
EBS volume atomicity is good. We've had tons of experience since EBS
came out almost 4 years ago, using it to back all kinds of things,
including large DBs.
And thanks a lot for coming forward with production experience. That
is always useful with these things.
--
/ Peter Schuller
about freeze maybe being probabilistically useful anyway.
--
/ Peter Schuller
this in detail. I should
really start working off the backlog of those blog entries...
--
/ Peter Schuller
certain
goals, including crash consistency. It still relies on the same
fundamental properties of the underlying storage device.
--
/ Peter Schuller
when it starts
up the thrift interface - check system.log.
--
/ Peter Schuller