I gave MySQL 5.1.38 a try (that's not the clustered version, so never mind 
what happens if a disk is lost) and I saw the same persistently increasing 
latency that I saw with Cassandra.

I also tried storing into files on the local filesystem (not clustered or 
transactional, so never mind what happens if a node fails permanently or 
temporarily).  That run apparently reached equilibrium after 47000 seconds, or 
around 13 hours.  It's still running, so I'll watch it a while longer before I 
really believe that the latency won't increase further.

It looks like I'll have to let benchmarks run for a day to determine whether I 
have true long-term performance numbers.  Yuck.

See the attached chart comparing them.  The vertical scale is milliseconds of 
latency per read, and the horizontal scale is seconds.  We're reading and 
writing 350K records of 100Kb each, at around 650 reads per minute and 780 
writes per minute, on one 4-core Z600 with one disk drive.  The "write 2 
files" part of the chart reminds me of a performance bug when writing to the 
filesystem.
 
Here's where I disagree with the following:

From: Jonathan Ellis [mailto:[email protected]] 
>I agree that [bounding the compaction backlog is] much much nicer 
>in the sense that it makes it more
>obvious what the problem is (not enough capacity) but it only helps
>diagnosis, not mitigation.

The next thing I would try is to use the local filesystem as a cache in front 
of some distributed database, perhaps Cassandra.  This is a use case where we 
could win if writes were slow, but would lose if there's an unbounded 
compaction backlog that makes reads slow.  This is an attempt to solve a real 
problem, not something contrived to win a debate.

Suppose I use the local filesystem as a cache in front of Cassandra.  The 
application would write to the local filesystem and read from the local 
filesystem.  A background task persists records from the local filesystem to 
Cassandra.  When reading, if the data isn't on the local filesystem, we get it 
from Cassandra.  The read load on Cassandra is eliminated, except when a 
replacement node is starting up.  The write load on Cassandra can be as low as 
we want, because we can update the file on the local filesystem several times 
while persisting it to Cassandra only once.  Some details are omitted here; 
this description has (I hope) the same performance as the real idea, but more 
bugs.  Never mind the bugs for now.
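
To make the shape of this concrete, here's a rough write-behind sketch in 
Python.  The cassandra_get/cassandra_put calls and the cache directory are 
placeholders I made up, not a real client API, and the dirty-set bookkeeping 
deliberately ignores the crash-consistency bugs I just waved away:

    import os
    import threading
    import time

    CACHE_DIR = "/var/cache/recordcache"   # hypothetical cache directory

    # Placeholders for a real Cassandra client; swap in whatever library
    # you actually use.
    _fake_cassandra = {}

    def cassandra_put(key, value):
        _fake_cassandra[key] = value

    def cassandra_get(key):
        return _fake_cassandra.get(key)

    dirty = set()                 # keys written locally, not yet persisted
    dirty_lock = threading.Lock()

    def _path(key):
        return os.path.join(CACHE_DIR, key)

    def write(key, value):
        # Application writes go to the local filesystem only.
        with open(_path(key), "wb") as f:
            f.write(value)
        with dirty_lock:
            dirty.add(key)        # remember to persist it later

    def read(key):
        # Reads come from the local filesystem; fall back to Cassandra
        # only on a miss (e.g. a replacement node warming up).
        try:
            with open(_path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            value = cassandra_get(key)
            if value is not None:
                with open(_path(key), "wb") as f:
                    f.write(value)   # repopulate the cache; not dirty
            return value

    def persister(interval=60):
        # Background task: push dirty records to Cassandra.  A record
        # updated several times locally gets written to Cassandra only
        # once per pass, which is where the write-load reduction comes from.
        while True:
            with dirty_lock:
                batch = list(dirty)
                dirty.clear()
            for key in batch:
                with open(_path(key), "rb") as f:
                    cassandra_put(key, f.read())
            time.sleep(interval)

    os.makedirs(CACHE_DIR, exist_ok=True)
    threading.Thread(target=persister, daemon=True).start()

The only point of the sketch is the traffic shape: reads stay local, and the 
persister collapses several local updates into a single write to Cassandra.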

If Cassandra writes things quickly, then the data in Cassandra is fresh, so 
little data is lost when a node fails and we have to redo relatively little 
work.

If Cassandra is slow to write but has bounded read time, then the data in 
Cassandra is more stale.  When a node is replaced, it can only read a stale 
version of the data that was on the original node, so we have to redo more 
work after a failure.  No big deal.

If Cassandra allows fast writes but accumulates an unbounded backlog of 
uncompacted files, then I'm in trouble.  The unbounded backlog either fills up 
the disks, or it makes reads take so long that when it comes time to recover 
from a failed node we can't really read the data that was persisted while the 
node was up.  Or perhaps I throttle writes to Cassandra based on guesses about 
how fast it can safely go.  The problem with that is that different nodes can 
probably go at different speeds, and any throttling code I write would be 
outside of Cassandra, so it would have to throttle to the speed of the slowest 
node, and discovering that speed would be awkward.  I don't know yet what 
ratio in speed to expect between nodes.
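
For what it's worth, the throttle itself is trivial; the guess it depends on 
is the problem.  Here's the sort of token-bucket limiter I'd have to bolt on 
outside Cassandra, where writes_per_sec is my guess at what the slowest node 
can sustain:

    import time

    class WriteThrottle:
        """Token bucket pacing writes to a guessed safe rate."""

        def __init__(self, writes_per_sec):
            self.rate = float(writes_per_sec)   # guess for the slowest node
            self.tokens = self.rate
            self.last = time.monotonic()

        def acquire(self):
            # Refill tokens for the elapsed time, then block until one is free.
            while True:
                now = time.monotonic()
                self.tokens = min(self.rate,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    # e.g. throttle = WriteThrottle(13)  # guessed: ~780 writes/minute
    # then throttle.acquire() before each write to Cassandra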

Tim Freeman
Email: [email protected]
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and 
Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:[email protected]] 
Sent: Friday, December 04, 2009 9:14 PM
To: [email protected]
Subject: Re: Persistently increasing read latency

On Fri, Dec 4, 2009 at 10:40 PM, Thorsten von Eicken <[email protected]> 
wrote:
>>> For the first few hours of my load test, I have enough I/O.  The problem
>>> is that Cassandra is spending too much I/O on reads and writes and too
>>> little on compactions to function well in the long term.
>>>
>>
>> If you don't have enough room for both, it doesn't matter how you
>> prioritize.
>>
>
> Mhhh, maybe... You're technically correct. The question here is whether
> cassandra degrades gracefully or not. If I understand correctly, there are
> two ways to look at it:
>
> 1) it's accepting a higher request load than it can actually process and
> builds up an increasing backlog that eventually brings performance down far
> below the level of performance that it could sustain, thus it fails to do
> the type of early admission control or back-pressure that keeps the request
> load close to the sustainable maximum performance.
>
> 2) the compaction backlog size is a primary variable that has to be exposed
> and monitored in any cassandra installation because it's a direct indicator
> for an overload situation, just like hitting 100% cpu or similar would be.
>
> I can buy that (2) is ok, but (1) is certainly nicer.

I agree that it's much much nicer in the sense that it makes it more
obvious what the problem is (not enough capacity) but it only helps
diagnosis, not mitigation.

<<attachment: comparison.png>>
