That is strange. In 0.7.0 I see this for seek:

public void seek(long pos) throws IOException
{
    this.curr_ = pos;
}
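
So in 0.7.0 the seek is "lazy": it only records the target position, and any rebuffering is deferred until the next read or write actually needs bytes outside the current window. A minimal sketch of that idea (a hypothetical reader class with made-up names, not the actual 0.7.0 BRAF code):

    // A minimal sketch of "lazy seek" (hypothetical class, made-up names,
    // not the Cassandra implementation): seek() does no I/O, the buffer is
    // only refilled when a read lands outside the buffered window.
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class LazySeekReader
    {
        private final RandomAccessFile file;
        private final byte[] buffer = new byte[8192];
        private long bufferOffset = 0;   // file offset of buffer[0]
        private int  validBytes = 0;     // number of valid bytes in buffer
        private long position = 0;       // logical position, set by seek()

        LazySeekReader(RandomAccessFile file) { this.file = file; }

        // Lazy seek: just remember where we are, no rebuffering here.
        void seek(long pos) { this.position = pos; }

        int read() throws IOException
        {
            if (position < bufferOffset || position >= bufferOffset + validBytes)
                fill();                                   // rebuffer only when needed
            if (validBytes <= 0)
                return -1;                                // end of file
            return buffer[(int) (position++ - bufferOffset)] & 0xFF;
        }

        private void fill() throws IOException
        {
            file.seek(position);
            bufferOffset = position;
            validBytes = file.read(buffer);               // -1 at end of file
        }
    }

With this shape a seek by itself costs nothing; the 0.7.1 version quoted below instead checks the window and rebuffers inside seek() itself.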

Ivan

On 21.2.2011 at 21:20, Jonathan Ellis wrote:
BRAF.seek has not changed since 0.7.0.  Here is the implementation:

     public void seek(long newPosition) throws IOException
     {
         current = newPosition;

         if (newPosition >= bufferEnd || newPosition < bufferOffset)
         {
             reBuffer(); // this will set bufferEnd for us
         }

         final int delta = (int) (newPosition - bufferOffset);
         buffer.position(delta);
     }

In other words: if we are seeking outside of the part of the file that
is currently buffered, we have to call rebuffer.
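
As a rough mental model of why each of those calls is expensive (an illustrative sketch with made-up names, not the actual BRAF internals): a rebuffer has to write back any modified bytes in the window and then refill the window from the new position, so every call can cost up to a buffer-sized write plus a buffer-sized read.

    // Illustrative model of the per-call cost of rebuffering (made-up names,
    // not the actual Cassandra code): flush dirty bytes, then refill the
    // window starting at the current logical position. bufferEnd is what the
    // quoted seek() compares against.
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class BufferedWindow
    {
        private final RandomAccessFile file;
        private final byte[] buf = new byte[64 * 1024];
        private long bufferOffset = 0;    // file offset of buf[0]
        private long bufferEnd = 0;       // exclusive end of the valid window
        private int validBytes = 0;
        private boolean dirty = false;    // window holds unwritten modifications
        long current = 0;                 // logical file position

        BufferedWindow(RandomAccessFile file) { this.file = file; }

        void reBuffer() throws IOException
        {
            if (dirty)                                // write back modified bytes
            {
                file.seek(bufferOffset);
                file.write(buf, 0, validBytes);
                dirty = false;
            }
            file.seek(current);                       // refill around the new position
            bufferOffset = current;
            validBytes = Math.max(0, file.read(buf));
            bufferEnd = bufferOffset + validBytes;
        }
    }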

Now, you might wonder why seek is being called at all in SSTW.append.
Here is the relevant part (also unchanged since 0.7.0):

         // write placeholder for the row size, since we don't know it yet
         long sizePosition = dataFile.getFilePointer();
         dataFile.writeLong(-1);
         // write out row data
         int columnCount = ColumnFamily.serializer().serializeWithIndexes(cf, dataFile);
         // seek back and write the row size (not including the size Long itself)
         long endPosition = dataFile.getFilePointer();
         dataFile.seek(sizePosition);
         dataFile.writeLong(endPosition - (sizePosition + 8));
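
The placeholder-then-patch pattern is easier to see in isolation with a plain java.io.RandomAccessFile (a standalone sketch, not the SSTableWriter code; the file name and the 123456-byte "row" are made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class PlaceholderDemo
    {
        public static void main(String[] args) throws IOException
        {
            try (RandomAccessFile out = new RandomAccessFile("row.bin", "rw"))
            {
                long sizePosition = out.getFilePointer();
                out.writeLong(-1L);                    // placeholder for the row size
                out.write(new byte[123456]);           // "row data", size unknown up front
                long endPosition = out.getFilePointer();
                out.seek(sizePosition);                // jump back to the placeholder...
                out.writeLong(endPosition - (sizePosition + 8));  // ...and patch it
                out.seek(endPosition);                 // continue appending after the row
            }
        }
    }

The interesting case for BRAF is when the bytes written between sizePosition and endPosition are larger than the buffer: the seek back to sizePosition then lands outside the buffered window.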

So basically, once your row size is larger than the BRAF buffer, you can expect a lot of rebuffer calls during flush or compaction.  Sounds to me like the difference you are seeing is caused by your row size and not by a regression in Cassandra.
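
As a back-of-envelope (my own toy model with an assumed 64 KB buffer, not a measurement of Cassandra): each row whose serialized size exceeds the buffer forces at least one extra rebuffer for the seek back to the placeholder, so the rebuffer count grows with the number of large rows flushed rather than with the bytes written.

    // Toy model of the seek-back cost above, assuming a 64 KB buffer window
    // (my assumption, not necessarily Cassandra's setting).
    public class RebufferModel
    {
        public static void main(String[] args)
        {
            long bufferSize = 64 * 1024;
            long[] rowSizes = { 1000, 10000, 100000, 1000000 };  // hypothetical row sizes
            for (long rowSize : rowSizes)
            {
                // A row larger than the buffer pushes the window past the size
                // placeholder, so seeking back to it forces at least one rebuffer.
                int extraRebuffers = rowSize + 8 > bufferSize ? 1 : 0;
                System.out.printf("row of %,d bytes -> at least %d extra rebuffer(s) per row%n",
                                  rowSize, extraRebuffers);
            }
        }
    }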

2011/2/21 Ivan Georgiev<yngw...@bk.ru>:
Some more digging. This is the code path causing the excessive rebuffer()
calls.

java.lang.Exception: Stack trace
    at java.lang.Thread.dumpStack(Unknown Source)
    at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:204)
    at
org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:394)
    at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:148)
    at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:159)
    at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
    at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:174)
    at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Basically, rebuffer() is being called through seek(). In 0.7.0, seek() does not call rebuffer(), while 0.7.1 and later do, and that is where the performance degradation is happening in my test case.
Any ideas how to alleviate the problem?

Ivan

On 21.2.2011 at 14:07, Ivan Georgiev wrote:
I did some very rough measurements in a desperate attempt to see if I can
find the issue (if there is an issue).
Since I don't know the code base well enough, I chose
BufferedRandomAccessFile as my suspect, since it was rewritten between 0.7.0
and 0.7.1.

I did rough measurements of how many times write() and rebuffer() were
called, and how much time was spent in them. The measurements
were printed out when close() was called on the file. Time measurements
are rounded to seconds. I have included the measurements only for the
large files that were flushed to disk.

0.7.0
Total time in reBuffer : 0 seconds.
Total times called reBuffer : 2
Total time in write : 0 seconds.
Total times called write : 579374

0.7.2
Time spent in reBuffer : 67 seconds.
Times called reBuffer : 30888
Time spent in write : 0 seconds.
Times called write : 1884107

Seems like rebuffer is being called a lot more in 0.7.2, and that's where
the performance degradation is coming from.

rebuffer can end up being called from write itself, but since the time
spent in write is insignificant, this excessive calling of rebuffer in
0.7.2 is not happening from the write path.
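
For reference, the counting described above needs nothing more than a per-method call counter and a nanosecond accumulator dumped at close(); a generic sketch of that kind of helper (not the exact instrumentation patch):

    import java.util.concurrent.atomic.AtomicLong;

    // Generic call-count / time-spent helper of the kind described above
    // (a sketch, not the exact patch applied to BufferedRandomAccessFile).
    class CallStats
    {
        private final AtomicLong calls = new AtomicLong();
        private final AtomicLong nanos = new AtomicLong();

        long begin() { return System.nanoTime(); }

        void end(long startNanos)
        {
            calls.incrementAndGet();
            nanos.addAndGet(System.nanoTime() - startNanos);
        }

        // Called from close(); prints totals rounded to whole seconds.
        void report(String name)
        {
            System.out.printf("Total time in %s : %d seconds.%n", name, nanos.get() / 1000000000L);
            System.out.printf("Total times called %s : %d%n", name, calls.get());
        }
    }

    // Hypothetical usage inside the instrumented method:
    //     long t = reBufferStats.begin();
    //     try { /* original reBuffer() body */ } finally { reBufferStats.end(t); }
    // and reBufferStats.report("reBuffer") from close().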

Ivan


On 20.2.2011 at 05:21, Ivan Georgiev wrote:
Is it possible that the changes to BufferedRandomAccessFile.java
could be causing the issue?
I think the most notable change there is using a ByteBuffer instead of a
byte[] for the buffer.

I will run more tests and see what comes out of it.

Ivan

On 20.2.2011 at 05:03, Jonathan Ellis wrote:
I guess you'll need to binary search through the 0.7.1 changes to find
what made the difference.  I can't think of any obvious candidates.

On Sat, Feb 19, 2011 at 10:32 AM, Ivan Georgiev<yngw...@bk.ru>   wrote:
On 19.2.2011 at 16:43, Jonathan Ellis wrote:
Flush code didn't change between 0.7.0 and 0.7.2. There must be some
other variable here. Memory pressure maybe?
Could you please elaborate on that one?
The conditions are exactly the same for the tests with 0.7.0 and 0.7.2.
By the way, the 0.7.1 tests look similar to 0.7.2, while the early 0.7 betas
through 0.7.0 are fine.

Ivan

On 19.2.2011 at 16:43, Jonathan Ellis wrote:
Flush code didn't change between 0.7.0 and 0.7.2. There must be some
other variable here. Memory pressure maybe?

On Sat, Feb 19, 2011 at 8:16 AM, Ivan Georgiev<yngw...@bk.ru> wrote:
Hi,

I am testing 0.7.2 on a Windows 2003 x64 system (one node) and I am having
the following problem.
My insertion speed is relatively slow, so the memtables do not get full and
the actual flushing is triggered by memtable_flush_after_mins; this happens
on the hour mark. My problem with 0.7.2 is that when that happens, the CPU
spikes to 25% overall usage (4x Xeon) and the operation takes anywhere from
2 to 4 minutes, leaving the node unresponsive during that time. This has
forced me to increase the rpc timeout option beyond what I feel
comfortable with.
I have run multiple tests with 0.7.0 and 0.7.2 with the same dataset and the
results are consistent: during the same operation, 0.7.0 takes about 10
seconds to complete vs. 2 to 4 minutes for 0.7.2. I am attaching a log with
the timestamps from one such flush in 0.7.2. Please let me know if there
is anything I can do to speed things up and get results similar to 0.7.0.

Regards:
Ivan

This is the log of the operations which took the most time during the flush.
With 0.7.0 and the same number of operations, the flushing takes less than
10 seconds.

INFO 01:36:44,906 Enqueuing flush of Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 34467 operations)
INFO 01:37:47,375 Completed flushing C:\Storage\data\Storage\ArchiveFiles.68617368-f-3-Data.db (5549187 bytes)
INFO 01:37:47,375 Writing Memtable-ArchiveFiles.6e616d65@978152661(1619949 bytes, 34467 operations)
INFO 01:37:47,375 Enqueuing flush of Memtable-ArchiveFiles.706172656e74466f6c6465724944@2097700961(1619949 bytes, 34467 operations)
INFO 01:38:51,343 Completed flushing C:\Storage\data\Storage\ArchiveFiles.6e616d65-f-3-Data.db (3820265 bytes)
INFO 01:38:51,343 Writing Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 34467 operations)
