[
https://issues.apache.org/jira/browse/CASSANDRA-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275865#comment-14275865
]
Ariel Weisberg edited comment on CASSANDRA-6809 at 1/13/15 8:20 PM:
--------------------------------------------------------------------
Moving conversation to JIRA
Ariel
{quote}
What is the test coverage, unit or otherwise, for compressed/uncompressed
playback? It would be good to see the new code produce the mutations out of
both compressed and uncompressed segments.
I don't think this kind of performance work should be done without a micro
benchmark, so we know where we stand and so the next person to work on the CL
can track the impact of their changes. I would really like to see a JMH test
added that evaluates the compressed and uncompressed versions for throughput
with tiny mutations as well as "normal" ones (say 1k); a rough sketch of such a
harness follows this comment.
When testing compression I have found that generating compressible data doesn't
work well. I never got the advertised compression rates out of Snappy. You might
need to grab a real corpus, like some HTML or the text of Alice in Wonderland,
to give the compressor something reasonable to chew on.
I have some concern about waiting to do compression until the end. It's fine in
the common case and configuration, but there are situations where you want to
start the flush immediately and it would be nice if it compressed and wrote a
few hundred kilobytes at a time. You haven't taken away the mapped option so
latency sensitive users can opt for that and it can be enhanced later if there
is demand.
Compression is single threaded, and any time spent in force() can't also be used
to do compression. This can really cut into compression throughput, and it is a
function of the latency of force(), which varies. With synchronous commits it
can have an impact.
The ICompressor interface and some of the implementations are also a little
limiting. You should be able to compress the data directly into the memory
mapped file; snappy-java supports that, although LZ4 doesn't appear to. It also
makes more sense to compress and then checksum. There is less data to checksum
after compression, and another advantage of getting the order right is that you
can use a JNI call to a native checksum implementation, which will be much
faster. You can use a larger checksum like Murmur3 over each compressed block
(see the sketch after this comment). The checksum will be better than a 4-byte
checksum, but use less space overall. Yet another advantage is that Murmur3
doesn't have a lookup table evicting random cache lines (on every core) as part
of every single mutation that flows through C*.
I only bring this stuff up because this is a format change, and getting it all
done at once might reduce the amount of multi-version support code that has to
be written, as opposed to making these changes incrementally across multiple
versions.
{quote}
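A rough sketch of the kind of JMH harness Ariel is asking for above, comparing
tiny and "normal" (1k) payloads. This is only an illustration: the class name,
the payload sizes, and benchmarking lz4-java directly instead of the real
commit log write path are assumptions, not anything in the patch.
{code:java}
import java.util.Random;
import java.util.concurrent.TimeUnit;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class CommitLogCompressionBench
{
    // "tiny" and "normal" mutation sizes from the discussion above
    @Param({ "64", "1024" })
    int mutationSize;

    byte[] mutation;
    byte[] scratch;
    LZ4Compressor compressor;

    @Setup
    public void setup()
    {
        // Real mutations would come from the commit log write path; random
        // bytes are a stand-in and compress poorly (see the corpus point above).
        mutation = new byte[mutationSize];
        new Random(42).nextBytes(mutation);
        compressor = LZ4Factory.fastestInstance().fastCompressor();
        scratch = new byte[compressor.maxCompressedLength(mutationSize)];
    }

    @Benchmark
    public int uncompressedCopy()
    {
        // Uncompressed baseline: just the cost of moving the bytes.
        System.arraycopy(mutation, 0, scratch, 0, mutation.length);
        return mutation.length;
    }

    @Benchmark
    public int lz4Compress()
    {
        return compressor.compress(mutation, 0, mutation.length, scratch, 0);
    }
}
{code}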
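And a sketch of the compress-then-checksum ordering from the last part of that
comment. lz4-java plus the JDK's CRC32 are used here purely as stand-ins (the
comment suggests Murmur3 over each compressed block); the framing and method
names are hypothetical, not the ICompressor API or the commit log's on-disk
format.
{code:java}
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public final class CompressThenChecksum
{
    private static final LZ4Compressor COMPRESSOR =
            LZ4Factory.fastestInstance().fastCompressor();

    /**
     * Compress a block, then checksum the (smaller) compressed output and
     * append length + checksum + data to the target buffer.
     */
    public static void writeBlock(byte[] block, int length, ByteBuffer out)
    {
        byte[] compressed = new byte[COMPRESSOR.maxCompressedLength(length)];
        int compressedLength = COMPRESSOR.compress(block, 0, length, compressed, 0);

        // Checksumming after compression means fewer bytes to checksum;
        // CRC32 is only a JDK stand-in for the Murmur3 suggested above.
        CRC32 checksum = new CRC32();
        checksum.update(compressed, 0, compressedLength);

        out.putInt(compressedLength);
        out.putInt((int) checksum.getValue());
        out.put(compressed, 0, compressedLength);
    }
}
{code}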
Benedict
{quote}
I'm not really sure JMH is the right tool for disk bound workloads?
To add to Ariel's excellent points, another thing to consider is recycling the
ByteBuffer we use, and potentially also using a DirectByteBuffer. As it stands
this will create a non-trivial amount of garbage that may live long enough to
be promoted (what with being disk bound).
{quote}
Ariel
{quote}
JMH doesn't solve the hard problems of disk bound workloads, certainly. It
doesn't hurt either, and it has some support for parameterizing and reporting. I
didn't know about CommitLogStress when I asked about JMH. We don't need to
build something new if something is already in place.
I thought about the ByteBuffer pooling and the lack of DBB. ICompressor doesn't
accept DBB for input or output. The allocation is large, so it will go straight
into the old generation. C* runs CMS pretty regularly, but maybe not that
regularly. It might make sense to pool two of them and, if any more are needed
beyond that, to allocate and discard (a pooling sketch follows this comment).
Since the allocation is CL segment size, it feels like a lot of memory to pin
for one purpose.
{quote}
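A minimal sketch of the "pool two, allocate and discard beyond that" idea from
Ariel's comment. The class name and the 32 MB size (the
commitlog_segment_size_in_mb default) are illustrative assumptions; heap
buffers are used because, as noted above, ICompressor did not yet accept
DirectByteBuffers.
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public final class SegmentBufferPool
{
    // Commit log segment size; 32 MB for illustration.
    private static final int BUFFER_SIZE = 32 * 1024 * 1024;
    private static final int MAX_POOLED = 2;

    private final ConcurrentLinkedQueue<ByteBuffer> pool = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pooled = new AtomicInteger();

    public ByteBuffer acquire()
    {
        ByteBuffer buffer = pool.poll();
        if (buffer != null)
        {
            pooled.decrementAndGet();
            buffer.clear();
            return buffer;
        }
        // Beyond the pooled pair, allocate and let the GC reclaim on release.
        return ByteBuffer.allocate(BUFFER_SIZE);
    }

    public void release(ByteBuffer buffer)
    {
        // Keep at most MAX_POOLED instances; discard the rest.
        if (pooled.incrementAndGet() <= MAX_POOLED)
            pool.offer(buffer);
        else
            pooled.decrementAndGet();
    }
}
{code}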
Benedict
{quote}
CommitLogStress is not very good, but JMH is architected around microbenchmarks
or very small operations, which this doesn't really fit.
We now support DBB in LZ4, so there's no reason not to roll it out here.
Pooling as many as we need makes sense to me; no reason to constrain ourselves
to just two? Perhaps periodically prune if instances aren't being used. We're
currently pinning as much memory as we're ahead of disk by, although admittedly
here we only need it so long as we're writing to the page cache. Which is
perhaps another optimisation: the buffer should be recycled/discarded as soon
as we've compressed its contents (see the sketch after this comment).
{quote}
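A sketch of the recycle-as-soon-as-compressed idea in Benedict's comment,
assuming "DBB in LZ4" refers to lz4-java's ByteBuffer overloads. The method and
the queue-as-pool are hypothetical, and a real implementation would pool the
target buffer as well rather than allocate it here.
{code:java}
import java.nio.ByteBuffer;
import java.util.Queue;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public final class DirectBufferCompression
{
    private static final LZ4Compressor COMPRESSOR =
            LZ4Factory.fastestInstance().fastCompressor();

    /**
     * Compress the accumulated writes from a pooled direct source buffer into a
     * direct target buffer, then return the source to the pool immediately: it
     * is only needed until its contents have been compressed.
     */
    public static ByteBuffer compressAndRecycle(ByteBuffer source, Queue<ByteBuffer> pool)
    {
        source.flip();
        ByteBuffer target = ByteBuffer.allocateDirect(
                COMPRESSOR.maxCompressedLength(source.remaining()));
        COMPRESSOR.compress(source, target);   // ByteBuffer-to-ByteBuffer overload
        target.flip();
        pool.offer(source);                    // recycle the source right away
        return target;
    }
}
{code}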
> Compressed Commit Log
> ---------------------
>
> Key: CASSANDRA-6809
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6809
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Benedict
> Assignee: Branimir Lambov
> Priority: Minor
> Labels: performance
> Fix For: 3.0
>
> Attachments: logtest.txt
>
>
> It seems an unnecessary oversight that we don't compress the commit log.
> Doing so should improve throughput, but some care will need to be taken to
> ensure we use as much of a segment as possible. I propose decoupling the
> writing of the records from the segments. Basically write into a (queue of)
> DirectByteBuffer, and have the sync thread compress, say, ~64K chunks every X
> MB written to the CL (where X is ordinarily CLS size), and then pack as many
> of the compressed chunks into a CLS as possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)