[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

Stefania (JIRA) Tue, 25 Aug 2015 01:56:11 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710918#comment-14710918
 ]


Stefania commented on CASSANDRA-8630:
-------------------------------------

bq. DataInputBuffer line 25, NIODataInputStream no longer has the bytes 
shuffling behavior so that comment should go away.

Not sure if I found the comment you are referring to, you mean the description 
of the class {{Input stream around a fixed ByteBuffer}}? I can change it to 
{{Input stream around a single ByteBuffer}}, is this what you meant?

bq. RebufferingInputStream copy constructor appears unused (or Eclipse is 
lying). It's also looks suspicious since it doesn't inherit the rebuffering 
behavior of whatever it is copying?

Left over from previous attempts, definitely not needed, thanks.

bq. Does ChecksummedDataInput handle files larger than 2 gigabytes? Seems like 
we could end up with large hint files? The way the file based hints loop is 
written it seems like it could do it. Possibly unintentionally.

Each single hint cannot be more than 2GB because its size is encoded as an 
integer (existing behavior). {{ChecksummedDataInput}} should be fine except for 
the handling of limits, which are used only for single hints. I will change 
{{ChecksummedDataInput.limit}} to a long for future safety and move the checked 
cast to {{readHint}}. {{crcPosition}} is an integer but this is fine as it 
caches the last buffer position, which is also an integer. It is reset each 
time we rebuffer.

bq. The CoW idiom used for MmappedRegions seems a little off. It's making a 
copy on read so every SSTableReader (they aren't shared globally I believe) 
will have a separate deep copy of the entire MmappedRegions. I know this is 
tricky and you probably get it better than I do, but can you get it so that the 
same array is shared? Ideally both the arrays and the State object will be 
shared. Looking at how the refcounting is supposed to work

The readers are potentially used in different threads. Once 
Lifecycletransaction.checkpoint() publishes a new view in the tracker, we can 
potentially use the tracker in different threads, due to various asynchronous 
tasks. At least this is what I understand. The builder lives in the 
BigTableWriter, each time we open an sstable early it will create a new sstable 
reader which shares the same builder with the previous readers and the final 
reader - hence they share the channel and the mmapped regions if any. The 
readers created in open early are published and obsoleted by the lifecycle 
transaction so MT access is possible. This is my understanding, perhaps 
[~benedict] can fill any gaps?

The whole idea was to ensure copies do not modify the arrays. It's true that we 
could share the arrays as we enforce the non-modification via the {{isCopy}} 
flag, so I am happy to avoid the deep copy. Once we implement compaction of 
mmapped regions, this will become trickier and sharing arrays may not be 
possible any longer. However, in this case we'll also have bigger problems like 
tracking the old non-compacted regions that are still used.

bq. The fact that MmappedRegions and it's owning MmappedSegmentedFile both are 
SharedClosables seems odd to me. Seems like only one of them needs to determine 
the lifetime of the whole shebang.

The segmented files and the builders own the MmappedRegions, just like for the 
channel. Sometimes the builders are closed before the segmented files, and the 
mmapped regions need to survive. It's exactly the same as for the channel. 
These two resources, channel and mmapped regions, need to have their own 
resource management independent of their owners.

bq. For rate limiting. It seems like we acquire buffer size from the rate 
limiter at a time. What is the potential distribution of buffer sizes and how 
reasonable are they? It seems like they can vary with the statistics of a file. 
Since we got into trouble with rate limiting once I just want to be sure there 
isn't a corner case where it can be a problem again.

Buffer sizes can vary with the statistics of the files, since CASSANDRA-8894. 
For data files they are a multiple of the page size and approximately equal to 
the 95th percentile of the partition size. So far we've throttled based on the 
buffer size and not at all for mmapped segments, we also had fixed buffer sizes 
of 64k up to CASSANDRA-8894. What alternatives would we have if we did not 
throttle based on the buffer size? We cannot throttle for every read method in 
{{RebufferingInputStream}} or can we?

I will work on the missing tests and submit the remaining code fixes tomorrow, 
thanks!

> Faster sequential IO (on compaction, streaming, etc)
> ----------------------------------------------------
>
>                 Key: CASSANDRA-8630
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core, Tools
>            Reporter: Oleg Anastasyev
>            Assignee: Stefania
>              Labels: compaction, performance
>             Fix For: 3.x
>
>         Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz, flight_recorder_002_files.tar.gz, 
> mmaped_uncomp_hotspot.png
>
>
> When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot 
> of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int).
> This is because default implementations of readShort,readLong, etc as well as 
> their matching write* are implemented with numerous calls of byte by byte 
> read and write. 
> This makes a lot of syscalls as well.
> A quick microbench shows than just reimplementation of these methods in 
> either way gives 8x speed increase.
> A patch attached implements RandomAccessReader.read<Type> and 
> SequencialWriter.write<Type> methods in more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> A stress tests on my laptop show that this patch makes compaction 25-30% 
> faster  on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production, orange is niced CPU 
> load - i.e. compaction; yellow is user - i.e. not compaction related tasks)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

Reply via email to