[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642479#comment-14642479 ]

Benedict edited comment on CASSANDRA-8630 at 7/27/15 9:22 AM:
--------------------------------------------------------------

I think we can avoid static methods by making {{MemoryInputStream}} extend 
{{ByteBufferInput}} or {{NIODataInputStream}} (this is actually pretty easy, 
since we can construct a {{ByteBuffer}} from a {{Memory}} instance at very low 
cost, and can "refill" that {{ByteBuffer}} on a {{readNext}}).

I think this should ideally be based on {{NIODataInputStream}} if that's not 
too challenging (which I hope it will not be), so that we can keep one main 
implementation of {{DataInputPlus}}. And we can indeed drop 
{{AbstractDataInput}} at that point.
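
To illustrate roughly what I mean, here's a minimal sketch (hypothetical names 
throughout; a direct {{ByteBuffer}} stands in for {{Memory}}, and this is not 
the real {{NIODataInputStream}} API):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of the "refill a ByteBuffer view on readNext" idea.
// A ByteBuffer stands in for Cassandra's Memory; the real class would
// construct its window as a ByteBuffer view over the Memory address range.
public class RefillingMemoryInput
{
    private final ByteBuffer memory;  // the whole off-heap region
    private final int windowSize;     // how much we expose per refill
    private ByteBuffer buffer;        // current window over `memory`
    private int offset;               // start of the next window in `memory`

    public RefillingMemoryInput(ByteBuffer memory, int windowSize)
    {
        this.memory = memory;
        this.windowSize = windowSize;
        this.buffer = ByteBuffer.allocate(0); // forces a refill on first read
    }

    // analogous to readNext(): no copying, just a new view over the region
    private void refill() throws EOFException
    {
        if (offset >= memory.limit())
            throw new EOFException();
        int end = Math.min(offset + windowSize, memory.limit());
        ByteBuffer window = memory.duplicate();
        window.position(offset).limit(end);
        buffer = window.slice();
        offset = end;
    }

    public byte readByte() throws IOException
    {
        if (!buffer.hasRemaining())
            refill();
        return buffer.get();
    }
}
{code}

The point being that a refill is only a view update, never a copy, so basing 
{{MemoryInputStream}} on the buffered implementation should cost very little.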
 
A somewhat-circular buffer would be ideal. However, we cannot currently have 
buffers larger than 64Kb, so for reads already using 64Kb we would have to 
shrink our read size or grow our maximum buffer size. For smaller buffers we 
would still have to shuffle bytes around and waste quite a bit of allocation 
(for this to work, our reads of objects much smaller than 4Kb would still need 
8Kb buffers allocated).

I wonder if for primitive reads we shouldn't just have a slow path (like we've 
now opted for with vints) that reads byte-by-byte at a boundary crossing 
(returning a long that can be cast). If we force the method call to _not_ be 
inlined, the cost should be imperceptible with the help of branch prediction, 
since it should almost never be hit (given we probabilistically guarantee 
covering the entire range in our buffer), and will not pollute the icache. This 
has the added bonus of being much easier to understand, and probably easier to 
implement safely. (edit: it also has the advantage of letting us greatly 
simplify the boundary construction for {{MappedByteBuffer}}, since we do not 
need to guarantee that all of a partition is present in one buffer)
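
A sketch of that slow path, continuing the illustrative class above (again 
hypothetical; note HotSpot can't be told from source not to inline a method, 
so in practice we'd rely on the branch almost never being taken, or on 
{{-XX:CompileCommand=dontinline}}):

{code:java}
// Byte-by-byte slow path, hit only when a primitive read straddles a buffer
// boundary; readByte() (above) refills the view as each byte is consumed.
// Accumulates big-endian, matching ByteBuffer's default byte order.
private long readPrimitiveSlowly(int size) throws IOException
{
    long result = 0;
    for (int i = 0; i < size; i++)
        result = (result << 8) | (readByte() & 0xFF);
    return result;
}

public long readLong() throws IOException
{
    if (buffer.remaining() >= 8)
        return buffer.getLong();        // fast path: almost always taken
    return readPrimitiveSlowly(8);      // boundary crossing
}

public int readInt() throws IOException
{
    if (buffer.remaining() >= 4)
        return buffer.getInt();
    return (int) readPrimitiveSlowly(4); // "a long that can be cast"
}
{code}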

However if you think you can deliver a neat circular-buffer approach, feel free 
to give it a try.



> Faster sequential IO (on compaction, streaming, etc)
> ----------------------------------------------------
>
>                 Key: CASSANDRA-8630
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core, Tools
>            Reporter: Oleg Anastasyev
>            Assignee: Stefania
>              Labels: compaction, performance
>             Fix For: 3.x
>
>         Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc.), a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc., as 
> well as their matching write* methods, are implemented as numerous 
> byte-by-byte read and write calls.
> This also makes a lot of syscalls.
> A quick microbenchmark shows that just reimplementing these methods gives an 
> 8x speed increase.
> The attached patch implements the RandomAccessReader.read<Type> and 
> SequentialWriter.write<Type> methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster on compressed ones.
> A deployment to production shows much lower CPU load for compaction. 
> (I attached a CPU load graph from one of our production nodes; orange is 
> niced CPU load, i.e. compaction; yellow is user CPU, i.e. tasks not related 
> to compaction.)


