[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868091#comment-15868091
 ] 

Ariel Weisberg commented on CASSANDRA-8457:
-------------------------------------------

{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}
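To make the quoted idea concrete, here is a minimal sketch of a buffer-swapping output stream. It is an illustration only, not the actual SwappingByteBufDataOutputStreamPlus: the class name, buffer size, and the Consumer sink (standing in for a Netty channel write such as ctx.writeAndFlush) are all assumptions.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Hypothetical sketch: writes fill a backing ByteBuffer; when it is full,
// the buffer is handed to a downstream sink (standing in for a Netty
// channel write) and a fresh buffer is allocated for further writes.
class SwappingBufferOutputStream extends OutputStream
{
    private final int bufferSize;
    private final Consumer<ByteBuffer> sink; // e.g. ctx.writeAndFlush(...)
    private ByteBuffer current;

    SwappingBufferOutputStream(int bufferSize, Consumer<ByteBuffer> sink)
    {
        this.bufferSize = bufferSize;
        this.sink = sink;
        this.current = ByteBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b)
    {
        if (!current.hasRemaining())
            swap();
        current.put((byte) b);
    }

    @Override
    public void flush()
    {
        if (current.position() > 0)
            swap();
    }

    private void swap()
    {
        current.flip();
        sink.accept(current);                      // hand the filled buffer downstream
        current = ByteBuffer.allocate(bufferSize); // replace rather than reuse
    }
}
```

Small messages written back-to-back accumulate in one buffer until it fills, which is the batching effect described above (e.g. for Mutation acks).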

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread, you can't block inside a serialization method 
waiting for the bytes to drain to a socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread, then it's no longer asynchronous, and a slow 
connection can block the application and prevent it from doing, say, a quorum 
write to just the fast nodes. You would also need to lock during serialization, 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really fully eliminating the issue, only making 
it a factor better. At the receiving end you still need to materialize a buffer 
containing the message plus the object graph you are going to materialize. This 
is different from how things worked previously, where we had a dedicated thread 
that would read fixed-size buffers and then materialize just the object graph 
from that.

To really solve this we need to be able to avoid buffering the entire message 
on both the sending and receiving sides. The buffering is worse than just 
doubling the space impact because we are allocating contiguous memory. We could 
make it incrementally better by using chains of fixed-size buffers so there is 
less external fragmentation and allocator overhead. That's still committing 
additional memory compared to pre-8457, but at least it's being committed in a 
more reasonable way.
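The chained fixed-size buffer idea could look something like the following sketch. The class name, chunk size, and API are hypothetical; the point is that every allocation comes from one uniform size class, so total committed memory is unchanged but external fragmentation and allocator overhead shrink.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: buffer a message as a chain of fixed-size chunks
// instead of one contiguous allocation.
class BufferChain
{
    static final int CHUNK_SIZE = 4096; // one fixed size class (assumed)
    private final Deque<ByteBuffer> chunks = new ArrayDeque<>();

    void write(byte[] src, int off, int len)
    {
        while (len > 0)
        {
            ByteBuffer tail = chunks.peekLast();
            if (tail == null || !tail.hasRemaining())
            {
                tail = ByteBuffer.allocate(CHUNK_SIZE); // always the same size class
                chunks.addLast(tail);
            }
            int n = Math.min(len, tail.remaining());
            tail.put(src, off, n);
            off += n;
            len -= n;
        }
    }

    long size()
    {
        return chunks.stream().mapToLong(ByteBuffer::position).sum();
    }

    int chunkCount()
    {
        return chunks.size();
    }
}
```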

I think the most elegant solution is to use a lightweight thread 
implementation. What we will probably be boxed into doing is making the 
serialization of result data and other large message portions able to yield. 
This will bound the memory committed to large messages to the largest atomic 
portion we have to serialize (Cell?).

Something like an output stream being able to say "shouldYield". If you 
continue to write, it will continue to buffer and not fail, but it will use 
memory. Then serializers can implement a return value for serialize which 
indicates whether there is more to serialize. You would check shouldYield after 
each Cell or some similar unit of work when serializing. Most of these large 
things being serialized are iterators. The trick is that most serialization is 
stateless, and objects are serialized concurrently, so you can't safely store 
the serialization state in the object being serialized.
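A minimal sketch of what that could look like, with all names hypothetical: serialize returns false when it yields with work left, and the iterator cursor lives in a per-invocation task rather than in the shared object being serialized.

```java
import java.util.Iterator;

// Hypothetical output abstraction: writes keep buffering even when
// shouldYield() is true; it is only a hint that the socket is backed up.
interface YieldingOutput
{
    void writeInt(int v);
    boolean shouldYield();
}

// Per-invocation task: the cursor is kept here, not in the (possibly
// concurrently serialized) source object.
class IteratorSerializer
{
    private final Iterator<Integer> remaining;

    IteratorSerializer(Iterator<Integer> source)
    {
        this.remaining = source;
    }

    /** @return true when fully serialized, false if it yielded with work left */
    boolean serialize(YieldingOutput out)
    {
        while (remaining.hasNext())
        {
            out.writeInt(remaining.next()); // one atomic unit of work (a Cell, say)
            if (out.shouldYield())
                return false; // resume later from the same cursor
        }
        return true;
    }
}
```

The caller simply re-invokes serialize on the same task when the connection drains, so memory committed to a large message is bounded by the largest atomic unit.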

> nio MessagingService
> --------------------
>
>                 Key: CASSANDRA-8457
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: netty, performance
>             Fix For: 4.x
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
