On 16/02/2018 22:55, Richard Warburton wrote:
:
I think some of the context here around application level memory management
of the byte buffers is missing. The getBytes() methods were supposed to be
useful in scenarios where you have a String and you want to write the byte
encoded version of it down some kind of connection. So you're going to take
the byte[] or ByteBuffer and hand it off somewhere - either to a streaming
protocol (eg: TCP), a framed message based protocol (eg: UDP, Aeron, etc.)
or perhaps to a file. A common pattern for dealing with this kind of buffer
is to avoid trying to allocate a new ByteBuffer for every message and to
encode onto the existing buffers before writing them into something else,
for example a NIO channel.

The review is completely correct that the API's user needs to know when the
ByteBuffer isn't large enough to deal with the encoding, but I think there
are two strategies for expected API usage here if you run out of space.

1. Find out how much space you need, grow your buffer size and encode onto
the bigger buffer. So this means that in the failure case the user ideally
gets to know how big a buffer you need. I think this still works in terms
of mitigating per message buffer allocation as in practice it means that
you only allocate a larger buffer when a String is encoded that is longer
than any String you've seen before. It isn't strictly necessary to know
how big a buffer is needed btw - as long as failure is indicated, an API
user could employ a strategy like doubling the buffer size and retrying.
I think that's suboptimal to say the least, however, and knowing how big
a buffer needs to be is desirable.

2. Just write the bytes that you've encoded down the stream and retry with
an offset incremented by the number of characters written. This requires
that the getBytes() method encodes whole characters only, rather than
running out of space partway through a character that takes up multiple
bytes, and that it also takes a "source offset" parameter, say the number
of characters into the String that you are. This would work perfectly well
in a streaming protocol. If your buffer size is N, you encode at most N
characters and write them down your Channel in a retry loop. Anyone dealing
with async NIO is probably familiar with the concept of having a retry
loop. It may also work perfectly well in a framed message based protocol.
In practice any network protocol that has fixed-size framed messages and
deals with arbitrary size encodings has to have a way to fragment
longer-length blobs of data into its fixed size messages.
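
As a rough sketch of strategy (2): no getBytes() variant with a source
offset exists today, so the example below uses the existing CharsetEncoder
as a stand-in, with the CharBuffer position playing the role of the source
offset. The class and method names (ChunkedStringWriter, writeChunked) are
made up for illustration, and UTF-8 is assumed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a JDK API: streams a String through one
// reusable, fixed-size buffer instead of allocating per message.
final class ChunkedStringWriter {

    /** Writes s as UTF-8 to out via buf; returns the number of chunks written. */
    static int writeChunked(String s, ByteBuffer buf, WritableByteChannel out)
            throws IOException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s); // in.position() acts as the source offset
        int chunks = 0;
        CoderResult result;
        do {
            buf.clear();
            // The encoder stops before a character that does not fit,
            // so a chunk never ends in the middle of a character.
            result = encoder.encode(in, buf, true);
            if (result.isError()) {
                result.throwException(); // e.g. an unpaired surrogate
            }
            if (result.isOverflow() && buf.position() == 0) {
                throw new IllegalArgumentException("buffer too small for one character");
            }
            buf.flip();
            if (buf.hasRemaining()) {
                chunks++;
            }
            while (buf.hasRemaining()) {
                out.write(buf); // retry loop: a channel may accept only part of the buffer
            }
        } while (result.isOverflow());
        // UTF-8 keeps no trailing state; a stateful charset would also need
        // encoder.flush(buf) and one more write here.
        return chunks;
    }
}
```

With a 4-byte buffer a longer String goes out as several chunks, and the
bytes on the wire match what String.getBytes(UTF_8) would have produced.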

I think either strategy for dealing with failure is valid. The problem is
that if the API uses the return value to indicate failure, which I think is
a good idea in a low-level, performance-oriented API, then it's difficult
to offer both choices to the user. (1) needs the failure return code to be
the number of bytes required for encoding. (2) needs the failure return
code to indicate how far into the String you are in order to retry. I
suspect, given this tradeoff, that Sherman's suggestion of using a -length
(required number of bytes) return value is a good idea, and that we should
just assume API users only attempt (1) as a solution to the
too-small-buffer failure.
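
To make the -length convention concrete, here is a minimal sketch of
strategy (1). The getBytes(String, ByteBuffer) method below is a
hypothetical stand-in for the proposed API, not something in the JDK, and
it allocates a byte[] internally purely to illustrate the return contract;
a real implementation would encode straight into the buffer. UTF-8 is
assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "-length on failure" contract; not a JDK API.
final class NegativeLengthSketch {

    /**
     * Returns the number of bytes written to dst, or -(bytes required)
     * if dst does not have enough room. Nothing is written on failure.
     */
    static int getBytes(String s, ByteBuffer dst) {
        // Allocating here defeats the purpose of the real API; it only
        // keeps the illustration short.
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > dst.remaining()) {
            return -encoded.length;
        }
        dst.put(encoded);
        return encoded.length;
    }

    /**
     * Strategy (1): reuse buf when it is big enough, otherwise replace it
     * with a buffer of exactly the required size. The returned buffer is
     * flipped and ready to be written to a channel.
     */
    static ByteBuffer encodeGrowing(String s, ByteBuffer buf) {
        buf.clear();
        int n = getBytes(s, buf);
        if (n < 0) {
            buf = ByteBuffer.allocate(-n); // grow once, to the size reported
            getBytes(s, buf);
        }
        buf.flip();
        return buf;
    }
}
```

A caller keeps whatever buffer encodeGrowing returns for the next message,
so a new allocation happens only when a String needs more bytes than the
current buffer holds.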

Just to add that the existing low-level / advanced API for this is CharsetEncoder. The CoderResult from an encode and the buffer positions mean you know when there is overflow, the number of characters encoded, and how many bytes were added to the buffer. It also gives fine control over how encoding errors should be handled, and you can cache a CharsetEncoder to avoid some of the performance anomalies that come up in the Charset vs. charset name discussions. This is not an API that most developers will ever use directly, but if the use-case is advanced (libraries or frameworks doing their own memory management, as you mention above) then it might be an alternative to look at to avoid adding advanced use-case APIs to String. I don't think an encode(String, ByteBuffer) would look out of place, although it would need a way to return the characters encoded count as part of the result.
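
For illustration only, something along these lines can already be built on CharsetEncoder. The class name CachedEncoderSketch and the choice to return the characters-encoded count directly are assumptions made for this sketch, not a proposed signature, and UTF-8 with REPLACE error handling is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper, not a JDK API. Not thread-safe: a CharsetEncoder
// holds state, so keep one instance per thread or per connection.
final class CachedEncoderSketch {

    // Cached once and reused for every String.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    /**
     * Encodes as much of s as fits into dst and returns the number of
     * chars consumed. A result smaller than s.length() means overflow;
     * the bytes added are visible as the change in dst.position().
     */
    int encode(String s, ByteBuffer dst) {
        CharBuffer in = CharBuffer.wrap(s);
        encoder.reset();
        CoderResult result = encoder.encode(in, dst, true);
        if (result.isUnderflow()) {
            encoder.flush(dst); // no-op for UTF-8, needed for stateful charsets
        }
        return in.position();
    }
}
```

The caller compares the returned count with s.length() to detect overflow and can resume from that character offset after draining the buffer.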

-Alan.
