Re: Compact Strings and APIs for fast decoding of string data

Paul Sandoz Wed, 10 Feb 2016 00:56:16 -0800

Hi,

A more functional approach would be to compose a sequence of buffers into one 
view, perhaps read-only. Then there would be no need to accept arrays of 
buffers. That should work well for bulk operations. That’s a non-trivial but 
not very difficult amount of work, and possibly simplified if restricted to 
read-only views.
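
As a rough illustration of the composed view, here is a minimal sketch. CompositeView is a hypothetical name, not a JDK class, and it only supports an absolute bulk get; a real implementation would need far more care:

```java
import java.nio.ByteBuffer;

/** Hypothetical read-only view over a sequence of buffers; not a JDK class. */
final class CompositeView {
    private final ByteBuffer[] parts;
    private final int[] starts; // absolute start offset of each part; last entry is the total length

    CompositeView(ByteBuffer... srcs) {
        parts = new ByteBuffer[srcs.length];
        starts = new int[srcs.length + 1];
        for (int i = 0; i < srcs.length; i++) {
            // slice() and asReadOnlyBuffer() leave the caller's position/limit untouched
            parts[i] = srcs[i].slice().asReadOnlyBuffer();
            starts[i + 1] = starts[i] + parts[i].remaining();
        }
    }

    int length() {
        return starts[parts.length];
    }

    /** Bulk copy of len bytes, starting at absolute offset off, into dst. */
    void get(int off, byte[] dst, int dstOff, int len) {
        if (off < 0 || len < 0 || len > length() - off) {
            throw new IndexOutOfBoundsException();
        }
        int p = 0;
        while (p < parts.length && starts[p + 1] <= off) {
            p++; // skip parts that end before off
        }
        while (len > 0) {
            ByteBuffer b = parts[p].duplicate();
            b.position(off - starts[p]);
            int n = Math.min(len, b.remaining());
            b.get(dst, dstOff, n); // one bulk copy per underlying buffer
            off += n;
            dstOff += n;
            len -= n;
            p++;
        }
    }
}
```

Restricting the view to reads keeps the bookkeeping to a table of start offsets, which is the simplification hinted at above.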


Thus I think we should focus Richard’s work on:

  String(ByteBuffer src, String charset)

and perhaps a sub-range variant, if perturbing the position/limit of an 
existing buffer and/or slicing is too problematic.
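
For comparison, a sub-range decode that leaves the caller's position/limit alone can be approximated with today's public APIs, at the cost of a view object and an intermediate CharBuffer. This is only a sketch; BufferStrings and its decode method are hypothetical names, and it takes a Charset rather than a charset name for brevity:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

final class BufferStrings {
    /**
     * Hypothetical helper: decodes the bytes in src[off, off + len) without
     * changing the position or limit of src itself.
     */
    static String decode(ByteBuffer src, int off, int len, Charset cs) {
        ByteBuffer view = src.duplicate(); // shares content, independent position/limit
        view.limit(off + len);
        view.position(off);
        // Charset.decode allocates a CharBuffer, and toString() copies again;
        // these are the extra steps a dedicated constructor could avoid.
        return cs.decode(view).toString();
    }
}
```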

—

Zeroing memory and possibly avoiding it can be tricky. Any such optimisations 
have to be carefully performed otherwise uninitialised regions might leak and 
be accessed, nefariously or otherwise. I imagine it’s easier to contain/control 
within a constructor than say a builder.

Paul.

> On 10 Feb 2016, at 05:38, Xueming Shen <xueming.s...@oracle.com> wrote:
> 
> Hi Chris,
> 
> I think basically you are asking for a String constructor that takes a
> ByteBuffer. StringCoding can then take advantage of the current CompactString
> design to optimize the decoding operation into just a single
> byte[]/vectorized memory copy from the ByteBuffer to the String's internal
> byte[], WHEN the charset is 8859-1.
> 
> String(ByteBuffer src, String charset);
> 
> Further we will need a "buffer gathering" style constructor
> 
> String(ByteBuffer[] srcs, String charset);
> (or more generally,
> String(ByteBuffer[] srcs, int off, int len, String charset))
> 
> to create a String object from a sequence of ByteBuffers, if it's really 
> desired.
> 
> And then I would assume it will also be desired to extend the current
> CharsetDecoder/Encoder classes to add a pair of "gathering" style coding
> methods
> 
> CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
> ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
> 
> Though the implementation might have to deal with the tricky "splitting
> byte/char" issue, in which part of the "byte/char sequence" is in the previous
> buffer and the continuing byte/chars are in the next following buffer ...
> 
> -Sherman
> 
> 
> On 2/9/16 7:20 AM, Chris Vest wrote:
>> Hi,
>> 
>> Aleksey Shipilev did a talk on his journey to implement compact strings and 
>> indified string concat at the JVM Tech Summit yesterday, and this reminded 
>> me that we (Neo4j) have a need for turning segments of DirectByteBuffers 
>> into Strings as fast as possible. If we already store the string data in 
>> Latin1, which is one of the two special encodings for compact strings, we’d 
>> ideally like to produce the String object with just the two necessary object 
>> allocations and a single, vectorised memory copy.
>> 
>> Our use case is that we are a database and we do our own file paging, 
>> effectively having file data in a large set of DirectByteBuffers. We have 
>> string data in our files in a number of different encodings, a popular one 
>> being Latin1. Occasionally these String values span multiple buffers. We 
>> often need to expose this data as String objects, in which case decoding the 
>> bytes and turning them into a String is often very performance sensitive - 
>> to the point of being one of our top bottlenecks for the given queries. Part 
>> of the story is that in the case of Latin1, I’ll know up front exactly how 
>> many bytes my string data takes up, though I might not know how many buffers 
>> are going to be involved.
>> 
>> As far as I can tell, this is currently not possible using public APIs. 
>> Using private APIs it may be possible, but will be relying on the JIT for 
>> vectorising the memory copying.
>> 
>> From an API standpoint, CharsetDecoder is close to home, but is not quite 
>> there. It’s stateful and not thread-safe, so I either have to allocate new 
>> ones every time or cache them in thread-locals. I’m also required to 
>> allocate the receiving CharBuffer. Since I may need to decode from multiple 
>> buffers, I realise that I might not be able to get away from allocating at 
>> least one extra object to keep track of intermediate decoding state. The 
>> CharsetDecoder does not have a method where I can specify the offset and 
>> length for the desired part of the ByteBuffer I want to decode, which forces 
>> me to allocate views instead.
>> 
>> The CharBuffers are allocated with a length up front, which is nice, but I 
>> can’t restrict its encoding so it has to allocate a char array instead of 
>> the byte array that I really want. Even if it did allocate a byte array, the 
>> CharBuffer is mutable, which would force String to do a defensive copy anyway.
>> 
>> One way I imagine this could be solved would be with a less dynamic kind of 
>> decoder, where the target length is given upfront to the decoder. Buffers 
>> are then consumed one by one, and a terminal method performs finishing 
>> sanity checks (did we get all the bytes we were promised?) and returns the 
>> result.
>> 
>> StringDecoder decoder = Charset.forName("latin1")
>>     .newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
>> String result = decoder.decode(buf1, off1, len1)
>>     .decode(buf2, off2, len2)
>>     .done();
>> 
>> This will in principle allow the string decoding to be 2 small allocations, 
>> an array allocation without zeroing, and a sequence of potentially 
>> vectorised memcpys. I don’t see any potentially troubling interactions with 
>> fused Strings either, since all the knowledge (except for the string data 
>> itself) needed to allocate the String objects is available from the get-go.
>> 
>> What do you guys think?
>> 
>> Btw, Richard Warburton has already done some work in this area, and made a 
>> patch that adds a constructor to String that takes a buffer, offset, length, 
>> and charset. This work now at least needs rebasing: 
>> http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
>> It doesn’t solve the case where multiple buffers are used to build the 
>> string, but does remove the need for a separate intermediate state-holding 
>> object when a single buffer is enough. It’d be a nice addition if possible, 
>> but I (for one) can tolerate a small object allocation otherwise.
>> 
>> Cheers,
>> Chris
>> 
>
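
On the "splitting byte/char" issue raised in the quoted messages: with the existing CharsetDecoder API a gathering decode has to carry the undecoded tail of one buffer over into the next. The following is only a sketch of that bookkeeping; GatheringDecoder is a hypothetical name, the 16-byte carry size is an assumption about the longest byte sequence a charset will leave pending, and the intermediate CharBuffer is exactly the allocation the proposals above try to avoid:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

final class GatheringDecoder {
    /** Hypothetical helper: decodes the remaining bytes of srcs as one sequence. */
    static String decode(Charset cs, ByteBuffer... srcs) throws CharacterCodingException {
        CharsetDecoder dec = cs.newDecoder(); // reports malformed/unmappable input
        long total = 0;
        for (ByteBuffer b : srcs) {
            total += b.remaining();
        }
        // Sized so that the decoder can never overflow the output.
        CharBuffer out = CharBuffer.allocate((int) Math.ceil(total * (double) dec.maxCharsPerByte()));
        ByteBuffer carry = ByteBuffer.allocate(16); // assumed upper bound on a pending sequence

        for (ByteBuffer src : srcs) {
            ByteBuffer in = src.duplicate(); // leave the caller's position/limit alone
            // Complete a sequence that was split across the previous boundary,
            // feeding one byte at a time until the decoder consumes it.
            while (carry.position() > 0 && in.hasRemaining()) {
                carry.put(in.get());
                carry.flip();
                check(dec.decode(carry, out, false));
                carry.compact();
            }
            check(dec.decode(in, out, false));
            carry.put(in); // undecoded tail: the start of a split sequence, if any
        }
        carry.flip();
        check(dec.decode(carry, out, true)); // a dangling partial sequence is malformed here
        check(dec.flush(out));
        out.flip();
        return out.toString();
    }

    private static void check(CoderResult r) throws CharacterCodingException {
        if (r.isError()) {
            r.throwException();
        }
    }
}
```

For a single-byte charset such as Latin1 the carry buffer is never used, which is why that case reduces to plain copies.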
