Hi,

A more functional approach would be to compose a sequence of buffers into one view, perhaps read-only. Then there would be no need to accept arrays of buffers, and that should work well for bulk operations. It’s a non-trivial but not very difficult amount of work, and possibly simpler if restricted to read-only views.
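To make the "one view over a sequence of buffers" idea concrete, here is a minimal sketch using only public APIs. CompositeBytes and its methods are hypothetical names for illustration, not an existing or proposed JDK class, and the sketch assumes the combined length fits in an int.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical read-only view over a sequence of buffers; not a JDK class. */
final class CompositeBytes {
    private final ByteBuffer[] parts;
    private final int length;

    CompositeBytes(ByteBuffer... srcs) {
        parts = new ByteBuffer[srcs.length];
        long total = 0;
        for (int i = 0; i < srcs.length; i++) {
            // Read-only views share content but have their own position/limit,
            // so the caller's buffers are never perturbed.
            parts[i] = srcs[i].asReadOnlyBuffer();
            total += parts[i].remaining();
        }
        length = Math.toIntExact(total);
    }

    int length() {
        return length;
    }

    /** Copies the whole view into a fresh array with one bulk get per part. */
    byte[] toByteArray() {
        byte[] dst = new byte[length];
        int off = 0;
        for (ByteBuffer part : parts) {
            ByteBuffer d = part.duplicate(); // keep the stored part's position intact
            int n = d.remaining();
            d.get(dst, off, n);
            off += n;
        }
        return dst;
    }

    /** Decodes the view as ISO-8859-1 (Latin1). */
    String toLatin1String() {
        return new String(toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

Note that with public APIs this still copies twice (once into the temporary array, once inside the String constructor); avoiding that second copy is exactly what a constructor inside java.lang could do.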
Thus I think we should focus Richard’s work on:

    String(ByteBuffer src, String charset)

and perhaps a sub-range variant, if perturbing the position/limit of an existing buffer and/or slicing is too problematic.

Zeroing memory, and possibly avoiding it, can be tricky. Any such optimisations have to be performed carefully, otherwise uninitialised regions might leak and be accessed, nefariously or otherwise. I imagine it’s easier to contain/control within a constructor than, say, a builder.

Paul.

> On 10 Feb 2016, at 05:38, Xueming Shen <xueming.s...@oracle.com> wrote:
>
> Hi Chris,
>
> I think basically you are asking for a String constructor that takes a
> ByteBuffer. StringCoding can then take advantage of the current
> CompactString design to optimize the decoding operation down to a single
> byte[]/vectorized memory copy from the ByteBuffer to the String's
> internal byte[], WHEN the charset is 8859-1.
>
>     String(ByteBuffer src, String charset);
>
> Further, we will need a "buffer gathering" style constructor
>
>     String(ByteBuffer[] srcs, String charset);
>     (or more generally, String(ByteBuffer[] srcs, int off, int len, String charset))
>
> to create a String object from a sequence of ByteBuffers, if it's really
> desired.
>
> I would also assume it will be desired to extend the current
> CharsetDecoder/Encoder classes to add a pair of "gathering" style
> coding methods
>
>     CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
>     ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
>
> Though the implementation might have to deal with the tricky "splitting
> byte/char" issue, in which part of the "byte/char sequence" is in the
> previous buffer and the continuing bytes/chars are in the next buffer ...
>
> -Sherman
>
>
> On 2/9/16 7:20 AM, Chris Vest wrote:
>> Hi,
>>
>> Aleksey Shipilev did a talk on his journey to implement compact strings
>> and indified string concat at the JVM Tech Summit yesterday, and this
>> reminded me that we (Neo4j) have a need for turning segments of
>> DirectByteBuffers into Strings as fast as possible. If we already store
>> the string data in Latin1, which is one of the two special encodings for
>> compact strings, we’d ideally like to produce the String object with just
>> the two necessary object allocations and a single, vectorised memory copy.
>>
>> Our use case is that we are a database and we do our own file paging,
>> effectively having file data in a large set of DirectByteBuffers. We have
>> string data in our files in a number of different encodings, a popular one
>> being Latin1. Occasionally these String values span multiple buffers. We
>> often need to expose this data as String objects, in which case decoding
>> the bytes and turning them into a String is often very performance
>> sensitive - to the point of being one of our top bottlenecks for the given
>> queries. Part of the story is that in the case of Latin1, I’ll know up
>> front exactly how many bytes my string data takes up, though I might not
>> know how many buffers are going to be involved.
>>
>> As far as I can tell, this is currently not possible using public APIs.
>> Using private APIs it may be possible, but will be relying on the JIT for
>> vectorising the memory copying.
>>
>> From an API standpoint, CharsetDecoder is close to home, but is not quite
>> there. It’s stateful and not thread-safe, so I either have to allocate new
>> ones every time or cache them in thread-locals. I’m also required to
>> allocate the receiving CharBuffer. Since I may need to decode from
>> multiple buffers, I realise that I might not be able to get away from
>> allocating at least one extra object to keep track of intermediate
>> decoding state.
>> The CharsetDecoder does not have a method where I can specify the offset
>> and length for the desired part of the ByteBuffer I want to decode, which
>> forces me to allocate views instead.
>>
>> The CharBuffers are allocated with a length up front, which is nice, but I
>> can’t restrict its encoding so it has to allocate a char array instead of
>> the byte array that I really want. Even if it did allocate a byte array,
>> the CharBuffer is mutable, which would force String to do a defensive copy
>> anyway.
>>
>> One way I imagine this could be solved would be with a less dynamic kind
>> of decoder, where the target length is given upfront to the decoder.
>> Buffers are then consumed one by one, and a terminal method performs
>> finishing sanity checks (did we get all the bytes we were promised?) and
>> returns the result.
>>
>>     StringDecoder decoder =
>>         Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
>>     String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
>>
>> This will in principle allow the string decoding to be 2 small
>> allocations, an array allocation without zeroing, and a sequence of
>> potentially vectorised memcpys. I don’t see any potentially troubling
>> interactions with fused Strings either, since all the knowledge (except
>> for the string data itself) needed to allocate the String objects is
>> available from the get-go.
>>
>> What do you guys think?
>>
>> Btw, Richard Warburton has already done some work in this area, and made a
>> patch that adds a constructor to String that takes a buffer, offset,
>> length, and charset. This work now at least needs rebasing:
>> http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
>> It doesn’t solve the case where multiple buffers are used to build the
>> string, but does remove the need for a separate intermediate state-holding
>> object when a single buffer is enough.
>> It’d be a nice addition if possible, but I (for one) can tolerate a small
>> object allocation otherwise.
>>
>> Cheers,
>> Chris
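
For reference, the "splitting byte/char" issue Sherman mentions above can be handled today with the existing three-argument CharsetDecoder.decode by carrying the undecoded tail of one buffer into the next. The sketch below is only an illustration under stated assumptions: GatheringDecoder is a hypothetical helper, not a JDK API, and the 16-byte carry buffer is an assumed upper bound on the length of an incomplete character.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

/** Hypothetical helper, not a JDK API: decodes several buffers as one byte sequence. */
final class GatheringDecoder {
    static String decode(Charset charset, ByteBuffer... srcs) {
        CharsetDecoder dec = charset.newDecoder(); // default action: report errors
        long total = 0;
        for (ByteBuffer src : srcs) {
            total += src.remaining();
        }
        CharBuffer out = CharBuffer.allocate(
                Math.toIntExact((long) Math.ceil(total * (double) dec.maxCharsPerByte())));
        // Holds the bytes of a character that straddles a buffer boundary.
        // Assumption: 16 bytes is enough for any incomplete character.
        ByteBuffer carry = ByteBuffer.allocate(16);
        try {
            for (ByteBuffer src : srcs) {
                ByteBuffer in = src.duplicate(); // leave the caller's position alone
                // Complete any straddling character, feeding one byte at a time.
                while (carry.position() > 0 && in.hasRemaining()) {
                    carry.put(in.get());
                    carry.flip();
                    check(dec.decode(carry, out, false));
                    carry.compact();
                }
                check(dec.decode(in, out, false));
                carry.put(in); // undecoded tail: an incomplete character, if any
            }
            carry.flip();
            check(dec.decode(carry, out, true)); // a dangling partial character fails here
            check(dec.flush(out));
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        out.flip();
        return out.toString();
    }

    private static void check(CoderResult result) throws CharacterCodingException {
        if (result.isError() || result.isOverflow()) {
            result.throwException();
        }
    }
}
```

This still allocates a decoder, a CharBuffer and a carry buffer per call, so it illustrates the bookkeeping a gathering decode method would need rather than the allocation profile Chris is asking for.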