On 06/03/2015 08:53 AM, Paul Sandoz wrote:
Hi,

Please review an optimization for Files.lines for certain charsets:

   http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/

If a charset is say US-ASCII or UTF-8 it is possible to implement an efficient 
splitting Spliterator that scans bytes from a mid-point to search for line feed 
characters.

Splitting uses a mapped byte buffer. Traversal uses FileChannel.reads at an
offset. In previous incarnations I tried to use a mapped byte buffer for both,
but for some reason the traversal performance was not good (both on Mac and
x86). In any case I am happy with the current approach as there is minimal
layering between the FileChannel and BufferedReader leveraged to read the lines.

Sequential performance is similar to (the same as or better than) the current
approach. Parallel performance is much better than the current approach.

Some advice on two aspects would be most appreciated:

1) Is there an easy way to determine the sub-set of supported charsets that are 
applicable?


It's easy, though a little heavy :-) getLFCR returns a byte[] holding the
"byte" form of \n and \r in a particular encoding, provided each of them
maps to a single byte. We can then use b[0] for \n and b[1] for \r in
trySplit(). This makes the new fast version work for most charsets.

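    // needs java.nio.ByteBuffer, java.nio.CharBuffer, java.nio.charset.Charset
    // Returns { LF byte, CR byte } if "\n\r" encodes to two bytes in total
    // that decode back to the same two chars; otherwise returns null.
    private static byte[] getLFCR(Charset cs) {
        try {
            if (cs.canEncode()) {
                ByteBuffer bb = cs.newEncoder()
                                  .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
                if (bb.remaining() == 2) {
                    CharBuffer cb = cs.newDecoder().decode(bb);
                    if (cb.remaining() == 2 &&
                        cb.get() == '\n' && cb.get() == '\r') {
                        bb.rewind();    // decode() consumed bb; read it again from 0
                        byte[] ba = new byte[2];
                        bb.get(ba);
                        return ba;
                    }
                }
            }
        } catch (Exception x) {
            // any encode/decode failure means the charset is not eligible
        }
        return null;
    }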
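
To make the use of b[0] in trySplit() concrete, here is a minimal sketch of
how a split point might be located once the LF byte is known. It is not taken
from the webrev: the class name SplitPointSketch, the method findSplit and its
parameters are hypothetical, and it only scans forward from the mid-point over
a plain ByteBuffer rather than a mapped file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class SplitPointSketch {

    // Returns the index just past the first LF at or after the mid-point of
    // [lo, hi), or -1 if no such split would leave both halves non-empty.
    static int findSplit(ByteBuffer buf, int lo, int hi, byte lf) {
        int mid = (lo + hi) >>> 1;
        // Stop before hi - 1: an LF in the last position would give an
        // empty right-hand half.
        for (int i = mid; i < hi - 1; i++) {
            if (buf.get(i) == lf) {     // absolute get, position is untouched
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.wrap(data);
        // In the real code the LF byte would come from getLFCR(cs)[0].
        int split = findSplit(buf, 0, data.length, (byte) '\n');
        System.out.println(split);      // index where the right half starts
    }
}
```

Splitting just after the LF also keeps a "\r\n" pair together in the left
half, so the CR byte only matters if a lone \r has to count as a line
terminator as well.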

-sherman
