On Jun 3, 2015, at 9:19 PM, Xueming Shen <xueming.s...@oracle.com> wrote:
> On 06/03/2015 08:53 AM, Paul Sandoz wrote:
>> Hi,
>>
>> Please review an optimization for Files.lines for certain charsets:
>>
>> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/
>>
>> If a charset is say US-ASCII or UTF-8 it is possible to implement an
>> efficient splitting Spliterator that scans bytes from a mid-point to search
>> for line feed characters.
>>
>> Splitting uses a mapped byte buffer. Traversal uses FileChannel.reads at an
>> offset. In previous incarnations i tried to use mapped byte buffer for both,
>> but for some reason the traversal performance was not good (both on Mac and
>> x86). In any case i am happy with the current approach as there is minimal
>> layering between the FileChannel and BufferedReader leveraged to read the
>> lines.
>>
>> Sequential performance is similar (same or better) than the current
>> approach. Parallel performance is much better than the current approach.
>>
>> Some advice on two aspects would be most appreciated:
>>
>> 1) Is there an easy way to determine the sub-set of supported charsets that
>> are applicable?
>>
>
> It's easy though a little heavy :-)

Thanks, that is a little heavy, but i suppose computed values for charsets
could be stashed in a static CHM.

Paul.

> getLFCR returns a byte[] for the "byte" form of
> the \n and \r in a particular encodings, if each one of them can be mapped
> into one byte. Then we can use b[0] for \n and b[1] for \r in trySplit().
> This makes the new fast version work for most of charsets.
>
> private static byte[] getLFCR(Charset cs) {
>     try {
>         if (cs.canEncode()) {
>             ByteBuffer bb = cs.newEncoder()
>                 .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
>             if (bb.remaining() == 2) {
>                 CharBuffer cb = cs.newDecoder().decode(bb);
>                 if (cb.remaining() == 2 &&
>                     cb.get() == '\n' && cb.get() == '\r') {
>                     bb.flip();
>                     byte[] ba = new byte[2];
>                     bb.get(ba);
>                     return ba;
>                 }
>             }
>         }
>     } catch (Exception x) {}
>     return null;
> }
>
> -sherman
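
For illustration, a minimal sketch of the "static CHM" idea from the reply: the per-charset result of a getLFCR-style probe is computed once and cached in a ConcurrentHashMap. The class name LineFeedBytes, the method names lookup and probe, and the UNSUPPORTED sentinel are hypothetical and not taken from the webrev. The probe follows the quoted getLFCR logic, except that it copies the bytes out before decoding. A sentinel is used because ConcurrentHashMap cannot store null values, so a negative result would otherwise be recomputed on every call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of the reviewed patch.
class LineFeedBytes {
    // Sentinel for "charset not usable"; CHM does not permit null values.
    private static final byte[] UNSUPPORTED = new byte[0];

    private static final ConcurrentMap<Charset, byte[]> CACHE =
        new ConcurrentHashMap<>();

    // Returns { byte for '\n', byte for '\r' }, or null if the charset
    // does not map each of them to a single byte.
    static byte[] lookup(Charset cs) {
        byte[] b = CACHE.computeIfAbsent(cs, LineFeedBytes::probe);
        // Hand out a copy so callers cannot modify the cached array.
        return b == UNSUPPORTED ? null : b.clone();
    }

    // Same check as the quoted getLFCR: encode "\n\r", require exactly
    // two bytes, and require that they decode back to "\n\r".
    private static byte[] probe(Charset cs) {
        try {
            if (!cs.canEncode()) {
                return UNSUPPORTED;
            }
            ByteBuffer bb = cs.newEncoder()
                .encode(CharBuffer.wrap(new char[] { '\n', '\r' }));
            if (bb.remaining() != 2) {
                return UNSUPPORTED;
            }
            byte[] ba = new byte[2];
            bb.duplicate().get(ba);   // copy without moving bb's position
            CharBuffer cb = cs.newDecoder().decode(bb);
            if (cb.remaining() == 2 &&
                cb.get() == '\n' && cb.get() == '\r') {
                return ba;
            }
        } catch (Exception x) {
            // Treat any encoding or decoding failure as "not supported".
        }
        return UNSUPPORTED;
    }
}
```

With this sketch, LineFeedBytes.lookup(StandardCharsets.UTF_8) would yield the bytes 0x0A and 0x0D, while UTF-16 yields null because its encoder emits more than two bytes for the two characters.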