On 03/06/2015 16:53, Paul Sandoz wrote:
Hi,
Please review an optimization for Files.lines for certain charsets:
http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8072773-File-lines/webrev/
If a charset is, say, US-ASCII or UTF-8, it is possible to implement an
efficient splitting Spliterator that scans bytes from a mid-point to search for
line feed characters.
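
A minimal sketch of the mid-point scan described above, not the code in the webrev. The class and method names (LineSplitSketch, findSplit) are hypothetical. It relies on the fact that in US-ASCII and UTF-8 the byte 0x0A only ever encodes a line feed, so a raw byte scan cannot land inside a multi-byte character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LineSplitSketch {
    // Hypothetical helper: returns the index just past the first '\n' found at
    // or after the mid-point of [lo, hi), or -1 if no such index would leave
    // both halves non-empty.
    static int findSplit(ByteBuffer buf, int lo, int hi) {
        int mid = (lo + hi) >>> 1;
        for (int i = mid; i < hi; i++) {
            // Absolute get: does not disturb the buffer's position.
            if (buf.get(i) == '\n') {
                int split = i + 1;
                return split < hi ? split : -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] bytes = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        // The prefix [0, split) and suffix [split, limit) both start on a line boundary.
        System.out.println(findSplit(buf, 0, buf.limit())); // prints 17
    }
}
```

A splitting Spliterator could hand [lo, split) to a new prefix spliterator and keep [split, hi) for itself; a return of -1 would mean the range is not worth splitting further.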
Splitting uses a mapped byte buffer. Traversal uses FileChannel reads at an
offset. In previous incarnations I tried to use a mapped byte buffer for both,
but for some reason the traversal performance was not good (both on Mac and
x86). In any case I am happy with the current approach, as there is minimal
layering between the FileChannel and the BufferedReader used to read the lines.
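
One way the traversal side might look, as a sketch only: a thin ReadableByteChannel that performs positional FileChannel reads over a byte range and is handed to a BufferedReader. RangeLinesSketch and readLines are hypothetical names, and the webrev may structure this differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class RangeLinesSketch {
    // Hypothetical helper: reads the lines held in bytes [start, end) of fc.
    // Assumes start and end fall on line boundaries.
    static List<String> readLines(FileChannel fc, long start, long end, Charset cs)
            throws IOException {
        ReadableByteChannel range = new ReadableByteChannel() {
            long pos = start;
            boolean open = true;

            @Override
            public int read(ByteBuffer dst) throws IOException {
                if (pos >= end) return -1;           // end of the assigned range
                int oldLimit = dst.limit();
                long remaining = end - pos;
                if (dst.remaining() > remaining) {   // never read past 'end'
                    dst.limit(dst.position() + (int) remaining);
                }
                try {
                    // Positional read: leaves the channel's own position untouched.
                    int n = fc.read(dst, pos);
                    if (n > 0) pos += n;
                    return n;
                } finally {
                    dst.limit(oldLimit);
                }
            }

            @Override public boolean isOpen() { return open; }
            @Override public void close() { open = false; } // fc stays open
        };
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                Channels.newReader(range, cs.newDecoder(), -1))) {
            String line;
            while ((line = br.readLine()) != null) lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("lines", ".txt");
        try {
            Files.write(p, "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel fc = FileChannel.open(p, StandardOpenOption.READ)) {
                System.out.println(readLines(fc, 0, 17, StandardCharsets.US_ASCII));
                System.out.println(readLines(fc, 17, 23, StandardCharsets.US_ASCII));
            }
        } finally {
            Files.delete(p);
        }
    }
}
```

Because each range uses positional reads, several ranges of the same FileChannel can be traversed independently without coordinating on the channel's position.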
Sequential performance is the same as or better than the current approach.
Parallel performance is much better than the current approach.
Some advice on two aspects would be most appreciated:
1) Is there an easy way to determine the sub-set of supported charsets that are
applicable?
2) We should try to explicitly unmap the mapped byte buffer when the stream is
closed, using some sort of shared secret. How can I do that?
Since this code path is only for the default provider case, there's a
good chance that it will be a FileChannelImpl, in which case you can
call its unmap method (directly or via a shared secret). Of course, it is
possible to interpose on the default provider, so you can't be guaranteed
that it is a FileChannelImpl.
In passing, you might consider moving ByteBufferLinesSpliterator to its
own source file because Files is getting very big.
-Alan.