[ 
https://issues.apache.org/jira/browse/LUCENE-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547006#comment-16547006
 ] 

Dawid Weiss commented on LUCENE-8406:
-------------------------------------

Thanks for the comments. 

My initial experiment with implementing an alternative store for RAMDirectory 
was very local (outside the Solr/Lucene codebase) -- I modified RAMOutputStream 
and the corresponding input (in various ways) so that it assumed write-once 
mode (much like the index is written). Essentially, I don't allow concurrent 
readers of a RAMDirectory file -- once the file is written and flushed, it 
becomes available to readers, but not before then. 
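
To give a rough idea of what I mean, here is a stand-alone sketch (not the 
actual patch -- the class and method names below are made up and don't exist 
in Lucene): the writer fills fixed-size blocks, and only close() publishes an 
immutable view to readers, so the read path needs no locking at all.

{code}
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of a write-once, in-memory byte store; the class and
// method names are hypothetical and do not correspond to anything in Lucene.
final class WriteOnceByteStore {
  private static final int BLOCK_SIZE = 8192;

  private final List<byte[]> blocks = new ArrayList<>();
  private byte[] current = new byte[BLOCK_SIZE];
  private int upto;      // write position within the current block
  private long length;   // total number of bytes written
  private boolean closed;

  void writeByte(byte b) {
    if (closed) {
      throw new IllegalStateException("Already closed.");
    }
    if (upto == BLOCK_SIZE) {
      // The current block is full; start a new one.
      blocks.add(current);
      current = new byte[BLOCK_SIZE];
      upto = 0;
    }
    current[upto++] = b;
    length++;
  }

  // Freezes the contents. Only after this call may readers be created,
  // which is why the read path below needs no synchronization.
  Reader close() {
    if (!closed) {
      blocks.add(current);
      closed = true;
    }
    return new Reader(blocks.toArray(new byte[0][]), length);
  }

  // Immutable view over the frozen blocks; each reader has its own position.
  static final class Reader {
    private final byte[][] blocks;
    private final long length;
    private long pos;

    private Reader(byte[][] blocks, long length) {
      this.blocks = blocks;
      this.length = length;
    }

    byte readByte() {
      if (pos >= length) {
        throw new IndexOutOfBoundsException("Read past EOF.");
      }
      byte b = blocks[(int) (pos / BLOCK_SIZE)][(int) (pos % BLOCK_SIZE)];
      pos++;
      return b;
    }

    long length() {
      return length;
    }
  }
}
{code}

In the real directory the reader side would of course be an IndexInput, but 
the visibility rule is the same: nothing is readable until the file has been 
fully written and flushed.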

I then looked at making this modification in the codebase, and the current 
complexity of those classes (and the congestion on locking) is a result of how 
these classes are used in many other places (essentially as temporary 
buffers). This would have to be cleaned up first, I think, and there are 
comments in the code (by Mike) about the proliferation of "writeable byte 
pool" classes whose common functionality should perhaps be extracted and then 
reused. Namely: PagedBytes, ByteBlockPool, BytesStore, ByteArrayDataOutput, 
GrowableByteArrayDataOutput, RAMIndexInput/Output... and perhaps more. They're 
not always identical, but there's definitely a recurring pattern. 
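
To illustrate what I mean by a recurring pattern (sketch only -- this 
interface doesn't exist anywhere and the name and methods are made up), most 
of those classes reduce to some flavor of an append-only, block-allocated 
byte sink:

{code}
// Hypothetical common denominator of the "writeable byte pool" classes
// listed above. Nothing like this interface exists in Lucene today.
interface BlockedByteSink {
  // Appends a single byte.
  void writeByte(byte b);

  // Appends a slice of the given array.
  void writeBytes(byte[] src, int offset, int len);

  // Total number of bytes written so far.
  long size();

  // Copies bytes [offset, offset + len) into dst; only meaningful once
  // writing has finished (the write-once assumption from above).
  void readBytes(long offset, byte[] dst, int dstOffset, int len);
}
{code}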

I'll try to chip away at all this slowly, time permitting.

The initial look at the code also made me realize that if we make BBGuard, 
buffer management and the cleaner interface public but do *not* make the 
buffer's native cleaner available, we will force anyone downstream to reinvent 
the wheel Uwe has been so patient to figure out (the different ways to clean 
up buffers in different JVM versions). If we do make MMapDirectory's cleaner 
accessible, we open up a lot of the internal workings of how Lucene handles 
this stuff, which isn't good either, so I'm at a crossroads here.

Perhaps I can start small by trying to clean up those RAMDirectory streams. 
It'd be very tempting (for me, personally) to have a RAMDirectory that could 
allocate larger block chunks outside of the heap (in direct memory pools) -- if 
I can find a way to do it cleanly within the class, then perhaps package-private 
scope for everything else is fine, and one still has the flexibility of working 
with native buffers without caring about the details. 
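
The gist would be something like the allocator below (again, just a sketch 
with made-up names; error handling is omitted, and deterministic freeing of 
the direct buffers is exactly where the cleaner discussion above comes back 
in):

{code}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical block allocator for an off-heap RAMDirectory variant; the
// class name is made up and does not exist in Lucene.
final class DirectBlockAllocator {
  private final int blockSize;
  private final List<ByteBuffer> allocated = new ArrayList<>();

  DirectBlockAllocator(int blockSize) {
    this.blockSize = blockSize;
  }

  // Each file block lives in direct (off-heap) memory instead of a byte[].
  ByteBuffer newBlock() {
    ByteBuffer block = ByteBuffer.allocateDirect(blockSize);
    allocated.add(block);
    return block;
  }

  // Dropping the references leaves the native memory to the GC; freeing it
  // deterministically is where the (internal) cleaner would be needed.
  void release() {
    allocated.clear();
  }
}
{code}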

I'll try to take a look at all this, although the number of references to 
RAMDirectory from pretty much everywhere is sort of scary.





> Make ByteBufferIndexInput public
> --------------------------------
>
>                 Key: LUCENE-8406
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8406
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 6.7
>
>
> The logic of handling byte buffer splits, their proper closing (cleaner) and 
> all the trickery involved in slicing, cloning and proper exception handling 
> is quite daunting. 
> While ByteBufferIndexInput.newInstance(..) is public, the parent class 
> ByteBufferIndexInput is not. I think we should make the parent class public 
> to allow advanced users to make use of this (complex) piece of code to create 
> IndexInput based on a sequence of ByteBuffers.
> One particular example here is RAMDirectory, which currently uses a custom 
> IndexInput implementation, which in turn reaches to RAMFile's synchronized 
> methods. This is the cause of quite dramatic congestion on multithreaded 
> systems. While we clearly discourage RAMDirectory from being used in 
> production environments, there really is no need for it to be slow. If 
> modified only slightly (to use ByteBuffer-based input), the performance is on 
> par with FSDirectory. Here's a sample log comparing FSDirectory with 
> RAMDirectory and the "modified" RAMDirectory making use of the ByteBuffer 
> input:
> {code}
> 14:26:40 INFO  console: FSDirectory index.
> 14:26:41 INFO  console: Opened with 299943 documents.
> 14:26:50 INFO  console: Finished: 8.820 s, 240000 matches.
> 14:26:50 INFO  console: RAMDirectory index.
> 14:26:50 INFO  console: Opened with 299943 documents.
> 14:28:50 INFO  console: Finished: 2.012 min, 240000 matches.
> 14:28:50 INFO  console: RAMDirectory2 index (wrapped byte[] buffers).
> 14:28:50 INFO  console: Opened with 299943 documents.
> 14:29:00 INFO  console: Finished: 9.215 s, 240000 matches.
> 14:29:00 INFO  console: RAMDirectory2 index (direct memory buffers).
> 14:29:00 INFO  console: Opened with 299943 documents.
> 14:29:08 INFO  console: Finished: 8.817 s, 240000 matches.
> {code}
> Note the performance difference is an order of magnitude on this 32-CPU 
> system (2 minutes vs. 9 seconds). The tiny performance difference between the 
> implementation based on direct memory buffers vs. those acquired via 
> ByteBuffer.wrap(byte[]) is due to the fact that direct buffers access their 
> data via Unsafe and the wrapped counterpart uses regular Java array access 
> (my best guess).


