[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915797#action_12915797 ]
Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

{quote}Hmm so we also copy-on-write a given byte[] block? Is this because JMM can't make the guarantees we need about other threads reading the bytes written?{quote}

Correct. The example of where everything could go wrong is the rewriting of a byte slice forwarding address while a reader is traversing the same slice. The forwarding address could be half-written, and suddenly we're bowling in lane 6 when we should be in lane 9. By making a [read-only] ref copy of the byte[]s we're ensuring that the byte[]s are in a consistent state while being read.

So I'm using a boolean[] to tell the writer whether it needs to make a copy of a given byte[]; the same boolean[] also tells the writer whether it has already made that copy. By contrast, in IndexReader.clone we keep ref counts on the norms byte[], decrementing each time a copy is released until the count reaches 0, at which point we give the byte[] to the GC (here we'd do the same, or hand it back to the allocator).

{quote}But even if we do reuse, we will cause tons of garbage, until the still-open readers are closed? Ie we cannot re-use the byte[] being "held open" by any NRT reader that's still referencing the in-RAM segment after that segment had been flushed to disk.{quote}

If we do pool, it won't be very difficult to implement: we have a single point of check-in/check-out of the byte[]s in the allocator class. For the first implementation, though, we should by all means minimize "tricky" areas of the code by not implementing skip lists or byte[] pooling.

{quote}It's not like 3.x's situation with FieldCache or terms dict index, for example....{quote}

What's the GC issue with FieldCache and the terms dict?

{quote}BTW I'm assuming IW will now be modal? Ie caller must tell IW up front if NRT readers will be used? Because non-NRT users shouldn't have to pay all this added RAM cost?{quote}

At present it's still all on demand.
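To make the boolean[] idea concrete, here's a minimal sketch (class and member names are my own invention, not from the attached patches): the writer clones a byte[] block the first time it dirties it after a reader snapshot, so readers holding the older refs always see internally consistent blocks.

```java
// Hypothetical sketch of the copy-on-write block pool discussed above.
class CopyOnWriteBlockPool {
  byte[][] blocks;    // shared with point-in-time readers
  boolean[] copied;   // has block i been copied since the last snapshot?

  CopyOnWriteBlockPool(int numBlocks, int blockSize) {
    blocks = new byte[numBlocks][blockSize];
    copied = new boolean[numBlocks];
  }

  // Writer path: copy the block once per snapshot interval, then mutate.
  void write(int block, int offset, byte b) {
    if (!copied[block]) {
      blocks[block] = blocks[block].clone(); // readers keep the old array
      copied[block] = true;
    }
    blocks[block][offset] = b;
  }

  // Reader path: a point-in-time view is a shallow [read-only] ref copy
  // of blocks[]; resetting the flags forces the next write to any block
  // to clone it again, leaving this view untouched.
  byte[][] snapshot() {
    byte[][] view = blocks.clone();
    java.util.Arrays.fill(copied, false);
    return view;
  }
}
```

Blocks that were never written between two snapshots are shared by ref, which is where the garbage savings over cloning everything come from.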
Skip lists will require going modal because we need to build them upfront (well, we could go back and build them on demand; that'd be fun). There's also the term-freq parallel array; however, if getReader is never called, it's a single additional array that's essentially innocuous, if useful.

{quote}Hmm you're right that each reader needs a private copy, to remain truly "point in time". This (4 bytes per unique term X number of readers reading that term) is a non-trivial addition of RAM.{quote}

PagedInts time? However, even that's not going to help much if, in between getReader calls, 10,000s of terms were seen: we could have updated 1000s of pages. An AtomicIntegerArray does not help because concurrency isn't the issue; it's point-in-timeness that's required. Still, I guess PagedInts won't hurt, and in the case of minimal term-freq changes we'd still potentially be saving RAM. Is there some other data structure we could pull out of a hat and use?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
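The paged point-in-time idea for the term-freq parallel array discussed in the comment above can be sketched as follows (all names here are assumptions for illustration, not the actual PagedInts implementation): split the int array into fixed-size pages and have each getReader snapshot copy only the pages dirtied since the previous snapshot, so unchanged pages stay shared across readers.

```java
// Hypothetical sketch: per-page copy-on-write over an int array, so a
// reader's snapshot only costs RAM for pages written since the last one.
class PagedIntSnapshot {
  static final int PAGE_SIZE = 1024;
  int[][] pages;
  boolean[] dirty;   // has page i been cloned since the last snapshot?

  PagedIntSnapshot(int numPages) {
    pages = new int[numPages][PAGE_SIZE];
    dirty = new boolean[numPages];
  }

  // Writer path: clone a page the first time it's dirtied per interval.
  void set(int index, int value) {
    int page = index / PAGE_SIZE;
    if (!dirty[page]) {
      pages[page] = pages[page].clone(); // older snapshots keep the old page
      dirty[page] = true;
    }
    pages[page][index % PAGE_SIZE] = value;
  }

  // getReader path: shallow ref copy of the page table; reset the flags
  // so the next write to any page clones it, preserving this view.
  int[][] snapshot() {
    int[][] view = pages.clone();
    java.util.Arrays.fill(dirty, false);
    return view;
  }
}
```

As noted above, this degrades toward a full copy when 1000s of pages are dirtied between getReader calls, but in the minimal-change case most pages are shared by ref.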