[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916086#action_12916086 ]
Michael McCandless commented on LUCENE-2575:
--------------------------------------------

{quote}
Correct. The example of where everything could go wrong is the rewriting of a byte slice forwarding address while a reader is traversing the same slice.
{quote}

Ahh right, that's a real issue.

{quote}
bq. It's not like 3.x's situation with FieldCache or terms dict index, for example....

What's the GC issue with FieldCache and terms dict?
{quote}

In 3.x, the string index FieldCache and the terms index generate tons of garbage, ie allocate zillions of tiny objects. (This is fixed in 4.0.) My only point was that having 32 KB arrays as garbage is much less GC load than having the same net KB spread across zillions of tiny objects...

{quote}
There's the term-freq parallel array, however if getReader is never called, it's a single additional array that's essentially innocuous, if useful.
{quote}

Hmm, the full copy of the tf parallel array is going to put a highish cost on reopen? So some sort of transactional (incremental copy-on-write) data structure is needed (eg PagedInts)... We don't store tf now, do we? Adding 4 bytes per unique term isn't innocuous!

{quote}
OK, I think there's a solution to copying the actual byte[]; we'd need to alter the behavior of BBPs. It would require always allocating 3 empty bytes at the end of a slice for the forwarding address.
{quote}

Good idea -- this'd make the byte[] truly write-once. This would really decrease RAM efficiency for low-doc-freq (eg 1) terms, though, because today they make use of those 3 bytes. We'd need to increase the level 0 slice size...

{quote}
The reason this would work is, past readers that are iterating their term docs concurrently with the change to the posting-upto array will stop at the maxdoc anyways. This'll be fun to implement.
{quote}

Hmm... but the reader still needs to read 'beyond' the end of a given slice? Ie say global maxDoc is 42, and a given posting just read doc 27 (which in fact is its last doc).
It would then try to read the next doc? Oh, except, the next byte would be a 0 (because we always clear the byte[]), which [I think] is never a valid byte value in the postings stream, except as a first byte, which we would not hit here (since we know we always have at least a first byte). So maybe we can get by w/o a full copy of postingUpto?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept; it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?
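The zero-byte end-detection idea in the comment above rests on a property of Lucene's vInt encoding: a multi-byte vInt never ends in a 0x00 byte (the encoding would simply have been shorter), so in a zero-cleared buffer a 0 lead byte can only be unwritten memory, provided every written value is at least 1. Here is a minimal standalone sketch of that trick; `VIntEndSentinel`, `writeVInt`, and `readAll` are made-up names, not Lucene classes, and the assumption that all values are >= 1 is stated in the comments.

```java
// Sketch, not Lucene code: vInt write/read into a zero-cleared buffer,
// treating a 0 lead byte as the end-of-data sentinel. This only works
// because (a) a multi-byte vInt never ends in 0x00, and (b) we assume
// every value written is >= 1, so a 0 lead byte can't be real data.
public class VIntEndSentinel {

  static int writeVInt(byte[] buf, int pos, int value) {
    while ((value & ~0x7F) != 0) {
      buf[pos++] = (byte) ((value & 0x7F) | 0x80); // continuation byte
      value >>>= 7;
    }
    buf[pos++] = (byte) value; // final byte, always nonzero for value >= 1
    return pos;
  }

  /** Reads values until the next lead byte is 0 (cleared, unwritten memory). */
  static java.util.List<Integer> readAll(byte[] buf) {
    java.util.List<Integer> out = new java.util.ArrayList<>();
    int pos = 0;
    while (buf[pos] != 0) {          // 0 lead byte => end of written data
      int b = buf[pos++];
      int value = b & 0x7F, shift = 7;
      while ((b & 0x80) != 0) {      // high bit set => more bytes follow
        b = buf[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      }
      out.add(value);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[64];       // zero-filled, like a cleared slice
    int pos = 0;
    for (int delta : new int[] {1, 27, 300}) {
      pos = writeVInt(buf, pos, delta);
    }
    System.out.println(readAll(buf)); // prints [1, 27, 300]
  }
}
```

The reader never consults a separate length or upto value; it just stops when a lead byte of 0 appears, which is the behavior the comment hopes would let readers avoid a full copy of postingUpto.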
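The "transactional (incremental copy-on-write) data structure (eg PagedInts)" floated earlier for the tf parallel array could look roughly like the sketch below. This is a hypothetical illustration, not Lucene's PagedInts: `CopyOnWritePagedInts`, its page size, and its method names are all invented here. The point it demonstrates is that a reader snapshot copies only the small page table, while the writer clones individual pages before mutating them, so reopen cost is proportional to the number of pages rather than the number of unique terms.

```java
// Hypothetical sketch of a paged copy-on-write int array, in the spirit of
// the PagedInts idea (not an actual Lucene class). A snapshot copies only
// the page-pointer table; page contents are shared until the writer
// mutates a page, at which point that one page is cloned.
public class CopyOnWritePagedInts {
  private static final int PAGE_SIZE = 1024;
  private int[][] pages;

  public CopyOnWritePagedInts() { this.pages = new int[0][]; }
  private CopyOnWritePagedInts(int[][] pages) { this.pages = pages; }

  public int get(int index) {
    return pages[index / PAGE_SIZE][index % PAGE_SIZE];
  }

  /** Writer-side set: clones the touched page so older snapshots stay stable. */
  public void set(int index, int value) {
    int p = index / PAGE_SIZE;
    if (p >= pages.length) {                       // grow the page table
      int[][] newPages = java.util.Arrays.copyOf(pages, p + 1);
      for (int i = pages.length; i <= p; i++) newPages[i] = new int[PAGE_SIZE];
      pages = newPages;
    } else {
      pages[p] = pages[p].clone();                 // copy-on-write the page
    }
    pages[p][index % PAGE_SIZE] = value;
  }

  /** Cheap reader snapshot: copies the page table, not the pages. */
  public CopyOnWritePagedInts snapshot() {
    return new CopyOnWritePagedInts(pages.clone());
  }
}
```

A production version would track which pages are already private to the writer to avoid re-cloning a page on every set; the sketch clones unconditionally for brevity.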