[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986630#action_12986630 ]

Michael McCandless commented on LUCENE-1574:

I've been testing on a 25M doc index (all of en Wikipedia, at least as of March 2010).

Yes, I think likely alloc of big BitVector, System.arraycopy, destroying it, may be a fairly low cost compared to lucene resolving the deleted term, indexing the doc, flushing the tiny segment, etc.

PooledSegmentReader, pools SegmentReader underlying byte arrays
---

Key: LUCENE-1574
URL: https://issues.apache.org/jira/browse/LUCENE-1574
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-1574.patch
Original Estimate: 168h
Remaining Estimate: 168h

PooledSegmentReader pools the underlying byte arrays of deleted docs and norms for realtime search. It is designed for use with IndexReader.clone which can create many copies of byte arrays, which are of the same length for a given segment. When pooled they can be reused which could save on memory.

Do we want to benchmark the memory usage comparison of PooledSegmentReader vs GC? Many times GC is enough for these smaller objects.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986637#action_12986637 ]

Jason Rutherglen commented on LUCENE-1574:
--

bq. I think likely alloc of big BitVector, System.arraycopy, destroying it, may be a fairly low cost compared to lucene resolving the deleted term, indexing the doc, flushing the tiny segment, etc.

Right, and if we pool the byte[]s we'd take the cost of instantiating and GC'ing out of the picture in the high NRT throughput case. It's counterintuitive and will require testing.
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12985877#action_12985877 ]

Michael McCandless commented on LUCENE-1574:

bq. I'm curious how we plan on handling this case?

I think we should keep the replay log smallish, or, expire it aggressively with age.

I suspect this opto is only going to be worth it for very frequent reopens... but I'm not sure yet.
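One way to read "keep the replay log smallish" is a fixed-capacity log of deleted docIDs that gives up once it overflows, so the caller falls back to copying the whole bit vector. The sketch below is only an illustration of that reading: the class `DeleteReplayLog`, its methods, and the overflow behaviour are hypothetical and do not come from the patch or from Lucene.

```java
import java.util.BitSet;

/** Hypothetical bounded log of docIDs deleted since the last reopen. */
final class DeleteReplayLog {
  private final int[] docIDs;   // fixed capacity keeps RAM use bounded
  private int count;
  private boolean overflowed;   // true once more deletes arrived than fit

  DeleteReplayLog(int maxEntries) {
    docIDs = new int[maxEntries];
  }

  /** Records one deleted docID, or marks the log as overflowed when full. */
  void record(int docID) {
    if (overflowed) {
      return;
    }
    if (count == docIDs.length) {
      overflowed = true;        // too many ints: stop tracking individual deletes
      count = 0;
      return;
    }
    docIDs[count++] = docID;
  }

  /**
   * Applies the logged deletes to target. Returns false when the log
   * overflowed, in which case the caller must copy the full bit vector.
   */
  boolean replayInto(BitSet target) {
    if (overflowed) {
      return false;
    }
    for (int i = 0; i < count; i++) {
      target.set(docIDs[i]);
    }
    return true;
  }

  /** Clears the log after a reopen. */
  void reset() {
    count = 0;
    overflowed = false;
  }
}
```

A delete by term or query that hits many documents simply overflows the log, which sidesteps the problem of not knowing the hit count ahead of time.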
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12985975#action_12985975 ]

Jason Rutherglen commented on LUCENE-1574:
--

What size segments is the benchmark deleting against? Maybe we're underestimating the speed of arraycopy, eg, it's really a hardware operation that could be optimized?
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12985408#action_12985408 ]

Jason Rutherglen commented on LUCENE-1574:
--

We want to record the deletes between getReader calls; however, there's no way to know ahead of time whether a term or query is going to hit many documents, meaning we can't always save del docids, because we'd have too many ints in RAM. I'm curious how we plan on handling this case?
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12985091#action_12985091 ]

Michael McCandless commented on LUCENE-1574:

I'm working on an initial patch for this...

bq. I think the only open question is how we'll shrink the pool, most likely there'd be an expiration on the pooled objects.

I think we can simply have a max number of pooled free bit vectors... or we may want to expire by time/staleness as well.

bq. With RT, the parallel arrays will grow, so the pool will need to be size based, eg, when the arrays are grown, all of the previous arrays may be forcefully evicted, or they may simply expire.

True... but, like the other per-doc arrays, the BV can be overallocated (ArrayUtil.oversize) to accommodate further added docs.
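The over-allocation idea can be sketched without Lucene on the classpath. In the sketch below, `GrowableBits` and its `oversize` helper are hypothetical stand-ins; the headroom rule (one eighth extra, at least 3 bytes) is an assumption made for illustration, not a statement of what ArrayUtil.oversize does.

```java
import java.util.Arrays;

/** Hypothetical bit vector that over-allocates so added docs rarely force a realloc. */
final class GrowableBits {
  private byte[] bits;
  private int numDocs;

  GrowableBits(int initialDocs) {
    bits = new byte[bytesFor(initialDocs)];
    numDocs = initialDocs;
  }

  /** Bytes needed to hold one bit per doc. */
  static int bytesFor(int docs) {
    return (docs + 7) >>> 3;
  }

  /** Assumed growth policy: requested size plus 1/8 headroom, at least 3 bytes. */
  static int oversize(int minBytes) {
    return minBytes + Math.max(3, minBytes >> 3);
  }

  /** Makes room for newNumDocs docs; returns true only if the array was reallocated. */
  boolean ensureDocs(int newNumDocs) {
    numDocs = Math.max(numDocs, newNumDocs);
    int needed = bytesFor(numDocs);
    if (needed <= bits.length) {
      return false;             // headroom from an earlier growth absorbs the new docs
    }
    bits = Arrays.copyOf(bits, oversize(needed));
    return true;
  }

  void set(int doc) {
    bits[doc >> 3] |= (byte) (1 << (doc & 7));
  }

  boolean get(int doc) {
    return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
  }

  int capacityBytes() {
    return bits.length;
  }
}
```

Because the array length changes only on a reallocation, a pool keyed by array length would see far fewer distinct sizes than one keyed by the exact doc count.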
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12985150#action_12985150 ]

Jason Rutherglen commented on LUCENE-1574:
--

ThreadPoolExecutor can act as a guide; its main parameters are corePoolSize, maximumPoolSize, and keepAliveTime.

With regard to using System.arraycopy (SC) for the RT parallel arrays: if they grow too large, then SC could become a predominant cost. However, if the default number of thread states is 8, which would mean that many DWPTs, the arrays would hopefully never grow large enough for their SC to become noticeably expensive.
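Borrowing the three ThreadPoolExecutor knobs for an array pool might look like the sketch below. `ExpiringArrayPool` is hypothetical: it only reuses the parameter names as an analogy, and the caller passes the current time explicitly (for example from System.nanoTime()) so the expiry logic stays deterministic.

```java
import java.util.ArrayDeque;

/** Hypothetical pool of equal-length arrays sized like a ThreadPoolExecutor. */
final class ExpiringArrayPool {
  private static final class Entry {
    final byte[] array;
    final long releasedAt;
    Entry(byte[] array, long releasedAt) {
      this.array = array;
      this.releasedAt = releasedAt;
    }
  }

  private final int length;           // length of every pooled array
  private final int corePoolSize;     // spares kept regardless of age
  private final int maximumPoolSize;  // hard cap on pooled spares
  private final long keepAliveNanos;  // age after which spares above the core size are dropped
  private final ArrayDeque<Entry> free = new ArrayDeque<Entry>();

  ExpiringArrayPool(int length, int corePoolSize, int maximumPoolSize, long keepAliveNanos) {
    this.length = length;
    this.corePoolSize = corePoolSize;
    this.maximumPoolSize = maximumPoolSize;
    this.keepAliveNanos = keepAliveNanos;
  }

  /** Returns an array to the pool; it is dropped if the pool is already full. */
  synchronized void release(byte[] array, long now) {
    if (array.length == length && free.size() < maximumPoolSize) {
      free.addLast(new Entry(array, now));
    }
  }

  /** Hands out the most recently released spare, or a fresh array if none is pooled. */
  synchronized byte[] acquire(long now) {
    expire(now);
    Entry e = free.pollLast();
    return e != null ? e.array : new byte[length];
  }

  /** Drops the oldest spares beyond the core size once they exceed the keep-alive age. */
  synchronized void expire(long now) {
    while (free.size() > corePoolSize
        && now - free.peekFirst().releasedAt > keepAliveNanos) {
      free.pollFirst();
    }
  }

  synchronized int size() {
    return free.size();
  }
}
```

Contents of a reused array are stale, so a caller would overwrite them (for instance with System.arraycopy from the source array) before use.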
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984996#action_12984996 ]

Jason Rutherglen commented on LUCENE-1574:
--

I'm going to revive this, and if it works fine for trunk, then we can use the basic system for RT eg, LUCENE-2312. I think the only open question is how we'll shrink the pool, most likely there'd be an expiration on the pooled objects. With RT, the parallel arrays will grow, so the pool will need to be size based, eg, when the arrays are grown, all of the previous arrays may be forcefully evicted, or they may simply expire.
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776576#action_12776576 ]

Jason Rutherglen commented on LUCENE-1574:
--

A likely optimization for this patch is that we'll only pool if the doc count is above a threshold; 100,000 seems like a good number. Also, pooling will be optional.
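The threshold check is small enough to sketch directly. `PoolingPolicy` and its method names are hypothetical; only the 100,000 figure and the idea that pooling is optional come from the comment above. The helper `bitVectorBytes` shows the arithmetic behind the cutoff: at one bit per doc, a 100,000-doc segment needs 12,500 bytes for its deleted-docs vector.

```java
/** Hypothetical policy object deciding whether a segment's arrays are worth pooling. */
final class PoolingPolicy {
  static final int DEFAULT_MIN_DOCS = 100000;  // threshold floated in the comment

  private final boolean enabled;  // pooling is optional
  private final int minDocs;

  PoolingPolicy(boolean enabled, int minDocs) {
    this.enabled = enabled;
    this.minDocs = minDocs;
  }

  /** Pool only when enabled and the segment's doc count is above the threshold. */
  boolean shouldPool(int maxDoc) {
    return enabled && maxDoc > minDocs;
  }

  /** Size in bytes of a one-bit-per-doc deleted-docs vector. */
  static int bitVectorBytes(int maxDoc) {
    return (maxDoc + 7) >>> 3;
  }
}
```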
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776299#action_12776299 ]

Jason Rutherglen commented on LUCENE-1574:
--

Yonik, do you recommend using the method in SimpleStringInterner for lockless pooling?
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776321#action_12776321 ]

Jason Rutherglen commented on LUCENE-1574:
--

I suppose as we're on Java 1.5, ConcurrentLinkedQueue can be used.
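A minimal sketch of what a ConcurrentLinkedQueue-backed pool could look like, assuming one pool per segment so that every array has the same length. `ByteArrayPool` and its methods are hypothetical names, not taken from the attached patch.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical lock-free pool of equal-length byte arrays for one segment. */
final class ByteArrayPool {
  private final int length;  // all arrays for a given segment share this length
  private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<byte[]>();

  ByteArrayPool(int length) {
    this.length = length;
  }

  /** Returns a copy of src, reusing a pooled array when one is available. */
  byte[] cloneOf(byte[] src) {
    byte[] dst = free.poll();  // non-blocking; null when the pool is empty
    if (dst == null) {
      dst = new byte[length];
    }
    System.arraycopy(src, 0, dst, 0, length);  // overwrites any stale contents
    return dst;
  }

  /** Hands an array back once no reader references it any more. */
  void release(byte[] array) {
    if (array.length == length) {  // ignore arrays of the wrong size
      free.offer(array);
    }
  }

  int freeCount() {
    return free.size();
  }
}
```

This version never shrinks, so it would still need a bound or expiry on the pooled spares.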
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737950#action_12737950 ]

John Wang commented on LUCENE-1574:
---

Re: Zoie and deleted docs:

That is no longer true, Zoie is using a bloom filter over an intHash set from fastutil for exactly the perf reason Jason pointed out.
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695130#action_12695130 ]

Jason Rutherglen commented on LUCENE-1574:
--

True, the pool would hold onto spares, but they would expire. It's mostly useful for the large on-disk segments, as those byte arrays (for BitVectors) are large, and because there are more docs in them they would get hit with deletes more often, and so they'd be reused fairly often.

I'm not knowledgeable enough to say whether the transactional data structure will be fast enough. We had been using http://fastutil.dsi.unimi.it/docs/it/unimi/dsi/fastutil/ints/IntRBTreeSet.html in Zoie for deleted docs and it's way slow. Binary search of an int array is faster, albeit not fast enough.

The multi-dimensional array thing isn't fast enough (for searching) as we implemented this in Bobo. It's implemented in Bobo because we have a multi-value field cache (which is quite large because for each doc we're storing potentially 64 or more values in an in-place bitset) and a single massive array kills the GC. In some cases this is faster than a single large array because of the way Java (or the OS?) transfers memory around through the CPU cache.
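For reference, the two lookup styles compared above can be put side by side. This is only a sketch of the access patterns, using java.util.BitSet as a stand-in for Lucene's BitVector; `DeletedDocsLookup` is a hypothetical name, and nothing here measures or implies the speed difference described in the comment.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helpers contrasting two deleted-docs representations. */
final class DeletedDocsLookup {
  /** O(log n) per check: binary search over a sorted array of deleted docIDs. */
  static boolean isDeletedSorted(int[] sortedDeletedDocs, int doc) {
    return Arrays.binarySearch(sortedDeletedDocs, doc) >= 0;
  }

  /** O(1) per check: one bit per doc in the segment. */
  static boolean isDeletedBits(BitSet deleted, int doc) {
    return deleted.get(doc);
  }
}
```

The sorted array costs memory proportional to the number of deletes, while the bit set costs memory proportional to the segment's doc count, which is why cloning it per reader is the expense that pooling targets.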