Re: FST and FieldCache?
You cannot get a string out of an automaton by its ordinal without storing additional data. The string is stored there not as a single arc, but as a sequence of them (basically... err... as a string), so referencing them is basically writing the string as-is. Space savings here come from sharing arcs between strings. Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot... good grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer> that accepts a predefined set of N strings and returns an int in the 0..N-1 interval. And it can't do the reverse lookup, by design; that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:
> I've been pondering how to reduce the size of FieldCache entries when there are a large number of Strings. I'd like to facet on such a field with Solr, but with less memory. As I understand it, FSTs are a highly compressed representation of a set of Strings (among other possibilities). The FieldCache would need to point to an FST entry (an arc?) using something small, say an integer. Is there a way to point to an FST entry with an integer, and then somehow, with relative efficiency, construct the String from the arcs to get there?
> ~ David Smiley
> Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: FST and FieldCache?
I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)

If you put the ord as an output, the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from the root, and the shared prefix alone won't give you this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... The one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example, if you have a node like this and seek the 12th term:

0 -- 10 -- ...
  +- 10 -- ...
  +-  5 -- ...

you look at the first path: it'd give you terms 1..10. The next one contains terms 11..20, so you add 10 to an internal counter (which is added to further computations), descend, and repeat the procedure until you find a leaf node.

Dawid

There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of a linear scan on each node. Taking your example:

0 --  0 -- ...
  +- 10 -- ...   <-- for the 12th term we descend along this edge, as it has the biggest tag less than 12
  +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
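A minimal, self-contained sketch of the lookup Dawid walks through above. The Node structure and per-arc leaf counts are illustrative inventions for this example, not Lucene's actual FST representation:

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical trie node: each child arc carries a label and the number of
// terms (leaves) reachable through it.
class Node {
  final boolean isTerm;                                  // does a term end here?
  final List<Character> labels = new ArrayList<Character>();
  final List<Node> children = new ArrayList<Node>();
  final List<Integer> leafCounts = new ArrayList<Integer>(); // terms under each child
  Node(boolean isTerm) { this.isTerm = isTerm; }
}

class OrdLookup {
  /** Returns the ord-th term (0-based) by descending the count-annotated trie. */
  static String termByOrd(Node root, int ord) {
    StringBuilder sb = new StringBuilder();
    Node node = root;
    while (true) {
      if (node.isTerm) {
        if (ord == 0) return sb.toString();              // the term ending at this node
        ord--;                                           // skip it and keep descending
      }
      int i = 0;
      while (ord >= node.leafCounts.get(i)) {            // linear scan over sibling arcs
        ord -= node.leafCounts.get(i);                   // all their terms precede ours
        i++;
      }
      sb.append(node.labels.get(i));
      node = node.children.get(i);
    }
  }
}
{code}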
Re: FST and FieldCache?
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
> > That's what I invented, and yes, it was invented by countless people before :)
>
> You know I didn't mean to sound rude, right? I'm really admiring your ability to come up with these solutions by yourself; I'm merely copying other folks' ideas. I tried to prevent another reference to Mr. Daciuk :)
>
> Anyway, the optimization you're describing is certainly possible. Lucene's FST implementation can actually combine both approaches, because always expanding nodes is inefficient, and those already expanded will allow a binary search (assuming the automaton structure is known to the implementation). Another refinement of this idea creates a detached table (err... index :) of states to start from inside the automaton, so that you don't have to go through the initial 2-3 states, which are more or less always large, and even binary search is costly there.
>
> Dawid

But you have to look up this err... index somehow. And that's either a binary or hash lookup. Where's the win?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
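A sketch of the prefix-sum refinement described above: each node stores, per child, the number of terms under all previous siblings, so choosing the child for a given ord becomes a binary search. The structures are again hypothetical, not Lucene's FST arrays:

{code}
import java.util.Arrays;

// offsets[i] = number of terms under children 0..i-1 (so offsets[0] == 0);
// the child to descend into for a given ord is the last offset <= ord.
class PrefixSumNode {
  int[] offsets;               // sorted cumulative counts
  PrefixSumNode[] children;
  char[] labels;

  int childFor(int ord) {
    int idx = Arrays.binarySearch(offsets, ord);
    return idx >= 0 ? idx : -idx - 2;   // insertion point minus one
  }
}
{code}

The caller then subtracts offsets[childFor(ord)] from ord and repeats one level down, exactly as in the linear-scan version.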
Re: FST and FieldCache?
This is more about compressing strings in the TermsIndex, I think. And the ability to use said TermsIndex directly in some cases that required FieldCache before. (Maybe FC is still needed, but it can be degraded to a docId-ord map, storing the actual strings in the TI.) This yields fat space savings when we, eg, need to both look up on a field and build facets out of it.

mmap is cool :) What I want to see is an FST-based TermsDict that is simply mmaped into memory, without building intermediate indexes, like Lucene does now. And docvalues are orthogonal to that, no?

On Thu, May 19, 2011 at 17:22, Jason Rutherglen jason.rutherg...@gmail.com wrote:
> > maybe thats because we have one huge monolithic implementation
>
> Doesn't the DocValues branch solve this? Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bear fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are significant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable?
>
> On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
> > 2011/5/19 Michael McCandless luc...@mikemccandless.com:
> > > Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]).
> >
> > or should we actually try to have different fieldcacheimpls? I see all these missions to refactor the thing, which always fail. maybe thats because we have one huge monolithic implementation.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
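For reference, memory-mapping a file from Java is a couple of NIO calls. This generic sketch uses a made-up file name and is not how MMapDirectory or any real terms dictionary is structured:

{code}
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
  public static void main(String[] args) throws Exception {
    // Map a hypothetical on-disk terms dictionary read-only. Pages are faulted
    // in lazily by the OS and evicted under memory pressure, which is the
    // "unloading happens automatically" behavior discussed in this thread.
    try (RandomAccessFile raf = new RandomAccessFile("terms.fst", "r");
         FileChannel ch = raf.getChannel()) {
      MappedByteBuffer terms = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      System.out.println("mapped " + terms.capacity() + " bytes");
    }
  }
}
{code}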
Re: FST and FieldCache?
On Thu, May 19, 2011 at 20:43, Michael McCandless luc...@mikemccandless.com wrote: On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? I don't know of a straight up comparison... I did compare MMapDir vs RAMDir variant a couple of years ago. Searches slowed down a teeny-weeny little bit. GC times went down noticeably. For me it was a big win. Whatever Mike might say, mmap is great for latency-conscious applications : ) If someone tries to create artificial benchmark for byte[] VS ByteBuffer, I'd recommend going through Lucene's abstraction layer. If you simply read/write in a loop, JIT will optimize away boundary checks for byte[] in some cases. This didn't ever happen to *Buffer family for me. There's also RAM based SSDs whose performance could be comparable with well, RAM. True, though it's through layers of abstraction designed originally for serving files off of spinning magnets :) Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically? True, but really if the app knows it won't need that FC entry for a long time (ie, long enough to make it worth unloading/reloading) then it should really unload it. MMap would still have to write all those pages to disk... DocValues actually makes this a lot cheaper because loading DocValues is much (like ~100 X from Simon's testing) faster than populating FieldCache since FieldCache must do all the uninverting. Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
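The JIT remark is easy to probe with a rough micro-benchmark along these lines. This is illustrative only, not a rigorous benchmark (a proper harness with warm-up and multiple iterations is needed for real numbers), and it deliberately bypasses Lucene's abstraction layer:

{code}
import java.nio.ByteBuffer;

public class SumBench {
  // Summing through a plain array: HotSpot can typically hoist the bounds
  // check out of a simple counted loop like this.
  static long sumArray(byte[] data) {
    long sum = 0;
    for (int i = 0; i < data.length; i++) sum += data[i];
    return sum;
  }

  // Same loop through a ByteBuffer: every get(i) goes through the Buffer API,
  // which has historically been harder for the JIT to optimize as aggressively.
  static long sumBuffer(ByteBuffer data) {
    long sum = 0;
    for (int i = 0; i < data.limit(); i++) sum += data.get(i);
    return sum;
  }

  public static void main(String[] args) {
    byte[] raw = new byte[1 << 20];
    ByteBuffer wrapped = ByteBuffer.wrap(raw);
    for (int i = 0; i < 100; i++) { sumArray(raw); sumBuffer(wrapped); }  // warm-up
    long t0 = System.nanoTime(); sumArray(raw);
    long t1 = System.nanoTime(); sumBuffer(wrapped);
    long t2 = System.nanoTime();
    System.out.println("array: " + (t1 - t0) + " ns, buffer: " + (t2 - t1) + " ns");
  }
}
{code}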
Re: Moving towards Lucene 4.0
On Thu, May 19, 2011 at 21:44, Chris Hostetter hossman_luc...@fucit.org wrote:
> : I think we should focus on everything that's *infrastructure* in 4.0, so
> : that we can develop additional features in subsequent 4.x releases. If we
> : end up releasing 4.0 just to discover many things will need to wait to 5.0,
> : it'll be a big loss.
>
> the catch with that approach (i'm speaking generally here, not with any of these particular lucene examples in mind) is that it's hard to know that the infrastructure really makes sense until you've built a bunch of stuff on it -- i think Josh Bloch has a paper where he says that you shouldn't publish an API abstraction until you've built at least 3 *real* (ie: not just toy or example) implementations of that API. it would be really easy to say "the infrastructure for X, Y, and Z is all in 4.0, features that leverage this infra will start coming in 4.1" and then discover on the way to 4.1 that we botched the APIs.

How do I express my profound love for these words, while remaining chaste? : )

> what does this mean concretely for the specific big ticket changes that we've got on trunk? ... i dunno, just my word of caution.
>
> : we just started the discussion about Lucene 3.2 and releasing more
> : often. Yet, I think we should also start planning for Lucene 4.0 soon.
> : We have tons of stuff in trunk that people want to have and we can't
> : just keep on talking about it - we need to push this out to our users.
>
> I agree, but i think the other approach we should take is to be more aggressive about reviewing things that would be good candidates for backporting. If we feel like some feature has a well defined API on trunk, and it's got good tests, and people have been using it and filing bugs and helping to make it better, then we should consider it a candidate for backporting -- if the merge itself looks like it would be a huge pain in the ass we don't *have* to backport, but we should at least look. That may not help for any of the big ticket infra changes discussed in this thread (where we know it really needs to wait for a major release) but it would definitely help with the "get features out to users faster" issue.
>
> -Hoss

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Fuzzy search always returning docs sorted by the highest match
You aren't likely to encounter strings like abc company inc in Lucene index, as it will be tokenized into three tokens abc, company, inc under most Analyzers. So, for this exact example you don't even need fuzzy matching. Also, maybe you should try 'user' mailing list for questions regarding the use of Lucene. On Wed, May 18, 2011 at 00:54, Guilherme Aiolfi grad...@gmail.com wrote: I'm re-sending my first message because I've just received the mailing-list confirmation. If it's a duplicated, forget about this one. Hi, I want to do a fuzzy search and always return documents no matter what the score. So, to do this, I'm tried sorting by strdist() in solr 3.1. It worked great and does ALMOST exactly what I wanted. The problem is that the algorithms supported jw, ngram and edit are not the best fit for my scenario. The best results come from StrikeAMatch (http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/). So, I've found this link https://issues.apache.org/jira/browse/LUCENE-2230 that implemented what I wanted. But I was told that I should use trunk because there were some really great news about fuzzy search there. I read this article explaining some changes http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html. But I still don't think it replaces the StrikeAMatch algo, because that one can have best results in searches like abc comparing to strings like abc company inc (distance 2). But still, Fuad Efendi told me that StrikeAMatch is toys for kids compare to the state of lucene trunk. So here I'm, I want to know how 4.0 will help achieve what I want. Thanks. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
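To illustrate the tokenization point, a minimal sketch assuming a 3.x-era StandardAnalyzer (exact class and method names vary slightly across Lucene versions):

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    TokenStream ts = analyzer.tokenStream("name", new StringReader("ABC Company Inc."));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());   // prints: abc, company, inc
    }
    ts.end();
    ts.close();
  }
}
{code}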
Re: Lucene/Solr JIRA
+1 to Chris. Even if the code is partially shared and the project is the same, the end products are completely different. Merging lists/jira will force niche developers/users to manually sift through heaps of irrelevant emails/issues.

On Thu, May 19, 2011 at 00:53, Chris Hostetter hossman_luc...@fucit.org wrote:

: just a few words. I disagree here with you hoss IMO the suggestion to
: merge JIRA would help to move us closer together and help close the
: gap between Solr and Lucene. I think we need to start identifying us
: with what we work on. It feels like we don't do that today and we
: should work hard to stop that and make hard breaks that might hurt but

I just don't see how you think that would help anything ... we still need to distinguish Jira issues to identify where in the stack they affect. If there is a divide among the developers because of the niches where they tend to work, will that divide magically go away because we partition all issues using the component feature instead of by the Jira project feature? I don't really see how that makes any sense. Even if we all thought it did, and even if the cost/effort of migrating/converting were totally free, the user bases (who interact with the Solr APIs vs directly using the Lucene-Core/Module APIs) are so distinct that I genuinely think sticking with distinct Jira Projects makes more sense for our users.

: JIRA. I'd go even further and nuke the name entirely and call
: everything lucene - I know not many folks like the idea and it might
: take a while to bake in but I think for us (PMC / Committers) and the

Everything already is called Lucene ... the Project is Apache Lucene, the community is Lucene ... the Lucene project currently releases several products, and one of them is called Apache Solr ... if you're suggesting that we should ultimately eliminate the name Solr, then we'd still have to decide what we're going to call that end product, the artifact that we ship that provides the abstraction layer that Solr currently provides. Even if you mean to suggest that we should only have one unified product -- one singular release artifact -- that abstraction layer still needs a name. The name we have now is Solr; it has brand awareness and a user base who understands what it means to say they are "Installing Solr" or that a new feature is available when "Using Solr". Eliminating that name doesn't seem like it would benefit the user community in any way.

-Hoss

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Fuzzy search always returning docs sorted by the highest match
I'm baffled. As probably are you. If all you want is a fuzzy match against a list of strings, Lucene is a huge, fat overkill, and you need to look elsewhere.

2011/5/19 Guilherme Aiolfi grad...@gmail.com:
> Well, it was about the implementation of an algorithm that was proposed by a user and was implemented in another way. And this, and not the user mailing list, was recommended by this developer to ask this question. So, not entirely my fault. But I apologize for the inconvenience.
>
> I just want to clarify that searching for the tokens separately is not what I want, since those words can exist but not all in the same doc. I want to compare the whole phrase. For that to work I'm not using any Analyzer. As I said, I've got it working, but I don't know how to use the right algorithm for the job. I'm going to redirect my question to the other mailing list. Thanks anyway.
>
> On Wed, May 18, 2011 at 6:32 PM, Earwin Burrfoot ear...@gmail.com wrote:
> > [...]

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
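For completeness: the StrikeAMatch measure referenced in this thread is essentially a Dice coefficient over adjacent letter pairs. A compact standalone sketch, independent of Lucene, looks roughly like this:

{code}
import java.util.ArrayList;
import java.util.List;

public class StrikeAMatch {
  // Letter pairs of each whitespace-separated word, e.g. "abc co" -> [ab, bc, co]
  static List<String> letterPairs(String s) {
    List<String> pairs = new ArrayList<String>();
    for (String word : s.toLowerCase().split("\\s+")) {
      for (int i = 0; i < word.length() - 1; i++) {
        pairs.add(word.substring(i, i + 2));
      }
    }
    return pairs;
  }

  // Dice coefficient: 2 * |shared pairs| / (|pairs1| + |pairs2|), in [0, 1].
  static double similarity(String a, String b) {
    List<String> p1 = letterPairs(a);
    List<String> p2 = new ArrayList<String>(letterPairs(b));
    int union = p1.size() + p2.size();
    int shared = 0;
    for (String pair : p1) {
      if (p2.remove(pair)) shared++;   // consume the match so duplicates count once
    }
    return union == 0 ? 0.0 : (2.0 * shared) / union;
  }

  public static void main(String[] args) {
    System.out.println(similarity("abc", "abc company inc"));  // ~0.33
  }
}
{code}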
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034639#comment-13034639 ] Earwin Burrfoot commented on LUCENE-3105: - StringInterner is in fact faster than CHM. And is compatible with String.intern(), ie - it returns the same String instances. It also won't eat up memory if spammed with numerous unique strings (which is a strange feature, but people requested that). In Lucene 4.0 all of this is moot anyway, fields there are strongly separated and intern() is not used. String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unqiue field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
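A bare-bones illustration of what "compatible with String.intern()" can look like with a ConcurrentHashMap in front. This is a sketch, not the attached patch or Lucene's StringHelper/StringInterner; a production version would also bound the map so numerous unique strings can't grow it without limit:

{code}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interner: returns the same canonical instances as String.intern(),
// but serves repeat lookups from a ConcurrentHashMap instead of crossing into the
// JVM's intern table every time.
final class CachingInterner {
  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<String, String>();

  public String intern(String s) {
    String cached = cache.get(s);
    if (cached != null) return cached;
    String canonical = s.intern();        // slow path, taken once per distinct string
    cache.putIfAbsent(canonical, canonical);
    return canonical;
  }
}
{code}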
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034640#comment-13034640 ] Earwin Burrfoot commented on LUCENE-3105: - Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm? String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unqiue field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032936#comment-13032936 ] Earwin Burrfoot commented on LUCENE-3092: - Chris, I don't like the idea of expanding IOContext again and again, but this case seems in line with intended purporse - give Directory implementation hints as to what we're going to do with it. I don't like events either. They look fragile and binding them to threads is a WTF. With all our pausing/unpausing magic there's no guarantee merge will end on the same thread it started on. bq. Stuff like FlushPolicy could take information about concurrent merges and hold of flushes for a little while if memory allows it etc. Coordinating access to shared resource (IO subsystem) with events is very awkward. Ok, your FlushPolicy receives events from MergePolicy and holds flushes during merge. _Now, when a flush is in progress, should FlushPolicy notify MergePolicy so it can hold its merges?_ It goes downhill from there. What if FP and MP fire events simultaneously? :) What should other listeners do? Try looking at a bigger picture. Merges are not your problem. Neither are flushes. Your problem is that several threads try to take their dump on disk simultaneously (for whatever reason, you don't really care). So what we need is an arbitration mechanism for Directory writes. A mechanism located presumably @ Directory level (eg, we don't need to throttle anything when writing to RAMDir). One possible implementation is that we add a constructor parameter to FSDirectory specifying desired level of IO parallelism, and then it keeps track of its IndexOutputs and stalls writes selectively. We can also add 'expectedWriteSize' to IOContext, so the Directory may favor shorter writes over bigger ones. Instead of 'expectedWriteSize' we can use 'priority'. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
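As a rough illustration of "arbitration at the Directory level" (purely hypothetical; this is not Lucene's Directory or IOContext API, and the threshold is made up), a gate that lets small writes through and bounds how many large writers hit the disk at once could look like:

{code}
import java.util.concurrent.Semaphore;

// Hypothetical write arbiter: flushes/merges announce their expected size and
// the gate throttles only the big ones.
final class WriteGate {
  private static final long SMALL_WRITE_BYTES = 1L << 20;   // 1 MB passes through freely
  private final Semaphore largeWritePermits;

  WriteGate(int maxParallelLargeWrites) {
    this.largeWritePermits = new Semaphore(maxParallelLargeWrites, true);
  }

  void write(long expectedBytes, Runnable body) throws InterruptedException {
    if (expectedBytes < SMALL_WRITE_BYTES) {
      body.run();                    // small flushes never wait
      return;
    }
    largeWritePermits.acquire();     // big merges/flushes queue up here
    try {
      body.run();
    } finally {
      largeWritePermits.release();
    }
  }
}
{code}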
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032989#comment-13032989 ] Earwin Burrfoot commented on LUCENE-3092: - bq. but I couldn't disagree more that this is an issue with an Event model There are no issues with event model itself. It's just that this model is badly suitable for this issue's usecase. Event listeners are good. Using them to emulate what is essentially a mutex - is ugly and fragile as hell. bq. We have a series of components in Lucene; Directories, IndexWriter, MergeScheduler etc, and we have some crosscutting concerns such as merges themselves. My point is that for many concerns they shouldn't necessarily be crosscutting. Eg - Directory can support IO priorities/throttling, so it doesn't have to know about merges or flushes. Many OSes have have special APIs that allow IO prioritization, do they know about merges, or Lucene at all? No. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032997#comment-13032997 ] Earwin Burrfoot commented on LUCENE-3092: - bq. The IOCtx should reference the OneMerge (if in fact this file is being opened because of a merge)? IOCtx should have a value 'expectedSize', or 'priority', or something similar. This does not introduce a transitive dependency of Directory from MergePolicy (to please you once more - a true WTF), and this allows to apply the same logic to flushes. Eg - all small flushes/merges go to cache, all big flushes/merges go straight to disk. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032046#comment-13032046 ] Earwin Burrfoot commented on LUCENE-3084: - * Speaking logically, merges operate on Sets of SIs, not List? * Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need. MergePolicy.OneMerge.segments should be ListSegmentInfo not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to ListSI instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
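Schematically, the composition-over-subclassing suggestion amounts to something like this (illustrative field and method names, not the real SegmentInfos class):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SegmentInfo { /* placeholder for the real per-segment metadata */ }

// SegmentInfos *has* a list and exposes just the operations its clients need,
// rather than *being* a Vector/List subclass.
final class SegmentInfosSketch {
  private final List<SegmentInfo> segments = new ArrayList<SegmentInfo>();
  private long version;          // the extra bookkeeping fields stay private
  private long generation;

  void add(SegmentInfo si) { segments.add(si); }
  int size() { return segments.size(); }
  List<SegmentInfo> asList() { return Collections.unmodifiableList(segments); }
}
{code}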
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032099#comment-13032099 ] Earwin Burrfoot commented on LUCENE-3084: - bq. Merges are ordered Hmm.. Why should they be? bq. SegmentInfos itself must be list It may contain list as a field instead. And have a much cleaner API as a consequence. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code? MergePolicy.OneMerge.segments should be ListSegmentInfo not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to ListSI instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3077) DWPT doesn't see changes to DW#infoStream
[ https://issues.apache.org/jira/browse/LUCENE-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029881#comment-13029881 ] Earwin Burrfoot commented on LUCENE-3077: - We should just make it final everywhere ... DWPT doesn't see changes to DW#infoStream - Key: LUCENE-3077 URL: https://issues.apache.org/jira/browse/LUCENE-3077 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 4.0 DW does not push infostream changes to DWPT since DWPT#infoStream is final and initialized on DWPTPool initialization (at least for initial DWPT) we should push changes to infostream to DWPT too -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: I was accepted in GSoC!!!
By the way, guys. LuSolr SVN repository is mirrored @ git://git.apache.org/lucene-solr.git , which is in turn mirrored @ https://github.com/apache/lucene-solr . Working with git (maybe with stgit) is easier than juggling patches by hand. On Wed, May 4, 2011 at 15:00, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi Uwe, do you mean one issue per GSoC proposal, or one for every logical unit in the project? If the second: Robert told me to use the flexscoring branch as a base for my project, since preliminary work has already been done in that branch. Should I open JIRA issues nevertheless? Thanks, David On 2011 May 04, Wednesday 09:56:02 Uwe Schindler wrote: Hi Vinicius, Submitting patches via JIRA is fine! We were just thinking about possibly providing some SVN to work with (as additional training), but came to the conclusion, that all students should go the standard Apache Lucene way of submitting patches to JIRA issues. You can of course still use SVN / GIT locally to organize your code. At the end we just need a patch to be committed by one of the core committers. Uwe - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029403#comment-13029403 ]

Earwin Burrfoot commented on LUCENE-2904:
-----------------------------------------

I think we should simply change the API for MergePolicy. Instead of SegmentInfos it should accept a Set<SegmentInfo> with the SIs eligible for merging (eg, completely written, not elected for another merge). IW.getMergingSegments() is a damn cheat, and an "Expert" notice is not an excuse! :) Why should each and every MP do the set subtraction when IW can do it for them once and for all?

non-contiguous LogMergePolicy should be careful to not select merges already running
    Key: LUCENE-2904
    URL: https://issues.apache.org/jira/browse/LUCENE-2904
    Project: Lucene - Java
    Issue Type: Bug
    Reporter: Michael McCandless
    Assignee: Michael McCandless
    Priority: Minor
    Fix For: 3.2, 4.0
    Attachments: LUCENE-2904.patch

Now that LogMP can do non-contiguous merges, the fact that it disregards which segments are already being merged is more problematic since it could result in it returning conflicting merges and thus failing to run multiple merges concurrently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029408#comment-13029408 ] Earwin Burrfoot commented on LUCENE-2904: - Ok, I'm wrong. We need both a list of all SIs and eligible SIs for calculations. But that should be handled through API change, not a new public method on IW. non-contiguous LogMergePolicy should be careful to not select merges already running Key: LUCENE-2904 URL: https://issues.apache.org/jira/browse/LUCENE-2904 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2904.patch Now that LogMP can do non-contiguous merges, the fact that it disregards which segments are already being merged is more problematic since it could result in it returning conflicting merges and thus failing to run multiple merges concurrently. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
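Sketched as a signature (hypothetical shape, not what trunk actually has), the suggestion is for IndexWriter to compute the eligible set once and hand the policy both views:

{code}
import java.util.List;
import java.util.Set;

class SegmentInfo { /* placeholder */ }

// IndexWriter, not each policy, subtracts the segments that are already being
// merged; the policy gets the full segment list for context plus the eligible set.
interface MergePolicySketch {
  // returns the selected merges, each expressed as a list of segments
  List<List<SegmentInfo>> findMerges(List<SegmentInfo> allSegments,
                                     Set<SegmentInfo> eligibleSegments);
}
{code}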
[jira] [Commented] (LUCENE-3065) NumericField should be stored in binary format in index (matching Solr's format)
[ https://issues.apache.org/jira/browse/LUCENE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029421#comment-13029421 ] Earwin Burrfoot commented on LUCENE-3065: - It's sad NumericFields are hardbaked into index format. Eg - I have some fields that are similar to Numeric in that they are 'stringified' binary structures, and they can't become first-class in the same manner as Numeric. NumericField should be stored in binary format in index (matching Solr's format) Key: LUCENE-3065 URL: https://issues.apache.org/jira/browse/LUCENE-3065 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch (Spinoff of LUCENE-3001) Today when writing stored fields we don't record that the field was a NumericField, and so at IndexReader time you get back an ordinary Field and your number has turned into a string. See https://issues.apache.org/jira/browse/LUCENE-1701?focusedCommentId=12721972page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12721972 We have spare bits already in stored fields, so, we should use one to record that the field is numeric, and then encode the numeric field in Solr's more-compact binary format. A nice side-effect is we fix the long standing issue that you don't get a NumericField back when loading your document. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3041) Support Query Visiting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot commented on LUCENE-3041:
-----------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Support Query Visiting / Walking
    Key: LUCENE-3041
    URL: https://issues.apache.org/jira/browse/LUCENE-3041
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 4.0
    Reporter: Chris Male
    Assignee: Simon Willnauer
    Priority: Minor
    Fix For: 4.0
    Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visiting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM:
------------------------------------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? Same can be said for tests.

What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in... well... exceptional cases, like ambiguity/no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown.

was (Author: earwin):
The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Support Query Visiting / Walking
    Key: LUCENE-3041
    URL: https://issues.apache.org/jira/browse/LUCENE-3041
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 4.0
    Reporter: Chris Male
    Assignee: Simon Willnauer
    Priority: Minor
    Fix For: 4.0
    Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
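Pulling the two comments together, a minimal sketch of a reflection-dispatched visitor with a thread-safe ConcurrentHashMap cache that rethrows the original exception. This is hypothetical code, not the attached patch:

{code}
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;

// Subclasses declare public visit(SomeQueryType q) methods; dispatch resolves the
// most specific one, caches it keyed by (visitor class, argument class), and
// rethrows original failures as-is.
abstract class ReflectiveVisitor {
  private static final ConcurrentHashMap<String, Method> CACHE =
      new ConcurrentHashMap<String, Method>();

  public Object dispatch(Object arg) {
    String key = getClass().getName() + "#" + arg.getClass().getName();
    Method m = CACHE.get(key);
    if (m == null) {
      m = resolve(arg.getClass());
      CACHE.putIfAbsent(key, m);
    }
    try {
      return m.invoke(this, arg);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InvocationTargetException e) {
      Throwable cause = e.getCause();          // transparently rethrow client failures
      if (cause instanceof RuntimeException) throw (RuntimeException) cause;
      if (cause instanceof Error) throw (Error) cause;
      throw new IllegalStateException(cause);  // checked exceptions get wrapped
    }
  }

  private Method resolve(Class<?> type) {
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      try {
        return getClass().getMethod("visit", c);
      } catch (NoSuchMethodException ignored) {
        // keep walking up; a real version would also check interfaces and
        // report ambiguous matches instead of silently picking one
      }
    }
    throw new IllegalArgumentException("no visit(...) method for " + type);
  }
}
{code}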
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027626#comment-13027626 ] Earwin Burrfoot commented on LUCENE-3061: - Mark these as @experimental? Open IndexWriter API to allow custom MergeScheduler implementation -- Key: LUCENE-3061 URL: https://issues.apache.org/jira/browse/LUCENE-3061 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3061.patch, LUCENE-3061.patch IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
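To make the quoted proposal concrete, a toy selection routine (not LogMP or BalancedMP code; sizes and thresholds are illustrative) that greedily packs up to mergeFactor segments while keeping the merged result under a target size:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Pick the largest segments first and stop once the merged result would exceed
// maxResultSegmentBytes or mergeFactor inputs.
final class TargetSizeMergeSelector {
  static List<Long> pickOneMerge(List<Long> segmentSizes,
                                 long maxResultSegmentBytes,
                                 int mergeFactor) {
    List<Long> sorted = new ArrayList<Long>(segmentSizes);
    Collections.sort(sorted, Collections.reverseOrder());

    List<Long> picked = new ArrayList<Long>();
    long total = 0;
    for (Long size : sorted) {
      if (size >= maxResultSegmentBytes) continue;        // already "done", leave it alone
      if (picked.size() == mergeFactor) break;
      if (total + size > maxResultSegmentBytes) continue;  // try a smaller segment instead
      picked.add(size);
      total += size;
    }
    return picked.size() >= 2 ? picked : Collections.<Long>emptyList();
  }

  public static void main(String[] args) {
    // 5 GB and 7 GB segments do get merged under a 20 GB cap, unlike with a
    // plain maxMergeMB-style threshold.
    List<Long> sizes = java.util.Arrays.asList(7L, 5L, 3L, 2L, 1L);
    System.out.println(pickOneMerge(sizes, 20L, 10));      // -> [7, 5, 3, 2, 1]
  }
}
{code}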
Re: MergePolicy Thresholds
Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ I agree. I wonder tough if the knobs we give on LogMP are intuitive enough. It neatly avoids uber-merges I didn't see that I can define what uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. No, you can't. But you can tell it to have exactly (not 'at most') N top-tier segments and try to keep their sizes close with merges. Whatever that size may be. And this is exactly what I want. And defining max cap on segment size is not what I want. So the same set of knobs can be intuitive and meaningful for one person, and useless for another. And you can't pick the best one. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. 
One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail
Re: Setting the max number of merge threads across IndexWriters
Almost any design that keeps circular references between components is broken. Inability to share MergeSchedulers is just another testimonial to that. 2011/4/16 Shai Erera ser...@gmail.com: Hi This was raised in LUCENE-2755 (along with other useful refactoring to MS-IW-MP interaction). Here is the relevant comment which addresses Jason's particular issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029 In short, we can refactor CMS to not hold to an IndexWriter member if we change a lot of the API. But IMO, an ExecutorServiceMS is the right way to go, if you don't mind giving up some CMS features, like controlling thread priority and stalling running threads. In fact, even w/ ExecutorServiceMS you can still achieve some (e.g., stalling), but some juggling will be required. Then, instead of trying to factor out IW members from this MS, you could share the same ES with all MS instances, each will keep a reference to a different IW member. This is just a thought though, I haven't tried it. Shai On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! 
I think I have seen this suggestion somewhere maybe you need to see if there is one already simon -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
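As a rough illustration of the ExecutorService-based idea discussed above: one shared, bounded thread pool that every writer's scheduler submits its merges to, so the aggregate merge concurrency is capped process-wide. The MergeScheduler/IndexWriter interaction is deliberately simplified here; `Runnable mergeTask` is a stand-in for whatever a real scheduler would hand off, not the actual Lucene API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedMergePool {
  private final ExecutorService pool;

  public SharedMergePool(int maxMergeThreadsAcrossAllWriters) {
    // One bounded pool shared by every writer caps the total number of merge threads.
    this.pool = Executors.newFixedThreadPool(maxMergeThreadsAcrossAllWriters);
  }

  /** Each writer's scheduler hands its pending merges here instead of spawning its own threads. */
  public void submit(Runnable mergeTask) {
    pool.submit(mergeTask);
  }

  public void shutdown() {
    pool.shutdown();
  }
}
```

Per-writer schedulers would then hold only the shared pool (plus whatever per-writer state they need), rather than a back-reference to their IndexWriter.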
Re: Setting the max number of merge threads across IndexWriters
You don't mean 'static var' under 'global'? I hope, very much. 2011/4/16 Jason Rutherglen jason.rutherg...@gmail.com: I'd rather not lose [important] functionality. I think a global max thread count is the least intrusive way to go, however I also need to see if that's possible. If so I'll open an issue and post a patch. 2011/4/15 Shai Erera ser...@gmail.com: Hi This was raised in LUCENE-2755 (along with other useful refactoring to MS-IW-MP interaction). Here is the relevant comment which addresses Jason's particular issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029 In short, we can refactor CMS to not hold to an IndexWriter member if we change a lot of the API. But IMO, an ExecutorServiceMS is the right way to go, if you don't mind giving up some CMS features, like controlling thread priority and stalling running threads. In fact, even w/ ExecutorServiceMS you can still achieve some (e.g., stalling), but some juggling will be required. Then, instead of trying to factor out IW members from this MS, you could share the same ES with all MS instances, each will keep a reference to a different IW member. This is just a thought though, I haven't tried it. Shai On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! 
I think I have seen this suggestion somewhere maybe you need to see if there is one already simon -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027361#comment-13027361 ] Earwin Burrfoot commented on LUCENE-3055: - Could anyone remind me, why the hell do we still have Analyzer.tokenStream AND reusableTokenStream rampaging around and confusing minds? We always recommend to use the latter, Robert just fixed some of the core classes to use the latter. Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into plain and dumb factory interface? LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers -- Key: LUCENE-3055 URL: https://issues.apache.org/jira/browse/LUCENE-3055 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Ian Soboroff LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream(). These issues don't indicate a new method of doing this. The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
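A hedged sketch of the "Analyzer as a plain, dumb factory with reuse hidden in the internals" idea from the comment above. The `TokenSource`/`TokenSourceFactory` names are made up for illustration, not the real Lucene Analyzer/TokenStream API: the factory only knows how to build a stream, and a small caching wrapper keeps one instance per thread and resets it on reuse.

```java
import java.io.Reader;

interface TokenSource {
  void reset(Reader input);                              // re-point the existing stream at new text
}

interface TokenSourceFactory {
  TokenSource create(String fieldName, Reader input);    // plain, stateless factory
}

class ReusingWrapper {
  private final TokenSourceFactory factory;
  // one cached stream per thread, so callers never see each other's state
  private final ThreadLocal<TokenSource> cached = new ThreadLocal<TokenSource>();

  ReusingWrapper(TokenSourceFactory factory) {
    this.factory = factory;
  }

  TokenSource tokenStream(String field, Reader input) {
    TokenSource ts = cached.get();
    if (ts == null) {                    // first use on this thread: build the stream
      ts = factory.create(field, input);
      cached.set(ts);
    } else {
      ts.reset(input);                   // subsequent uses: just re-point it
    }
    return ts;
  }
}
```

A real version would also need per-field caching, but the point is that the factory itself stays trivial while reuse lives in one place.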
[jira] [Commented] (LUCENE-2571) Indexing performance tests with realtime branch
[ https://issues.apache.org/jira/browse/LUCENE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13020217#comment-13020217 ] Earwin Burrfoot commented on LUCENE-2571: - bq. Merges are NOT blocking indexing on trunk no matter which MP you use. Well.. merges tie up IO (especially if not on fancy SSDs/RAIDs), which in turn lags flushes - bigger delays for stop the world flushes / lower bandwith cap (after which they are forced to stop the world) for parallel flushes. So Lance's point is partially valid. Indexing performance tests with realtime branch --- Key: LUCENE-2571 URL: https://issues.apache.org/jira/browse/LUCENE-2571 Project: Lucene - Java Issue Type: Task Components: Index Reporter: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: wikimedium.realtime.Standard.nd10M_dps.png, wikimedium.realtime.Standard.nd10M_dps_addDocuments.png, wikimedium.realtime.Standard.nd10M_dps_addDocuments_flush.png, wikimedium.trunk.Standard.nd10M_dps.png, wikimedium.trunk.Standard.nd10M_dps_addDocuments.png We should run indexing performance tests with the DWPT changes and compare to trunk. We need to test both single-threaded and multi-threaded performance. NOTE: flush by RAM isn't implemented just yet, so either we wait with the tests or flush by doc count. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Setting the max number of merge threads across IndexWriters
I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! I think I have seen this suggestion somewhere maybe you need to see if there is one already simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Setting the max number of merge threads across IndexWriters
Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! I think I have seen this suggestion somewhere maybe you need to see if there is one already simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Numerical ids for terms?
On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich gre...@arbylon.net wrote: Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level. As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core. Lucene index already provides term -> id mapping in some form. Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene. Lucene's Directory is a dumb abstraction for random-access named write-once byte streams. It doesn't add /any/ value over mmap. Any suggestions? *troll mode on* Use numpy/scipy? :) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
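For what the bijective term/ordinal mapping amounts to, here is a tiny self-contained sketch: given the (sorted, deduplicated) terms of a field pulled out of an index by whatever means, assign ordinals 0..numTerms()-1 and keep the reverse array for ordinal-to-term lookups. How the terms are actually extracted from a real Lucene index is left out on purpose.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermOrdinals {
  private final String[] ordToTerm;                 // ordinal -> term
  private final Map<String, Integer> termToOrd;     // term -> ordinal

  public TermOrdinals(SortedSet<String> terms) {
    ordToTerm = terms.toArray(new String[0]);       // TreeSet keeps terms sorted, so ordinals are stable
    termToOrd = new HashMap<String, Integer>();
    for (int ord = 0; ord < ordToTerm.length; ord++) {
      termToOrd.put(ordToTerm[ord], ord);
    }
  }

  public int ord(String term) { return termToOrd.get(term); }
  public String term(int ord) { return ordToTerm[ord]; }
  public int numTerms()       { return ordToTerm.length; }

  public static void main(String[] args) {
    SortedSet<String> terms = new TreeSet<String>();
    terms.add("apache"); terms.add("lucene"); terms.add("matrix");
    TermOrdinals ords = new TermOrdinals(terms);
    System.out.println(ords.ord("lucene") + " " + ords.term(2));  // prints: 1 matrix
  }
}
```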
An IDF variation with penalty for very rare terms
Excuse me for somewhat of an offtopic, but has anybody ever seen/used -subj- ? Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png Traditional log(N/x) tail, but when nearing zero freq, instead of going to +inf you do a nice round bump (with controlled height/location/sharpness) and drop down to -inf (or zero). Should be cool when doing cosine-measure (or something comparable)-based document comparisons (eg. in a more like this query, to mention Lucene at least once :) ), over dirty data. Rationale is that most good, discriminating terms are found in at least a certain percentage of your documents, but there are lots of mostly unique crapterms, which at some collection sizes stop being strictly unique and with IDF's help explode your scores. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
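One way to get roughly the curve described (ordinary log(N/df) for common-enough terms, a controlled bump, then a drop toward zero as the document frequency approaches 1) is to multiply the classic IDF by a logistic gate in df. This is only a guess at a formula matching the description, not the one behind the linked plot; d0 controls where the bump sits and s how sharp it is.

```java
public class IdfWithRarePenalty {
  // "IDF++": log(N/df) damped by a logistic gate, so ultra-rare terms stop
  // exploding the score and instead fall back toward zero.
  public static double idf(long docFreq, long numDocs, double d0, double s) {
    double classic = Math.log((double) numDocs / docFreq);
    double gate = 1.0 / (1.0 + Math.exp(-(docFreq - d0) / s));  // ~0 for df << d0, ~1 for df >> d0
    return classic * gate;
  }

  public static void main(String[] args) {
    long n = 10_000_000L;
    for (long df : new long[] {1, 3, 10, 30, 100, 1000, 100000}) {
      System.out.printf("df=%-7d idf=%.3f%n", df, idf(df, n, 20, 5));
    }
  }
}
```

With these (arbitrary) parameters the value climbs to a peak around df in the tens-to-hundreds range and decays like plain IDF after that, while df=1 terms score close to zero instead of dominating.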
Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : -1. These files should be readable, for maintaining, debugging and : knowing what's going on. Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. when it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (ie: the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...) Please take the time, just 5 or 10 minutes, to look thru some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at thai analyzer tests, if these were all numbers, how would you know wtf is going on? Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this, if *ALL* string constants (including English ones) are changed to be character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org I think having both raw characters /and/ encoded representation is the best? (one of them in comments) I'm all for unicode sources, but at least two things hit me repeatedly: 1. Tools do screw up, and you have to recover somehow. eg. IntelliJ IDEA's 'shelve' function uses platform default (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere. 2. There are characters that look all the same. E.g. different whitespace/dashes. Or, (if you have cyrillic in your fonts) I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from latin and cyrillic charsets (left latin/right cyrillic), but in 99% fonts they are visually identical. I had a filter that folded up similarly looking characters, and it was documented in exactly this way - raw char+code. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
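The "document the raw char plus its escape code" convention mentioned at the end might look like the snippet below: a tiny folding map collapsing a few Cyrillic letters onto their visually identical Latin twins, with each entry carrying both the raw character and its code point. The particular mapping is illustrative only, nowhere near a complete confusables list.

```java
import java.util.HashMap;
import java.util.Map;

public class HomoglyphFolder {
  // Each entry keeps the raw character and its code point in a comment, so the
  // source stays readable even if some tool mangles the non-ASCII characters.
  private static final Map<Character, Character> FOLD = new HashMap<Character, Character>();
  static {
    FOLD.put('\u0430', 'a'); // 'а' CYRILLIC SMALL LETTER A  -> LATIN 'a'
    FOLD.put('\u0441', 'c'); // 'с' CYRILLIC SMALL LETTER ES -> LATIN 'c'
    FOLD.put('\u0435', 'e'); // 'е' CYRILLIC SMALL LETTER IE -> LATIN 'e'
    FOLD.put('\u043E', 'o'); // 'о' CYRILLIC SMALL LETTER O  -> LATIN 'o'
  }

  public static String fold(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      Character mapped = FOLD.get(s.charAt(i));
      out.append(mapped != null ? mapped : s.charAt(i));
    }
    return out.toString();
  }
}
```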
Re: [POLL] JTS compile/test dependency
On Wed, Apr 6, 2011 at 22:43, Robert Muir rcm...@gmail.com wrote: On Wed, Apr 6, 2011 at 2:12 PM, Ryan McKinley ryan...@gmail.com wrote: Some may be following the thread on spatial development... here is a quick summary, and a poll to help decide what may be the best next move. I'm hoping to introduce a high level spatial API that can be used for a variety of indexing strategies and computational needs. For simple point in BBox and point in WGS84 radius, this does not require any external libraries. To support more complex queries -- point in polygon, complex geometry intersections, etc -- we need an LGPL library JTS. The LGPL dependency is only needed to compile/test, there is no runtime requirement for JTS. To enable the more complicated options you would need to add JTS to the classpath and perhaps set a environment variable. This is essentially what we are now doing with the (soon to be removed) bdb contrib. I am trying to figure out the best home for this code and development to live. I think it is essential for the JTS support to be part of the core build/test -- splitting it into a separate module that is tested elsewhere is not an option. This raises the basic question of if people are willing to have the LGPL build dependency as part of the main lucene build. I think it is, but am sympathetic to the idea that it might not be. I'm sorta confused about this (i'll probably offend someone here, but so be it) We have a contrib module for spatial that is experimental, people want to deprecate, and say has problems. Why must the super-expert-polygon stuff sit with the basic capability that probably most users want: the ability to do basic searches (probably in combination with text too) in their app? Its hard for me to tell, i hope the reason isn't elegance, but why aren't we working on making a simple,supported,80-20 case in lucene that non-spatial-gurus (and users) understand and can maintain... then it would seem ideal for the complex stuff to be outside of this project with any dependencies it wants? Users are probably really confused about the spatial situation: is it because we are floundering around this expert stuff Handling Unicode code points outside of BMP is highly expert stuff as well. And is totally unneeded by 80% of the users for any other reason except elegance. I think you two guys can really understand each other here : ) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [POLL] JTS compile/test dependency
On Thu, Apr 7, 2011 at 01:11, Robert Muir rcm...@gmail.com wrote: On Wed, Apr 6, 2011 at 5:07 PM, Earwin Burrfoot ear...@gmail.com wrote: Handling Unicode code points outside of BMP is highly expert stuff as well. And is totally unneeded by 80% of the users for any other reason except elegance. I think you two guys can really understand each other here : ) you are wrong: you either support unicode, or your application is buggy. Its not an optional feature, its the text standard used by the java programming language. You either handle the the Earth as a proper somewhat-ellipsoid, or your application is buggy. It's not an optional feature, it's even stronger than a standard - it is a physical fact experienced by all of us, earthlings. Though 80% of the users can throw geoids and unicode planes out of the window and live happily with some stupid local coordinate system and two-byte characters (some even manage with one-byte!). Yeah, they don't really care about being buggy in any geo/unicode-zealot's eyes. Having said that, it's cool that people like you two exist :) Because earth is round, maps are ugly, there are lots of different writing systems and someone has to deal with that. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs
[ https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014108#comment-13014108 ] Earwin Burrfoot commented on LUCENE-2981: - Bye-bye, DB. Few things can compete with it in pointlessness. Review and potentially remove unused/unsupported Contribs - Key: LUCENE-2981 URL: https://issues.apache.org/jira/browse/LUCENE-2981 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Fix For: 3.2, 4.0 Attachments: LUCENE-2981.patch Some of our contribs appear to be lacking for development/support or are missing tests. We should review whether they are even pertinent these days and potentially deprecate and remove them. One of the things we did in Mahout when bringing in Colt code was to mark all code that didn't have tests as @deprecated and then we removed the deprecation once tests were added. Those that didn't get tests added over about a 6 mos. period of time were removed. I would suggest taking a hard look at: ant db lucli swing (spatial should be gutted to some extent and moved to modules) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Urgent! Forgot to close IndexWriter after adding Documents to the index.
On Tue, Mar 22, 2011 at 06:21, Chris Hostetter hossman_luc...@fucit.org wrote: (replying to the dev list, see context below) : Unfortunately, you can't easily recover from this (except by : reindexing your docs again). : : Failing to call IW.commit() or IW.close() means no segments file was written... I know there were good reasons for eliminating the autoCommit functionality from IndexWriter, but threads like this make me think that even though autoCommit on flush/merge/whatever was bad, having an option for some sort of autoClose using a finalizer might be a good idea to give new/novice users a safety net. In the case of totally successful normal operation, this would result in one commit at GC (assuming the JVM calls the finalizer) and if there were any errors it should (if I understand correctly) do an implicit rollback. Anyone see a downside? Yes. Totally unexpected magical behaviour. What if I didn't commit something on purpose? ... : I had a program running for 2 days to build an index for around 160 million : text files, and after program ended, I tried searching the index and found : the index was not correctly built, *indexReader.numDocs()* returns 0. I : checked the index directory, it looked good, all the index data seemed to be : there, the directory is 1.5 Gigabytes in size. : : I checked my code and found that I forgot to call *indexWriter.optimize()* and : *indexWriter.close()*, I want to know if it is possible to : *re-optimize()* the index so I don't need to rebuild the whole index : from scratch? I don't : really want the program to take another 2 days. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
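Whichever way the finalizer debate goes, the safety net on the application side is just a try/finally around the indexing loop, so a long run always leaves a committed segments file behind. A minimal sketch, assuming an already-configured IndexWriter and some iterable of Documents; in this era of Lucene, close() commits as well, but being explicit makes the intent obvious.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SafeIndexing {
  public static void indexAll(IndexWriter writer, Iterable<Document> docs) throws IOException {
    try {
      long count = 0;
      for (Document doc : docs) {
        writer.addDocument(doc);
        if (++count % 1_000_000 == 0) {
          writer.commit();   // periodic commits: a crash loses at most one batch
        }
      }
      writer.commit();       // make the final state visible to readers
    } finally {
      writer.close();        // close() commits too, but don't rely on remembering that
    }
  }
}
```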
Re: IndexReader.indexExists declares throwing IOE, but never does
Technically, there's a big difference between I checked, and there was no index, and I was unable to check the disk because file system went BANG!. So the proper behaviour is to return false IOE (on proper occasion)? On Mon, Mar 21, 2011 at 13:53, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote: Can we remove the declaration? The method never throws IOE, but instead catches it and returns false. I think it's reasonable that such a method will not throw exceptions. +1 -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexReader.indexExists declares throwing IOE, but never does
2011/3/21 Shai Erera ser...@gmail.com: So the proper behaviour is to return false IOE (on proper occasion)? I don't object to it, as I think it's reasonable (as today we may be hiding some info from the app). However, given that today we never throw IOE, and that if we start doing so, we'll change runtime behavior, I lean towards keeping the method simple and remove the throws declaration. Well, it's either we change the impl to throw IOE, or remove the declaration altogether. Changing the impl to throw IOE on proper occasion might be problematic -- IndexNotFoundException is thrown when an empty index directory was given, however by its Javadocs, it can also indicate the index is corrupted. Perhaps the jdocs are wrong and it's thrown only if the index directory is empty, or no segments files are found. If that's the case, then we should change its javadocs. Otherwise, it will be difficult to know whether the INFE indicates an empty directory, for which you'll want to return false, or a corrupt index, for which you'll want to throw the exception. Besides, I consider this method almost like File.exists() which doesn't throw an exception. If indexExists() returns false, the app can decide to investigate further by trying to open IndexReader or read the SegmentInfos. But the API as-is needs to be simple IMO. File.exists() parallel is a good one. So, maybe, it's ok ) Otherwise please keep the throws declaration so that you won't break public APIs if this changes implementation. Removing the throws declaration doesn't break apps. In the worse case, they'll have a catch block which is redundant? Shai On Mon, Mar 21, 2011 at 4:12 PM, Sanne Grinovero sanne.grinov...@gmail.com wrote: 2011/3/21 Earwin Burrfoot ear...@gmail.com: Technically, there's a big difference between I checked, and there was no index, and I was unable to check the disk because file system went BANG!. So the proper behaviour is to return false IOE (on proper occasion)? +1 to throw the exception when proper to do so Otherwise please keep the throws declaration so that you won't break public APIs if this changes implementation. On Mon, Mar 21, 2011 at 13:53, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote: Can we remove the declaration? The method never throws IOE, but instead catches it and returns false. I think it's reasonable that such a method will not throw exceptions. +1 -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
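The File.exists()-style contract being argued for, written out: swallow "there is no index here" and answer false, but let a genuinely failing filesystem read propagate. `SegmentsReader.readLatestSegmentsFile` is a hypothetical stand-in, not the real SegmentInfos API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class IndexExistsSketch {
  /** Hypothetical stand-in for "read the newest segments_N file". */
  interface SegmentsReader {
    void readLatestSegmentsFile() throws IOException;  // FileNotFoundException when no index exists
  }

  // File.exists()-style contract: absence -> false, real I/O trouble -> exception.
  public static boolean indexExists(SegmentsReader reader) throws IOException {
    try {
      reader.readLatestSegmentsFile();
      return true;                      // "I checked, and there is an index"
    } catch (FileNotFoundException noIndex) {
      return false;                     // "I checked, and there was no index"
    }
    // any other IOException propagates: "couldn't check, the file system went BANG"
  }
}
```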
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007048#comment-13007048 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of purity. So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing. bq. Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly That's exactly how I'd like to see it. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007136#comment-13007136 ] Earwin Burrfoot commented on LUCENE-2960: - You avoid deprecation/undeprecation and binary incompatibility, while incompatibly changing semantics. What do you win? Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006759#comment-13006759 ] Earwin Burrfoot commented on LUCENE-2960: - bq. infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile You're absolutely right here. bq. Yet, no real Java impl out there will ever do this since doing so will simply make that Java impl appear buggy. Sorry, but real Java impls do this. The case with endless get() happened on a map that was never modified after being created and set. Just one of the many JVM instances on many machines got unlucky after restart. bq. Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things Java impls don't cache things. They do reorderings, they also keep final fields on registers, omitting reloads that happen for non-final ones, but no caching in JMM-related cases. Caching here is done by CPU, and it caches all data read from memory. bq. IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final. Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability. bq. How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW. What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
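The "setters return a modified immutable copy of 'this'" pattern mentioned above, sketched on a made-up two-field config rather than the real IndexWriterConfig:

```java
// Immutable config: every "setter" returns a new instance, so an object that
// captured a config at construction time can never be changed behind its back.
public final class WriterConfig {
  private final double ramBufferMB;
  private final int termIndexInterval;

  public WriterConfig(double ramBufferMB, int termIndexInterval) {
    this.ramBufferMB = ramBufferMB;
    this.termIndexInterval = termIndexInterval;
  }

  public WriterConfig withRamBufferMB(double mb) {
    return new WriterConfig(mb, this.termIndexInterval);
  }

  public WriterConfig withTermIndexInterval(int interval) {
    return new WriterConfig(this.ramBufferMB, interval);
  }

  public double ramBufferMB()    { return ramBufferMB; }
  public int termIndexInterval() { return termIndexInterval; }
}

// Usage still reads like incremental building:
// WriterConfig cfg = new WriterConfig(16.0, 128).withRamBufferMB(64.0).withTermIndexInterval(32);
```

The few extra allocations happen only when a config is built or rebuilt, which is rare over an application's lifetime.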
Re: GPU acceleration
On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote: To clarify, I've not yet written any code. I aim to bring a large speedup to any functionality that is computationally expensive. I'm wondering which components are candidates for this. I'll be looking through the code but if anyone is aware of parallelizable code, I'll start with that. More like 'vectorizable' code, huh? Guys from Yandex use modified group varint encoding plus handcrafted SSE magic to decode/intersect posting lists and claim tremendous speedups over original group varint. They also use SSE to run the decision trees used in ranking. There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. I'll basically replicate existing functionality to run on the gpu. On 12/03/11 21:08, Simon Willnauer wrote: On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org wrote: Hi, Is anyone looking at GPU acceleration for Solr? If not, I'd like to contribute code which adds this functionality. As I'm not familiar with the codebase, does anyone know which areas of functionality could benefit from high degrees of parallelism. Very interesting can you elaborate a little more what kind of functionality you exposed / try to expose to the GPU? simon Regards, Ken - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
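Since plain (unmodified) group varint came up: the idea is to batch four integers, spend one prefix byte on four 2-bit length codes, and then write only as many bytes per value as needed, so a decoder branches once per group instead of once per byte. A minimal encoder for a single group of four non-negative ints, without any of the SSE tricks mentioned:

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {
  // Output layout: 1 prefix byte (length-1 of value i stored in bits 2*i..2*i+1),
  // followed by 1-4 big-endian bytes per value.
  public static byte[] encodeGroup(int a, int b, int c, int d) {
    int[] values = {a, b, c, d};
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prefix = 0;
    byte[][] encoded = new byte[4][];
    for (int i = 0; i < 4; i++) {
      int len = bytesNeeded(values[i]);
      prefix |= (len - 1) << (2 * i);            // two bits per value
      encoded[i] = new byte[len];
      for (int j = 0; j < len; j++) {            // most significant byte first
        encoded[i][j] = (byte) (values[i] >>> (8 * (len - 1 - j)));
      }
    }
    out.write(prefix);
    for (byte[] e : encoded) out.write(e, 0, e.length);
    return out.toByteArray();
  }

  private static int bytesNeeded(int v) {
    if (v < (1 << 8))  return 1;
    if (v < (1 << 16)) return 2;
    if (v < (1 << 24)) return 3;
    return 4;
  }
}
```

Decoding reverses it: read the prefix, extract the four lengths, then read exactly that many bytes per value.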
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006227#comment-13006227 ] Earwin Burrfoot commented on LUCENE-2960: - {quote} Why such purity? What do we gain? I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom... {quote} We gain consistency and predictability. And there are a lot of freedoms dangerous for developers. {quote} In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once. {quote} You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. Very annoying one. I've thoroughly experienced it with previous major version of Apache HTTPClient - they had an API that suggested you can set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timeout at hundred times their intended duration. Oh! It was an occasional admin request changing the global. irony You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble, instead of current double, then we can use .set() on an instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! /irony bq. I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems. Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run off the mill Sun 1.6 JVM. So the bug is there, lying in wait. Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync. While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexWriter#setRAMBufferSizeMB removed in trunk
Is it really that hard to recreate IndexWriter if you have to change the settings?? Yeah, yeah, you lose all your precious reused buffers, and maybe there's a small indexing latency spike, when switching from old IW to new one, but people aren't changing their IW configs several times a second? I suggest banning as much runtime-mutable settings as humanely possible, and ask people to recreate objects for reconfiguration, be it IW, IR, Analyzers, whatnot. On Thu, Mar 10, 2011 at 23:07, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote: This should block the release: if IndexWriterConfig is a broken design then we need to revert this now before its released, not make users switch over and then undeprecate/revert in a future release. +1 I think we have to sort this out, one way or another, before releasing 3.1. I really don't like splitting setters across IWC vs IW. That'll just cause confusion, and noise over time as we change our minds about where things belong. Looking through IWC, it seems that most setters can be done live. In fact, setRAMBufferSizeMB is *almost* live: all places in IW that use this pull it from the config, except for DocumentsWriter. We could just push the config down to DW and have it pull live too... Other settings are not pulled live but for no good reason, eg termsIndexInterval is copied to a private field in IW but could just as easily be pulled when it's time to write a new segment... Maybe we should simply document which settings are live vs only take effect at init time? Mike -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005617#comment-13005617 ] Earwin Burrfoot commented on LUCENE-2960: - As I said on the list - if one needs to change IW config, he can always recreate IW with new settings. Such changes cannot happen often enough for recreation to affect indexing performance. The fact that you can change IW's behaviour post-construction by modifying unrelated IWC instance is frightening. IW should either make a private copy of IWC when constructing, or IWC should be made immutable. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexWriter#setRAMBufferSizeMB removed in trunk
Thanks for your support, but I don't think setInfoStream makes any sense either : ) Do we /change/ infoStreams for IW @runtime? Why can't we pass it as constructor argument/IWC field? Ok, just maybe, I can imagine a case, where a certain app runs happily, then misbehaves, and then you, with some clever trickery supply it a fresh infoStream, to capture the problem live, without restarting. So, just maybe, we should leave setInfoStream asis. 2011/3/11 Shai Erera ser...@gmail.com: I agree. After IWC, the only setter left in IW is setInfoStream which makes sense. But the rest ... assuming these config change don't happen very often, recreating IW doesn't sound like a big thing to me. The alternative of complicating IWC to support runtime changes -- we need to be absolutely sure it's worth it. Also, if the solution is to allow changing IWC (runtime) settings, then I don't think this issue should block 3.1? We can anyway add other runtime settings following 3.1, and we won't undeprecate anything. So maybe mark that issue as a non-blocker? Shai On Fri, Mar 11, 2011 at 2:20 PM, Earwin Burrfoot ear...@gmail.com wrote: Is it really that hard to recreate IndexWriter if you have to change the settings?? Yeah, yeah, you lose all your precious reused buffers, and maybe there's a small indexing latency spike, when switching from old IW to new one, but people aren't changing their IW configs several times a second? I suggest banning as much runtime-mutable settings as humanely possible, and ask people to recreate objects for reconfiguration, be it IW, IR, Analyzers, whatnot. On Thu, Mar 10, 2011 at 23:07, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote: This should block the release: if IndexWriterConfig is a broken design then we need to revert this now before its released, not make users switch over and then undeprecate/revert in a future release. +1 I think we have to sort this out, one way or another, before releasing 3.1. I really don't like splitting setters across IWC vs IW. That'll just cause confusion, and noise over time as we change our minds about where things belong. Looking through IWC, it seems that most setters can be done live. In fact, setRAMBufferSizeMB is *almost* live: all places in IW that use this pull it from the config, except for DocumentsWriter. We could just push the config down to DW and have it pull live too... Other settings are not pulled live but for no good reason, eg termsIndexInterval is copied to a private field in IW but could just as easily be pulled when it's time to write a new segment... Maybe we should simply document which settings are live vs only take effect at init time? Mike -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005891#comment-13005891 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit. Like I said, one isn't going to change his configuration five times a second. It's ok to commit from time to time? bq. So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere. Each and every live reconfigurable setting adds to complexity. At the very least it requires proper synchronization. Take your SegmentWarmer example - you should make the field volatile. While it's possible to chicken out on primitive fields ([except long/double|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]), as Yonik mentioned earlier, making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!). For more complex cases, certain on-change logic is required. And then you have to support this logic across all possible code rewrites and refactorings. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
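The unsafe-publication point in concrete form: a live-changeable reference needs at least volatile (or some other happens-before edge) so a reader thread can't observe a stale or half-published object. A generic sketch of such a "live setting" holder, not the actual infoStream field:

```java
import java.io.PrintStream;

public class LiveInfoStream {
  // Without volatile, a thread calling message() concurrently with setInfoStream()
  // may see a stale or unsafely published reference; volatile makes the write
  // happen-before subsequent reads.
  private volatile PrintStream infoStream;

  public void setInfoStream(PrintStream ps) {
    this.infoStream = ps;                    // safe publication of the new stream
  }

  public void message(String msg) {
    PrintStream local = infoStream;          // read the field once; it may be null
    if (local != null) {
      local.println(msg);
    }
  }
}
```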
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994769#comment-12994769 ] Earwin Burrfoot commented on LUCENE-2908: - Oh, damn :) On my project, we specifically use java-serialization to pass configured Queries/Filters between cluster nodes, as it saves us HEAPS of wrapping/unwrapping them into some parallel serializable classes. clean up serialization in the codebase -- Key: LUCENE-2908 URL: https://issues.apache.org/jira/browse/LUCENE-2908 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2908.patch We removed contrib/remote, but forgot to cleanup serialization hell everywhere. this is no longer needed, never really worked (e.g. across versions), and slows development (e.g. i wasted a long time debugging stupid serialization of Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [REINDEX] Note: re-indexing required !
Lucene maintains compatibility with earlier stable release index versions, and to some extent transparently upgrades them. But there is no guaranteed compatibility between different in-development indexes. E.g. 3.2 reads 3.1 indexes and upgrades them, but 3.2-dev-snapshot-10 (while happily handling 3.1) may fail reading 3.2-dev-snapshot-8 index, as they have the same version tag, yet different formats. On Sun, Jan 23, 2011 at 19:18, Earl Hood e...@earlhood.com wrote: On Sat, Jan 22, 2011 at 11:14 PM, Shai Erera ser...@gmail.com wrote: Under LUCENE-2720 the index format of both trunk and 3x has changed. You should re-index any indexes created with either of these code streams. Does the 3x refer to the 3.x development branch? I.e. For those of using the stable 3.x release of Lucene, will a future 3.x release require rebuilding indexes? --ewh - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2871) Use FileChannel in FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984222#action_12984222 ] Earwin Burrfoot commented on LUCENE-2871: - Before arguing where to put this new IndexOutput, I think it's wise to have a benchmark proving we need it at all. I have serious doubts FileChannel's going to outperform RAF.write(). Why should it? And for the purporses of benchmark it can be anywhere. Use FileChannel in FSDirectory -- Key: LUCENE-2871 URL: https://issues.apache.org/jira/browse/LUCENE-2871 Project: Lucene - Java Issue Type: New Feature Components: Store Reporter: Shay Banon Attachments: LUCENE-2871.patch, LUCENE-2871.patch Explore using FileChannel in FSDirectory to see if it improves write operations performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
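A rough way to get the benchmark being asked for, before arguing about placement: push the same amount of data through RandomAccessFile.write() and through FileChannel.write() and compare wall-clock time. Deliberately naive (no warmup discipline, no fsync, OS cache effects ignored), enough only for a first sanity check.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBench {
  static final int CHUNK = 1 << 16;          // 64 KB per write call
  static final int CHUNKS = 4096;            // ~256 MB total

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[CHUNK];

    File f1 = File.createTempFile("raf", ".bin");
    long t0 = System.nanoTime();
    RandomAccessFile raf = new RandomAccessFile(f1, "rw");
    for (int i = 0; i < CHUNKS; i++) raf.write(data);
    raf.close();
    System.out.println("RandomAccessFile.write: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

    File f2 = File.createTempFile("chan", ".bin");
    long t1 = System.nanoTime();
    RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
    FileChannel ch = raf2.getChannel();
    ByteBuffer buf = ByteBuffer.wrap(data);
    for (int i = 0; i < CHUNKS; i++) {
      buf.rewind();                          // reuse the same buffer each iteration
      while (buf.hasRemaining()) {
        ch.write(buf);                       // a channel may write less than requested
      }
    }
    ch.close();
    raf2.close();
    System.out.println("FileChannel.write:      " + (System.nanoTime() - t1) / 1_000_000 + " ms");

    f1.delete();
    f2.delete();
  }
}
```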
Re: Let's drop Maven Artifacts !
Somehow, they were made available since 2.0 - http://repo2.maven.org/maven2/org/apache/lucene/lucene-core/ The pom's are minimal, sans dependencies, so eg if your project depends on lucene-spellchecker, lucene-core won't be transitively included and your build is gonna fail (you therefore had to add dependency on the core to your project yourself). But they were enough to download and link jars/sources/javadocs. On Tue, Jan 18, 2011 at 12:40, Shai Erera ser...@gmail.com wrote: Out of curiosity, how did the Maven people integrate Lucene before we had Maven artifacts. To the best of my understanding, we never had proper Maven artifacts (Steve is working on that in LUCENE-2657). Shai On Tue, Jan 18, 2011 at 11:03 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Tue, Jan 18, 2011 at 9:33 AM, Thomas Koch tho...@koch.ro wrote: Hi, the developers list may not be the right place to find strong maven supporters. All developers know lucene from inside out and are perfectly fine to install lucene from whatever artifact. Those people using maven are your end users, that propably don't even subscribe to users@. big +1 for this comment! I have to admit that I am not a big maven fan and each time I have to use it its a pain in the ass but it is the de-facto standard for the majority of java projects on this planet so really there is not much of an option in my opinion. A project like lucene has to release maven artifacts even if its a pain. Simon Thomas Koch, http://www.koch.ro - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983160#action_12983160 ] Earwin Burrfoot commented on LUCENE-2657: - bq. we need to be very clear and it has no effect on artifacts I feel something was missed in the heat of debate. Eg: bq. The latest patch on this release uses the Ant artifacts directly. bq. This patch uses the Ant-produced artifacts to prepare for Maven artifact publishing. bq. Maven itself is not invoked in the process. An Ant plugin handles the artifact deployment. I will now try to decipher these quotes. It seems the patch takes the artifacts produced by Ant, as a part of our usual (and only) build process, and shoves it down Maven repository's throat along with a bunch of pom-descriptors. Nothing else is happening. Also, after everything that has been said, I think nobody in his right mind will *force* anyone to actually use the Ant target in question as a part of release. But it's nice to have it around, in case some user-friendly commiter would like to push (I'd like to reiterate - ant generated) artifacts into Maven. Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates. Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory: {code} mvn -N -Pbootstrap install {code} Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run: {code} mvn install {code} When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it. To create all the artifacts without running tests: {code} mvn -DskipTests install {code} I almost always include the {{clean}} phase when I do a build, e.g.: {code} mvn -DskipTests clean install {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983162#action_12983162 ] Earwin Burrfoot commented on LUCENE-2657: - Thanks, but I'm not the one confused here. : ) Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates. Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory: {code} mvn -N -Pbootstrap install {code} Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run: {code} mvn install {code} When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it. To create all the artifacts without running tests: {code} mvn -DskipTests install {code} I almost always include the {{clean}} phase when I do a build, e.g.: {code} mvn -DskipTests clean install {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
On Tue, Jan 18, 2011 at 17:00, Robert Muir rcm...@gmail.com wrote: On Tue, Jan 18, 2011 at 8:54 AM, Grant Ingersoll gsing...@apache.org wrote: It seems to me that if we have a fix for the things that ail our Maven support (Steve's work), then it isn't a reason for holding up a release and we should just keep them, as there are a significant number of users who consume Lucene that way (via the central repository). I agree that we should not switch our build system, but supporting the POMs is no different than supporting the IntelliJ/Eclipse generation tools (they are both problematic since they are not automated)

it's totally different in every way! we don't release the intellij/eclipse stuff, it's for internal use only. additionally, there are no release artifacts generated by these

Latest code from LUCENE-2657 does not generate any new artifacts. It uploads those you already have (built via ant) to the repo.

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
On Tue, Jan 18, 2011 at 20:13, Robert Muir rcm...@gmail.com wrote: Unfortunately there is a very loud minority that cares about maven I would wager that there is a sizable silent *majority* of users who literally depend on Lucene's Maven artifacts. I can't help but remind myself, this is the same argument Oracle offered up for the whole reason hudson debacle (http://hudson-labs.org/content/whos-driving-thing) Declaring that I have a secret pocket of users that want XYZ isn't open source consensus.

There is proof of existence for some unknown part of this secret pool: http://www.google.com/search?q=%22artifactid+lucene-core%22
Please, don't look at "About NNN results", these are known to be veeery approximate. Just page through. Some of the pages are Lucene poms themselves. Many of them are poms for the projects depending on lucene.

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982564#action_12982564 ] Earwin Burrfoot commented on LUCENE-2755: - bq. if you still want to work on it, the I can keep the issue open and mark it 3.2 (unless you want to give it a try in 3.1). I'll start another later, so please, go on. Some improvements to CMS Key: LUCENE-2755 URL: https://issues.apache.org/jira/browse/LUCENE-2755 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2755.patch While running optimize on a large index, I've noticed several things that got me to read CMS code more carefully, and find these issues: * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads taking merges from the IndexWriter until they are exhausted, and only then that blocked merge will run. I think it's unnecessary that that merge will be blocked. * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore, I think we should switch the default impl. There are two ways to make it extensible, if we want: ** Have an overridable member/method in CMS that you can extend and override - easy. ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs, calibrate deletes etc.). Better, but will need to tap into several places in the code, so more risky and complicated. On the go, I'd like to add some documentation to CMS - it's not very easy to read and follow. I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
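[Illustrative aside: the issue above suggests ordering pending merges by segment bytes rather than doc counts, possibly by making OneMerge comparable or letting the MergePolicy supply the order. The sketch below shows one way such an ordering could look; estimatedBytes() and pendingMerges are hypothetical placeholders, not actual Lucene APIs, and this is not the patch that was eventually committed.]
{code}
// Hedged sketch: order merges by an estimated byte size (smallest first), as the
// issue proposes. estimatedBytes(OneMerge) is a made-up helper.
Comparator<MergePolicy.OneMerge> byBytes = new Comparator<MergePolicy.OneMerge>() {
  public int compare(MergePolicy.OneMerge a, MergePolicy.OneMerge b) {
    long diff = estimatedBytes(a) - estimatedBytes(b);
    return diff < 0 ? -1 : diff > 0 ? 1 : 0;
  }
};
Collections.sort(pendingMerges, byBytes); // analogous to today's doc-count ordering, but bytes-based
{code}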
Re: Let's drop Maven Artifacts !
You're not alone. :) But, I bet, many more people would like to skip that step and have their artifacts downloaded from central.

On Mon, Jan 17, 2011 at 19:06, Steven A Rowe sar...@syr.edu wrote: On 1/17/2011 at 1:53 AM, Michael Busch wrote: I don't think any user needs the ability to run an ant target on Lucene's sources to produce maven artifacts

I want to be able to make modifications to the Lucene source, install Maven snapshot artifacts in my local repository, then depend on those snapshots from other projects. I doubt I'm alone. Steve

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
Maven is a defacto package/dependency manager for Java. Like it or not. All better tools out there, like Ant+Ivy, or SBT - support Maven repositories. Lots of people rely on Maven or better tools for their builds and as soon as you're on declarative dependency management train, it's a bother to just take a bunch of jars and stuff 'em into your project. Development tools (Eclipse/IDEA) support auto-downloading and attaching sources/javadocs for declared dependencies, and people use this. Well ... you raise interesting points. So if a committer would be willing to support GIT, RTC, and whatever (just making up scenarios), would we allow all of those to exist within Lucene? So, while having a wild contributor supporting .. dunno .. MacPorts package for Lucene is a bit crazy, and in the end - nobody will ever notice, supporting Maven broadens your audience and makes it happy (even those guys, who are not into Maven itself). I think the reasonable solution is to have a modules/maven package, with build.xml that generates whatever needs to be generated. Whoever cares about maven should run the proper Ant targets, just like whoever cares about Eclipse/IDEA can now run ant eclipse/idea. We'd have an ant maven. If that's what you intend doing in 2657 then fine. That should be some person amongst the committers, be it a part of default release process or not. I believe publishing Maven artefact is somewhat nontrivial for a person not related to the project in question. The release manager need not be concerned w/ Maven (or whatever) artifacts, they are not officially published anywhere, and everyone's happy. As long as all tests pass, the release is good to go. Is that better? Shai On Sun, Jan 16, 2011 at 8:05 PM, Steven A Rowe sar...@syr.edu wrote: -1 from me on dropping Maven artifacts. I find it curious that on the verge of fixing the broken Maven artifacts situation (LUCENE-2657), there is a big push for a divorce. Robert, I agree we should have a way to test the magic artifacts. I'm working on it. Your other objection is the work involved - you don't want to do it. I will do the work. We should not drop Maven support when there are committers willing to support it. I obviously count myself in that camp. Steve Robert Muir rcm...@gmail.com wrote: On Sun, Jan 16, 2011 at 12:03 PM, Shai Erera ser...@gmail.com wrote: Hey Wearing on my rebel hat today, I'd like to propose we drop maven support from our release process / build system. I've always read about the maven artifacts never being produced right, and never working (or maybe never is a too harsh word). I personally don't understand why we struggle to support Maven. I'm perfectly fine if we say that Lucene/Solr uses SVN, Ant and release a bunch of .jar files you can embed in your project. Who says we need to support Maven? And if so, why only Maven (I'm kidding !)? Are you with me? :) I am, the last time i suggested releasing 3.1, a 99-email thread about maven ensued that basically left me frustrated and not wanting to work towards a release. We still don't have a test-maven target that does even trivial verification of these magical artifacts that most of us don't understand... like any other functionality we have, we should have tests so that the release manager can verify things are working before the release. If we have a contrib thats unmaintained with no tests, would we let it block a release? I don't think we should let the maven problems hold lucene releases hostage. 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982437#action_12982437 ] Earwin Burrfoot commented on LUCENE-2374: - Nice. Except maybe introduce a simple interface instead of the Map<String, ?>?
{code}
interface AttributeReflector { // Name is crap, should be changed
  void reflect(String key, Object value);
}

void reflectWith(AttributeReflector reflector);
{code}
You have no need for fake maps then, both in toString(), and in user code.

Add introspection API to AttributeSource/AttributeImpl -- Key: LUCENE-2374 URL: https://issues.apache.org/jira/browse/LUCENE-2374 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1, 4.0

AttributeSource/TokenStream inspection in Solr needs to have some insight into the contents of AttributeImpls. As LUCENE-2302 has some problems with toString() [which is not structured and conflicts with CharSequence's definition for CharTermAttribute], I propose a simple API that gets a default implementation in AttributeImpl (just like toString() currently): - Iterator<Map.Entry<String,?>> AttributeImpl.contentsIterator() returns an iterator (for most attributes it's a singleton) of key-value pairs, e.g. term -> foobar, startOffset -> Integer.valueOf(0), ... - AttributeSource gets the same method, it just concats the iterators of each getAttributeImplsIterator() AttributeImpl. No backwards problems occur, as the default toString() method will work like before (it just gets the iterator and lists), but we simply remove the documentation for the format. (Char)TermAttribute gets a special impl of toString() according to CharSequence and a corresponding iterator. I also want to remove the abstract hashCode() and equals() methods from AttributeImpl, as they are not needed and just create work for the implementor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
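[Illustrative aside: the comment above proposes a callback-style reflector instead of returning fake maps. The sketch below shows how that callback might be produced and consumed; the attribute class, field names and the AttributeSource.reflectWith() call are illustrative assumptions, not the API that was actually committed.]
{code}
// Hypothetical usage sketch of the proposed callback API (names made up for illustration).
class MyTermAttributeImpl /* extends AttributeImpl */ {
  private String term;

  public void reflectWith(AttributeReflector reflector) {
    reflector.reflect("term", term); // push key/value pairs instead of building a Map
  }
}

// A reflector that simply collects everything, e.g. for toString() or Solr inspection:
final Map<String, Object> contents = new LinkedHashMap<String, Object>();
attributeSource.reflectWith(new AttributeReflector() {
  public void reflect(String key, Object value) {
    contents.put(key, value);
  }
});
{code}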
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982126#action_12982126 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers not writers? Voting with all my extremities - yes!! Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
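[Illustrative aside: a rough sketch of the atomic/composite split the issue describes. Class names follow the issue text; the method signatures are illustrative assumptions only, not the API that later shipped.]
{code}
// Hedged sketch of the proposed split - signatures are made up for illustration.
abstract class AtomicReader /* extends IndexReader */ {
  abstract Fields fields();        // postings/terms access lives only on atomic readers
  abstract Bits getDeletedDocs();
}

abstract class CompositeReader /* extends IndexReader */ {
  abstract List<AtomicReader> getSequentialSubReaders(); // essentially a collection of atomic readers
  // no fields()/terms() here - callers must wrap, e.g. with SlowMultiReaderWrapper
}
{code}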
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982132#action_12982132 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Still, i think we would need this method (somewhere) even with CSF, so that people can change the norms and they instantly take effect for searches. This still puzzles me. I can strain my imagination, and get people who just need to change norms without reindexing. But doing this and *requiring* instant turnaround? Kid me not :) Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982166#action_12982166 ] Earwin Burrfoot commented on LUCENE-2858: - APIs have to be there still. All that commity, segment-deletery, mutabley stuff (that spans both atomic and composite readers). So, while your plan is viable, it won't remove that much cruft. Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981774#action_12981774 ] Earwin Burrfoot commented on LUCENE-2868: - We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like:
{code}
class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanQuery.Clause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we ...
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) {
    // Runs for all SpanQueries ...
  }

  String visit(Query q) {
    // Runs for all Queries not covered by a more exact visit() method ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}
dispatch() checks its parameter's runtime type, picks the closest visit() overload (according to Java's rules for compile-time overloaded method resolution), and invokes it.

It should be easy to make use of TermState; rewritten queries should be shared automatically Key: LUCENE-2868 URL: https://issues.apache.org/jira/browse/LUCENE-2868 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Karl Wright Attachments: query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java: Query rewriteUsingCache(IndexReader indexReader) ... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
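[Illustrative aside: the comment above describes how dispatch() resolves the closest visit() overload by runtime type. Below is a much-simplified sketch of how such a reflection-based dispatcher could work; it walks only the superclass chain, assumes the visit() overloads are public, and is not the actual implementation being offered in the comment.]
{code}
import java.lang.reflect.Method;

// Simplified, hedged sketch of a reflection-based dispatcher. Interface-based
// resolution and caching of resolved methods are omitted for brevity.
public abstract class DispatchingVisitor<R> {
  @SuppressWarnings("unchecked")
  public R dispatch(Object node) {
    for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
      try {
        Method m = getClass().getMethod("visit", c); // requires public visit() overloads
        return (R) m.invoke(this, node);
      } catch (NoSuchMethodException e) {
        // no overload for this exact type - try the superclass next
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    throw new IllegalArgumentException("No visit() overload for " + node.getClass());
  }
}
{code}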
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981388#action_12981388 ] Earwin Burrfoot commented on LUCENE-2324: - Maan, this comment list is infinite. How do I currently get the ..er.. current version? Latest branch + latest Jason's patch? Regardless of everything else, I'd ask you not to extend random things :) at least if you can't say is-a about them. DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're better off encapsulating it rather than extending. Same can be applied to SegmentInfos that extends Vector :/ Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
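[Illustrative aside: the "encapsulate, don't extend" point above, shown as a small hedged sketch. The field and method names are made up; this is not the actual DocumentsWriterPerThreadPool code.]
{code}
import java.util.concurrent.locks.ReentrantLock;

// Instead of:  class ThreadState extends ReentrantLock { ... }
// prefer something like:
final class ThreadState {
  private final ReentrantLock lock = new ReentrantLock();

  void lock()       { lock.lock(); }
  void unlock()     { lock.unlock(); }
  boolean tryLock() { return lock.tryLock(); }
  // ... per-thread indexing state goes here, without exposing the full Lock API,
  // because a ThreadState is not a lock - it merely uses one.
}
{code}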
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980649#action_12980649 ] Earwin Burrfoot commented on LUCENE-2793: - What's with the ongoing craziness? :)
bq. DirectIOLinuxDirectory
First you introduce a kind of directory that is utterly useless except in certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to work around it by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place? Or making DIOLD wrap something. Yeah! Wrap my RAMDir!
bq. bufferSize
This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls. Sample code:
{code}
Directory ramDir = new RAMDirectory();
ramDir.createIndexInput(name, context); // See, ma? No bufferSizes, they are pointless for RAMDir

Directory fsDir = new NIOFSDirectory();
fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
fsDir.setBufferSize(IOContext.MERGE, 4096);
fsDir.createIndexInput(name, context)
// See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl
// All client code is only concerned with the context.
// It's NIOFSDirectory's business to give meaningful interpretation for IOContext and assign the buffer sizes.
{code}
You don't need custom Directory impls to make DIOLD work, you should freakin' fix it. The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. You don't need custom Directory impls to set buffer sizes (nor cast to BufferedIndexInput!), you should add the setting to these Directories, which make sense of it.

Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch

Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980732#action_12980732 ] Earwin Burrfoot commented on LUCENE-2793: - bq. Because in your example code above, it looks like it's added to Directory itself. bq. My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Ohoho. My fault, sorry. It should look like: {code} RAMDirectory ramDir = new RAMDirectory(); ramDir.setBufferSize(whatever) // Compilation error! ramDir.createIndexInput(name, context); NIOFSDirectory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) {code} Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980736#action_12980736 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. {quote} I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980388#action_12980388 ] Earwin Burrfoot commented on LUCENE-2858: - bq. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? There is a freakload of places that upgrade SegmentReader in various ways, with deletions guilty only for the part of the cases. I'll try getting back to LUCENE-2355 at the end of the week. Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980390#action_12980390 ] Earwin Burrfoot commented on LUCENE-2856: - A CompositeSegmentListener niftily removes the need for collection. Create IndexWriter event listener, specifically for merges -- Key: LUCENE-2856 URL: https://issues.apache.org/jira/browse/LUCENE-2856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 4.0 Reporter: Jason Rutherglen Attachments: LUCENE-2856.patch The issue will allow users to monitor merges occurring within IndexWriter using a callback notifier event listener. This can be used by external applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980400#action_12980400 ] Earwin Burrfoot commented on LUCENE-2793: - Looks crazy. In a -bad- tangled way. You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily. And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980448#action_12980448 ] Earwin Burrfoot commented on LUCENE-2856: - A SegmentListener that has a number of children SLs and delegates eventHappened() calls to them. Create IndexWriter event listener, specifically for merges -- Key: LUCENE-2856 URL: https://issues.apache.org/jira/browse/LUCENE-2856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 4.0 Reporter: Jason Rutherglen Attachments: LUCENE-2856.patch The issue will allow users to monitor merges occurring within IndexWriter using a callback notifier event listener. This can be used by external applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
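[Illustrative aside: the composite-listener idea described above, as a hedged sketch. The interface name and eventHappened() method are taken loosely from the comments; they are not the actual patch API.]
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hedged sketch: a listener that fans events out to its children, so IndexWriter
// only ever needs to hold a single listener reference instead of a collection.
interface SegmentListener {
  void eventHappened(String event);
}

class CompositeSegmentListener implements SegmentListener {
  private final List<SegmentListener> children = new CopyOnWriteArrayList<SegmentListener>();

  void add(SegmentListener l) { children.add(l); }

  public void eventHappened(String event) {
    for (SegmentListener l : children) {
      l.eventHappened(event); // just delegate to every child
    }
  }
}
{code}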
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980454#action_12980454 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} bq. You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. {quote} You lost me here. If you got A from B, you don't have to pass B again to invoke A, if you do - that's 99% a design mistake. But still, my point was that you don't need IOFactory at all. bq. Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations. What Directories are you trying to intermix? What for? I thought the only thing done in that issue is an attempt to give Directory hints as to why we're going to open its streams. A simple enum IOContext and extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into slightly more complex thing, with bufferSizes and whatnot. Still - no need for mixing Directories. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980458#action_12980458 ] Earwin Burrfoot commented on LUCENE-2793: - In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced. Even if we want to give user control over buffer size for all streams, or only those opened in specific IOContext, he can pass these numbers as config parameters to his Directory impl. That makes total sense, as: 1. IndexWriter/IndexReader couldn't care less about buffer sizes, it just passes them to the Directory. It's not their concern. 2. A bunch of Directories doesn't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific. So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
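[Illustrative aside: the "simple enum" proposal above, sketched out. This is illustrative only - the constant names, the bufferSizeFor() helper, and the signatures are assumptions, not the IOContext class that eventually landed in Lucene.]
{code}
// Hedged sketch of IOContext as a plain enum replacing the bufferSize parameter.
enum IOContext { READ, MERGE, FLUSH }

// Directory factory methods would then take the context instead of a buffer size, e.g.:
//   public abstract IndexInput openInput(String name, IOContext context) throws IOException;
//   public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;

// A concrete impl decides what the context means; e.g. a hypothetical buffered FSDirectory:
int bufferSizeFor(IOContext ctx) {
  return ctx == IOContext.MERGE ? 4096 : 1024; // merges read bigger sequential chunks
}
{code}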
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979522#action_12979522 ] Earwin Burrfoot commented on LUCENE-2312: - Some questions to align myself with impending reality. Is that right that future RT readers are no longer immutable snapshots (in a sense that they have variable maxDoc)? If it is so, are you keeping current NRT mode, with fast turnaround, yet immutable readers? Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: Realtime Branch Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: Realtime Branch Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch In order to offer user's near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Todays Lucene based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)
[ https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979888#action_12979888 ] Earwin Burrfoot commented on LUCENE-2474: - bq. Earwin's working on improving this, I think, under LUCENE-2355 I stalled, and then there were just so many changes under trunk, so I have to restart now :) Thanks for another kick. Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey) Key: LUCENE-2474 URL: https://issues.apache.org/jira/browse/LUCENE-2474 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shay Banon Attachments: LUCENE-2474.patch, LUCENE-2474.patch Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey). A spin of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, its make a lot of sense to cache things based on IndexReader#getFieldCacheKey, even Lucene itself uses it, for example, with the CachingWrapperFilter. FieldCache enjoys being called explicitly to purge its cache when possible (which is tricky to know from the outside, especially when using NRT - reader attack of the clones). The provided patch allows to plug a CacheEvictionListener which will be called when the cache should be purged for an IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
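[Illustrative aside: the eviction-listener idea from the issue, as a hedged sketch. The CacheEvictionListener interface and MyFilterCache class are made up for illustration; only the cache-key idea (IndexReader.getFieldCacheKey()) comes from the issue text.]
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: a custom cache keyed on the reader's cache key that purges
// eagerly when notified, instead of waiting for GC / weak references.
interface CacheEvictionListener {
  void onClose(Object coreCacheKey);
}

class MyFilterCache implements CacheEvictionListener {
  private final Map<Object, Object> cache = new ConcurrentHashMap<Object, Object>();

  Object get(IndexReader reader) {
    return cache.get(reader.getFieldCacheKey());
  }

  public void onClose(Object coreCacheKey) {
    cache.remove(coreCacheKey); // eager cleanup when the reader goes away
  }
}
{code}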
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979276#action_12979276 ] Earwin Burrfoot commented on LUCENE-2840: - bq. But doesn't that mean that an app w/ rare queries but each query is massive fails to use all available concurrency? Yes. But that's not my case. And likely not someone else's. I think if you want to be super-generic, it's better to defer exact threading to the user, instead of doing a one-size-fits-all solution. Else you risk conjuring another ConcurrentMergeScheduler. While we're at it, we can throw in some sample implementation, which can satisfy some of the users, but not everyone. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
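[Illustrative aside: "defer exact threading to the user" could look roughly like the sketch below - the caller owns the ExecutorService and decides how to split the work per segment. searchSegment(), merge(), subReaders, query and numHits are all hypothetical placeholders, not Lucene APIs.]
{code}
// Hedged sketch of a caller-side threading policy: one task per (large) segment.
ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<TopDocs>> parts = new ArrayList<Future<TopDocs>>();
for (final IndexReader segment : subReaders) {        // e.g. only the larger segments
  parts.add(pool.submit(new Callable<TopDocs>() {
    public TopDocs call() throws Exception {
      return searchSegment(segment, query, numHits);  // hypothetical per-segment search
    }
  }));
}
TopDocs merged = merge(parts);                        // hypothetical combination of partial results
{code}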
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979277#action_12979277 ] Earwin Burrfoot commented on LUCENE-2843: - And we're nearing a day when we keep the whole term dictionary in memory (as Sphinx does for instance). At that point a gazillion of term lookup-related hacks (like lookup cache) become obsolete :) Term dictionary itself can also be memory-mapped after this, instead of being read and built from disk, which makes new segment opening near-instantaneous. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
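[Illustrative aside: "memory-mapping the term dictionary" in the comment above essentially means establishing a mapping rather than reading and rebuilding structures on open. A minimal hedged illustration with plain java.nio follows; the file name is made up and this is not how any Lucene codec actually loads its terms index.]
{code}
// Hedged sketch: mapping an (imaginary) terms-index file read-only. Opening is
// near-instant because the OS pages the data in lazily and can share the cache.
RandomAccessFile raf = new RandomAccessFile("_1.tiv", "r");
try {
  MappedByteBuffer terms =
      raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
  // ... walk / binary-search the mapped image directly ...
} finally {
  raf.close();
}
{code}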
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979305#action_12979305 ] Earwin Burrfoot commented on LUCENE-2843: - As I said, there's already a search server with strictly in-memory (in mmap sense. it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage. My guess is that's because people with insane number of terms store various crap like unique timestamps as terms. With CSF (attributes in Sphinx lingo), and some nice filters that can work over CSF, there's no longer any need to stuff your timestamps in the same place you stuff your texts. That can be reflected in documentation, and then, suddenly, we can drop on-disk only support. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979306#action_12979306 ] Earwin Burrfoot commented on LUCENE-2840: - A lot of fork-join type frameworks don't even care. Even though scheduling threads is something people supposedly use them for. Why? I guess that's due to low yield/cost ratio. You frequently quote progress, not perfection in relation to the code, but why don't we apply this same principle to our threading guarantees? I don't want to use allowed concurrency fully. That's not realistic. I want 85% of it. That's already a huge leap ahead of single-threaded searches. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979346#action_12979346 ] Earwin Burrfoot commented on LUCENE-2843: - bq. I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok. I'm in no way advocating it as an all-round better solution. It has its wrinkles, just like anything else. My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users. They have a memory-resident term dictionary, and it works; I have heard no complaints regarding this, ever. bq. sphinx also requires mysql Have you read anything at all? It has an integration ready, for the layman user who just wants to stick fulltext search into their little app, but it is in no way reliant on it. Sphinx is a direct alternative to Solr. {quote} But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky - Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs). If indeed the terms dict (thanks to FSTs) becomes small enough to fit in RAM, then we should load it into RAM (and do away w/ the terms index). {quote} That's a bit delusional. If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files. I've seen this countless times. But then, you get a number of benefits - like sharing the filesystem cache when opening the same file multiple times, offloading things from the Java heap (which is almost always a good thing), and the fastest load-into-memory times possible. Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction :) Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
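For reference, the mmap-style setup being argued for in the comment above is roughly the following sketch; the index path is invented, and whether the terms dict actually stays resident is entirely up to the OS.

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.MMapDirectory;

class MMapOpen {
  public static void main(String[] args) throws Exception {
    // The index files live in the filesystem cache: multiple readers over the
    // same files share one copy of the cached pages, and nothing beyond the
    // usual Lucene structures is held on the Java heap.
    MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir, true /* readOnly */);
    System.out.println("maxDoc=" + reader.maxDoc());
    reader.close();
  }
}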
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979366#action_12979366 ] Earwin Burrfoot commented on LUCENE-2843: - bq. Nope, havent looked at their code... i think i stopped at the documentation when i saw how they analyzed text! All my points are contained within their documentation. No need to look at the code (it's as shady as Lucene's). In the same manner, Lucene had crappy analysis for years, until you took hold of the (Unicode) police baton. So let's not let color differences between our analyzers affect our judgement on other parts of ours : ) bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to. I'm questioning whether there is any legitimate, adequate reason to have that many terms. I agree on the mmap+32bit/mmap+windows point for a reasonable number of terms though :/ A hybrid solution, with the term dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote: The problem with a large start is probably worse when sharding is involved. Anyone know how the shard component goes about fetching start=100&rows=10 from say 10 shards? Does it have to merge sorted lists of 1mill+10 doc ids from each shard, which is the worst case? Yep, that's how it works today. Technically, if your docs have an unbiased (with regard to their sort value) distribution across shards, you can fetch far fewer than topN docs from each shard. I played with the idea, and it worked for me. Though later I dropped the optimization, as it complicated things somewhat and my users aren't querying gazillions of docs often. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
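A back-of-the-envelope sketch of the per-shard fetch reduction described above; the safety margin here is an arbitrary illustrative choice, not what Solr or any other system actually uses.

final class ShardFetchMath {
  // How many docs to request from each of `shards` shards when the client wants
  // rows [start, start+rows) of the merged result, assuming sort values are
  // spread evenly (unbiased) across shards. safetyFactor adds slack for skew.
  static int perShardFetch(int start, int rows, int shards, double safetyFactor) {
    int topN = start + rows;
    double expectedSharePerShard = (double) topN / shards;
    return (int) Math.ceil(expectedSharePerShard * safetyFactor) + rows;
  }

  public static void main(String[] args) {
    // e.g. start=1000000, rows=10, 10 shards, 1.2x slack:
    // 120012 docs requested per shard instead of 1000010.
    System.out.println(perShardFetch(1000000, 10, 10, 1.2));
  }
}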
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12976027#action_12976027 ] Earwin Burrfoot commented on LUCENE-2840: - I use the following scheme: * There is a fixed pool of threads shared by all searches, that limits total concurrency. * Each new search apprehends at most a fixed number of threads from this pool (say, 2-3 of 8 in my setup), * and these threads churn through segments as through a queue (in maxDoc order, but I think even that is unnecessary). No special smart binding between threads and segments (eg. 1 thread for each biggie, 1 thread for all of the small ones) - means simpler code, and zero possibility of stalling, when there are threads to run, segments to search, but binding policy does not connect them. Using fewer threads per-search than total available is a precaution against biggie searches blocking fast ones. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
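A rough Java sketch of the scheme described in the comment above; the pool sizes are invented, and each task stands in for "search one segment for this query".

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class PooledSearcher {
  private static final int POOL_SIZE = 8;           // total concurrency, shared by all searches
  private static final int THREADS_PER_SEARCH = 3;  // cap for any single search
  private final ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);

  <R> List<R> search(final List<Callable<R>> perSegmentWork) throws Exception {
    final AtomicInteger cursor = new AtomicInteger();  // segments act as a simple queue
    List<Future<List<R>>> workers = new ArrayList<Future<List<R>>>();
    for (int i = 0; i < THREADS_PER_SEARCH; i++) {
      workers.add(pool.submit(new Callable<List<R>>() {
        public List<R> call() throws Exception {
          List<R> partial = new ArrayList<R>();
          int s;
          while ((s = cursor.getAndIncrement()) < perSegmentWork.size()) {
            partial.add(perSegmentWork.get(s).call());   // churn through segments, no fixed binding
          }
          return partial;
        }
      }));
    }
    List<R> merged = new ArrayList<R>();
    for (Future<List<R>> w : workers) {
      merged.addAll(w.get());                            // merge per-worker partial results
    }
    return merged;
  }
}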
Re: strange problem of PForDelta decoder
until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea (I have posted about it before). Do you have some resources, such as papers or tech articles, about it? I have tried it, but it would need to modify the index format dramatically, and we use Solr distributed search to relieve the response-time problem, so I finally gave it up. Lucene 4's index format is more flexible in that it supports custom codecs, and it's now under development, so I think it's a good time to consider letting it support multithreaded searching for a single query. I have a naive solution: divide the docList into many groups, e.g. grouping docIds by whether they are even or odd: term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 Then we can use 2 threads to search the topN docs on the even group and the odd group, and finally merge their results into a single one, just like Solr distributed search. But it's better than Solr distributed search. First, it's in a single process, and data communication between threads is much faster than the network. Second, each thread processes the same number of documents. For Solr distributed search, one shard may process 7 documents and another shard may process 1 document. Even if we can make each shard have the same number of documents, we cannot make it uniform for each term. E.g. shard1 has doc1 doc2 and shard2 has doc3 doc4, but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4. We may change it so shard1 has doc1 doc3 and shard2 has doc2 doc4; that's good for term1 and term2, but term3 may occur in doc1 and doc3... So I think this is fine-grained distribution within the index, while Solr distributed search is coarse-grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bent on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it into even chunks, and then doing skipTo to position your iterators at the beginning of each chunk). The first approach is really easy to implement. The second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over a mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over the previous solution. Sharding posting lists requires a great deal of weird hacks, makes the index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
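For the second, hackier option mentioned above (chunking a single segment), the per-chunk loop might look roughly like this; PartialHits is a made-up result sink, and creating one Scorer per worker for the same query over the same segment is assumed to happen elsewhere.

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

// Hypothetical sink for one thread's partial results (not Lucene's Collector).
interface PartialHits {
  void hit(int doc, float score);
}

// One worker scores the doc-id range [chunkStart, chunkEnd) of a single segment.
final class ChunkScorer {
  static void scoreChunk(Scorer scorer, int chunkStart, int chunkEnd, PartialHits out)
      throws IOException {
    int doc = scorer.advance(chunkStart);   // the "skipTo" step: jump to the chunk's first doc
    while (doc != DocIdSetIterator.NO_MORE_DOCS && doc < chunkEnd) {
      out.hit(doc, scorer.score());
      doc = scorer.nextDoc();
    }
  }
}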
Re: is the classes ended with PerThread(*PerThread) multithread
There is a single index chain, with a single instance of each chain component, except those ending in -PerThread. Though that's gonna change with https://issues.apache.org/jira/browse/LUCENE-2324 On Tue, Dec 28, 2010 at 13:10, Simon Willnauer simon.willna...@googlemail.com wrote: On Tue, Dec 28, 2010 at 10:57 AM, xu cheng xcheng@gmail.com wrote: hi simon thanks for replying very much. after reading the source code with your suggestion, here's my understanding, and I don't know whether it's right: the DocumentsWriter actually doesn't create threads, but the code that uses DocumentsWriter can do the multithreading (say, several threads call updateDocument), and each thread has its DocumentsWriterThreadState; meanwhile, each DocumentsWriterThreadState has its own objects (the *PerThread ones such as DocFieldProcessorPerThread, DocInverterPerThread and so on). As the methods of DocumentsWriter are called by multiple threads, for example 4 threads, there are 4 DocumentsWriterThreadState objects and 4 index chains (each index chain has its own *PerThread objects to process the document). am I right?? that sounds about right simon thanks for replying again! 2010/12/28 Simon Willnauer simon.willna...@googlemail.com Hey there, so what you are looking at are classes that are created per thread rather than shared with other threads. Lucene internally rarely creates threads or subclasses Thread, Runnable or Callable (ParallelMultiSearcher is an exception, or some of the merging code). Yet, inside the indexer, when you add (update) a document Lucene utilizes the caller's thread rather than spawning a new one. When you look at DocumentsWriter.java there should be a method called getThreadState. Each indexing thread, let's say in updateDocument, gets its thread-private DocumentsWriterThreadState. This thread state holds a DocConsumerPerThread obtained from the DocumentsWriter's DocConsumer (see the indexing chain). DocConsumerPerThread in that case is some kind of decorator that holds other DocConsumerPerThread instances like TermsHashPerThread etc. The general pattern is: for each DocConsumer you can get a DocConsumerPerThread for your indexing thread, which then consumes the document you are processing right now. I hope that helps simon On Tue, Dec 28, 2010 at 4:19 AM, xu cheng xcheng@gmail.com wrote: hi all: I'm new to dev these days. I'm reading the source code in the index package and I was confused. there are classes with the suffix PerThread, such as DocFieldProcessorPerThread, DocInverterPerThread, TermsHashPerThread, FreqProxTermWriterPerThread. in this mailing-list, I was told that they are multithreaded. however, there are some difficulties for me to understand! I see no sign that they inherit from Thread, or implement Runnable, or something else?? how do they map to OS threads?? thanks ^_^ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
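A stripped-down illustration of the pattern Simon describes above; the class names are invented, and the real DocumentsWriter bounds and reuses its thread states rather than keeping a plain per-thread map as this sketch does.

import java.util.HashMap;
import java.util.Map;

// Nothing here extends Thread: the indexing thread that calls addDocument()
// simply gets its own per-thread state and does the work itself.
final class SharedConsumer {
  private final Map<Thread, PerThreadConsumer> states = new HashMap<Thread, PerThreadConsumer>();

  synchronized PerThreadConsumer getThreadState() {
    PerThreadConsumer state = states.get(Thread.currentThread());
    if (state == null) {                       // lazily create one state per calling thread
      state = new PerThreadConsumer();
      states.put(Thread.currentThread(), state);
    }
    return state;
  }

  void addDocument(Object doc) {
    getThreadState().processDocument(doc);     // all work happens on the caller's thread
  }
}

final class PerThreadConsumer {
  void processDocument(Object doc) {
    // invert fields, buffer postings, etc. -- private to this thread, so no locking needed
  }
}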
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974274#action_12974274 ] Earwin Burrfoot commented on LUCENE-2829: - Term lookup misses can be alleviated by a simple Bloom Filter. No caching misses required, helps both PK and near-PK queries. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
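A toy sketch of the Bloom-filter idea above; sizing, hash quality, and how it would be wired into the terms dictionary are all left open, and this is not any actual Lucene implementation.

import java.util.BitSet;

// Consulted before seeking the terms dict: if mightContain() returns false,
// the term is definitely absent from the segment and the seek is skipped.
final class TermBloomFilter {
  private final BitSet bits;
  private final int numBits;

  TermBloomFilter(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  void add(String term) {
    bits.set(index(term, 0x9E3779B9));
    bits.set(index(term, 0x85EBCA6B));
  }

  // false => definitely not present, skip the term-dict seek;
  // true  => maybe present, fall through to the normal lookup.
  boolean mightContain(String term) {
    return bits.get(index(term, 0x9E3779B9)) && bits.get(index(term, 0x85EBCA6B));
  }

  private int index(String term, int seed) {
    int h = term.hashCode() * seed;
    h ^= h >>> 16;                       // cheap mixing, good enough for a sketch
    return (h & 0x7FFFFFFF) % numBits;
  }
}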
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974350#action_12974350 ] Earwin Burrfoot commented on LUCENE-2829: - Nobody halts your progress, we're merely discussing. I, on the other hand, have a feeling that Lucene is overflowing with single incremental improvements aka hacks, as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything :) For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with reasonable lookup routine (eg. as FST). Having said that, I reiterate, I'm not here to stop you or turn this issue into something else. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: RT branch status
Cool! I'm getting to this on a weekend. On Tue, Dec 21, 2010 at 11:44, Michael Busch busch...@gmail.com wrote: After merging trunk into the RT branch it's finally compiling again and up-to-date. Several tests are failing now after the merge (43 out of 1427 are failing), which is not too surprising, because so many things have changed (segment-deletes, flush control, termsHash refactoring, removal of doc stores, etc). Especially IndexWriter and DocumentsWriter are in a somewhat messy state, but I wanted to share my current state, so I committed the merge. I'll try this week to understand the new changes (especially deletes) and make them work with the DWPT. The following areas need work: * deletes * thread-safety * error handling and aborting * flush-by-ram (LUCENE-2573) Also, some tests deadlock. Not surprisingly either, cause flushcontrol etc. introduce new synchronized blocks. Before the merge all tests were passing, except the ones testing flush-by-ram functionality. I'll keep working on getting the branch back into that state again soon. Help is definitely welcome! I'd love to get this branch ready so that we can merge it into trunk as soon as possible. As Mike's experiments show having DWPTs will not only be beneficial for RT search, but also increase indexing performance in general. Michael PS: Thanks for the patience! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Do we want 'nocommit' to fail the commit?
But. Er. What if we happen to have nocommit in a string, or in some docs, or as a name of variable? On Sat, Dec 18, 2010 at 12:47, Michael McCandless luc...@mikemccandless.com wrote: +1 this would be great :) Mike On Fri, Dec 17, 2010 at 10:45 PM, Shai Erera ser...@gmail.com wrote: Hi Out of curiosity, I searched if we can have a nocommit comment in the code fail the commit. As far as I see, we try to avoid accidental commits (of say debug messages) by putting a nocommit comment, but I don't know if svn ci would fail in the presence of such comment - I guess not because we've seen some accidental nocommits checked in already in the past. So I Googled around and found that if we have control of the svn repo, we can add a pre-commit hook that will check and fail the commit. Here is a nice article that explains how to add pre-commit hooks in general (http://wordaligned.org/articles/a-subversion-pre-commit-hook). I didn't try it yet (on our local svn instance), so I cannot say how well it works, but perhaps someone has experience with it ... So if this is interesting, and is doable for Lucene (say, open a JIRA issue for Infra?) I don't mind investigating it further and write the script (which can be as simple as 'grep the changed files and fail on the presence of nocommit string'). Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972764#action_12972764 ] Earwin Burrfoot commented on LUCENE-2818: - bq. Can abort() have a default impl in IndexOutput, such as close() followed by deleteFile() maybe? If so, then it won't break anything. It can't. To call deleteFile you need both a reference to papa-Directory and a name of the file this IO writes to. Abstract IO class has neither. If we add them, they have to be passed to a new constructor, and that's an API break ;) bq. Would abort() on Directory fit better? E.g., it can abort all currently open and modified files, instead of the caller calling abort() on each IndexOutput? Are you thinking of a case where a write failed, and the caller would call abort() immediately, instead of some higher-level code? If so, would rollback() be a better name? Oh, no, no. No way. I don't want to push someone else's responsibility on Directory. This abort() is merely a shortcut. Let's go with a usage example: Here's FieldsWriter.java with LUCENE-2814 applied (skipping irrelevant parts) - https://gist.github.com/746358 Now, the same, with abort() - https://gist.github.com/746367 abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
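To show the flavor of the simplification (illustrative only, not the linked gists): writeFields() below is a made-up stand-in for the real stored-fields writing, and flushWithAbort() assumes the proposed abort() method, so it does not compile against the current IndexOutput.

import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

abstract class FlushSketch {
  abstract void writeFields(IndexOutput out) throws IOException;

  // Without abort(): the caller must lug the Directory and file name around
  // purely so it can clean up after a failed write.
  void flushWithoutAbort(Directory dir, String fileName) throws IOException {
    IndexOutput out = dir.createOutput(fileName);
    boolean success = false;
    try {
      writeFields(out);
      success = true;
    } finally {
      if (success) {
        out.close();
      } else {
        try { out.close(); } catch (IOException ignored) {}
        try { dir.deleteFile(fileName); } catch (IOException ignored) {}
      }
    }
  }

  // With the proposed abort(): no Directory reference, no file-name bookkeeping.
  void flushWithAbort(IndexOutput out) throws IOException {
    boolean success = false;
    try {
      writeFields(out);
      success = true;
    } finally {
      if (success) {
        out.close();
      } else {
        out.abort();   // proposed: silently close, then delete the half-written file
      }
    }
  }
}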
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972765#action_12972765 ] Earwin Burrfoot commented on LUCENE-2818: - bq. I think we can make a default impl that simply closes and suppresses exceptions? (We can't .deleteFile since an abstract IO doesn't know its Dir). Our concrete impls can override w/ versions that do delete the file... I don't think we need a default impl? For some directory impls close() is a noop + what is more important, having an abstract method forces you to implement it, you can't forget this, so we're not gonna see broken directories that don't do abort() properly. abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
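One possible shape for the abstract-method approach, as sketch classes rather than the actual patch.

import java.io.File;
import java.io.IOException;

// abort() is abstract, so every Directory's output class must provide a real implementation.
abstract class SketchIndexOutput {
  public abstract void close() throws IOException;
  public abstract void abort();   // never throws: close quietly, then delete the file
  // ... the rest of the real IndexOutput API is elided ...
}

// A filesystem-backed output knows its own File, so no Directory reference is needed.
final class SketchFSIndexOutput extends SketchIndexOutput {
  private final File file;

  SketchFSIndexOutput(File file) {
    this.file = file;
  }

  @Override
  public void close() throws IOException {
    // flush buffers and close the underlying stream/channel here
  }

  @Override
  public void abort() {
    try {
      close();
    } catch (IOException ignored) {
      // swallowed on purpose: abort() is best-effort by design
    }
    file.delete();   // best-effort cleanup of the half-written file
  }
}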
[jira] Updated: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2818: Priority: Minor (was: Major) This change is really minor but, I think, convenient. You don't have to lug a reference to the Directory along, and recalculate the file name, if the only thing you want to say is that the write was a failure and you no longer need this file. abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Priority: Minor I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Synced to trunk. bq. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. Turns out checkpoint() is also no longer needed. I have no other lingering cleanup urges, this is ready to be committed. I think. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2818) abort() method for IndexOutput
abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. Closed files list was unused, and is silently gone. Open files list was used to: * prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter. ** no shared docstores, no need + IFD no longer requires a reference to DW * delete already opened docstore files, when aborting. ** index chain now handles this on its own + has cleaner error handling code. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LogMergePolicy.setUseCompoundFile/DocStore
Incoming LUCENE-2814 drops setUseCompoundDocStore() On Thu, Dec 16, 2010 at 12:04, Shai Erera ser...@gmail.com wrote: Hi I find it very annoying that I need to set true/false on these methods whenever I want to control compound files creation. Is it really necessary to allow writing doc stores in non compound files vs. the other index files in a compound file? Does somebody know if this feature is used somewhere? If it's crucial to keep the two methods, then how about introducing a setCompoundMode(true/false) to turn on/off both at once? IndexWriter used to have it, before we switched to IndexWriterConfig and I think it was very useful. Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org