What about the index writing efficiency of a large index?
Hi All, Has anyone done a benchmark to verify the index writing efficiency of Lucene? When the index size is larger than 10G, will writing be much slower than for smaller indexes? Actually I did some work on this issue, and I found that if I build small indexes first and then merge them all, the time taken declines significantly. Hi Yonik, how does Solr deal with this issue, or does it just leave this problem to Lucene? Thanks, Eric
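The merge step Eric describes maps naturally onto IndexWriter.addIndexes. A minimal sketch, assuming Lucene 2.3-era APIs; the directory paths are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeSmallIndexes {
        public static void main(String[] args) throws Exception {
            // Target index that will receive the separately built small indexes.
            Directory bigDir = FSDirectory.getDirectory("/indexes/big");
            IndexWriter writer = new IndexWriter(bigDir, new StandardAnalyzer(), true);
            // Merge the small indexes in; each was built with its own writer beforehand.
            writer.addIndexes(new Directory[] {
                FSDirectory.getDirectory("/indexes/small-0"),
                FSDirectory.getDirectory("/indexes/small-1")
            });
            writer.optimize();
            writer.close();
        }
    }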
Re: storing position - keyword
To confuse matters more, it is not really a matter of synonyms, as the original term is discarded from the index and there is only one mapped term per original term or phrase, and the algorithm determines the controlled meaning from the context.

I'm not sure I fully understand this: am I right in thinking that you will be searching using these controlled vocabulary words, and that the search must then find any of the ordinary words which map to the controlled vocabulary words, and highlight them? Because if that's the case, I think it's relatively simple: You create a separate index, which only maps the controlled vocabulary to the ordinary words. That's your synonyms index. Then, you index your target document as normal. When you search, you first look up your search term against the synonyms index. So, following your example, if you looked up dog in the synonyms index, you'd get back chien, canis and cane. (Achieving this part is easy: you just keep adding synonyms to the field at the same position.) Whether or not the returned list also contains the original dog is up to you when you create your synonyms index. (In a typical synonyms ring, the original word would have to be in there, because you don't know which word will be used to search.) Now all you have to do is combine those returned terms as Boolean OR clauses in a single BooleanQuery, and search on the main index. You'll find all documents containing any of those 3 words, and you can use the highlighting code from the Lucene contrib projects to highlight. Does this help? Forgive me if I've misunderstood or underestimated the problem! Regards, -John

1world1love wrote: First off Karl, thanks for your reply and your time. karl wettin-3 wrote: One could also say you are classifying your data based on keywords in the text? I probably didn't explain myself very well, or more specifically I didn't provide a good example. In my case, there really isn't any relationship between the mapped terms per document. That is to say that an individual term or phrase in the document is mapped to a concrete concept in a controlled vocabulary. The concept doesn't represent a class of anything and no relationship exists between the concepts. They would never be grouped by any means. It is more a matter of replacing some arbitrary word or phrase with an adjudicated version. The example I gave did in fact use classifications for the terms, but that is not exactly the point I was trying to convey. I suppose a better example would be where each term or phrase in the sentence mapped to an equivalent in another language: dog - canis, dog - cane, dog - chien. So that if you searched for canis, then any document with dog would be returned (unless the context inferred that dog meant something else). By the same token, if the text was here we go or let's go, then it may map to vamos or vamonos. To confuse matters more, it is not really a matter of synonyms, as the original term is discarded from the index and there is only one mapped term per original term or phrase and the algorithm determines the controlled meaning from the context. karl wettin-3 wrote: You can always store values in a field, but the term and the stored value are not coupled. Thus you would need to store the positions per document in each field in a machine-readable format you then parse: doc.add(new Field(f, "keyword:12,32;54,32", Field.Store.YES, ..)). But that is a way expensive solution. Indeed, though doesn't an analyzed field have some other information attached to it? Forgive me if this is a naive question.
I am fairly new to Lucene. karl wettin-3 wrote: This is known as faceted classification. http://en.wikipedia.org/wiki/Faceted_classification http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44 Again, I am not overly familiar with these disciplines, but I always thought of facets as an organizational strategy. As I said, my example betrayed me a bit, as I am not that interested in organizing these documents, rather in providing a controlled vocabulary from which to search as opposed to any random text. karl wettin-3 wrote: Are you aware of the highlighter contrib module? http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/ The simplest solution is to add a new facet Term per classification in the text and use the start and end positions of the text field, and have the highlighter load the text and highlight this text field. This is actually not a web-based application and the highlighting would really only be used for analyzing performance of the mapping algorithms. The main issue is that we do need to be able to provide the location of the original term for each mapped keyword. karl wettin-3 wrote: Matching a document with the same terms occurring multiple times will cause a greater score than it occurring only once. This is probably problematic for you. It may not be that big of an
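The lookup-then-OR flow John describes could be sketched as below. This is only an illustration: the index paths and the field names concept, word and contents are invented, and it uses the Lucene 2.3-era Hits/Hit API.

    import java.util.Iterator;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SynonymExpansionSearch {
        public static void main(String[] args) throws Exception {
            // 1. Look up the controlled-vocabulary term in the separate synonyms index.
            IndexSearcher synonyms = new IndexSearcher("/indexes/synonyms");
            Hits mappings = synonyms.search(new TermQuery(new Term("concept", "dog")));

            // 2. OR every mapped word into one BooleanQuery against the main index.
            BooleanQuery expanded = new BooleanQuery();
            for (Iterator it = mappings.iterator(); it.hasNext();) {
                Hit hit = (Hit) it.next();
                expanded.add(new TermQuery(new Term("contents", hit.get("word"))),
                             BooleanClause.Occur.SHOULD);
            }

            IndexSearcher main = new IndexSearcher("/indexes/main");
            Hits results = main.search(expanded);
            System.out.println(results.length() + " matching documents");
        }
    }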
Re: Boolean Query search performance
2008/3/6, Chris Hostetter [EMAIL PROTECTED]:

: If I do a query.toString(), both queries give different results, which
: is probably a clue (additional parens with the BooleanQuery)
:
: Query.toString the old way using queryParser:
: +(id:1^2.0 id:2 ... ) +type:CORE
:
: Query.toString the new way using BooleanQuery:
: +((id:1^2.0) (id:2) ... ) +type:CORE

i didn't look too closely at the pseudo code you posted, but the additional parens normally indicate that you are actually creating an extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for each term) ... but the rewrite method should optimize those away (even back in lucene 2.2) ... if you look at query.rewrite(reader).toString() then the queries *really* should be the same; if they aren't, then that may be your culprit.

Look here, parens will also be added if the query has a boost value other than 1.0:

    public String toString(String field) {
        StringBuffer buffer = new StringBuffer();
        boolean needParens = (getBoost() != 1.0) || (getMinimumNumberShouldMatch() > 0);
        if (needParens) {
            buffer.append("(");
        }
        for (int i = 0; i < clauses.size(); i++) {
            BooleanClause c = (BooleanClause) clauses.get(i);
            if (c.isProhibited())
                buffer.append("-");
            else if (c.isRequired())
                buffer.append("+");
            Query subQuery = c.getQuery();
            if (subQuery instanceof BooleanQuery) { // wrap sub-bools in parens
                buffer.append("(");
                buffer.append(c.getQuery().toString(field));
                buffer.append(")");
            } else
                buffer.append(c.getQuery().toString(field));
            if (i != clauses.size() - 1)
                buffer.append(" ");
        }
        if (needParens) {
            buffer.append(")");
        }
        if (getMinimumNumberShouldMatch() > 0) {
            buffer.append('~');
            buffer.append(getMinimumNumberShouldMatch());
        }
        if (getBoost() != 1.0f) {
            buffer.append(ToStringUtils.boost(getBoost()));
        }
        return buffer.toString();
    }

-Hoss
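A quick way to check Hoss's suggestion in a test harness, assuming query and reader are already in scope (a two-line sketch, not part of the original post):

    // After rewrite, single-clause BooleanQuery wrappers should be optimized away.
    System.out.println("raw:       " + query.toString());
    System.out.println("rewritten: " + query.rewrite(reader).toString());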
combine wildcard and phrase query
hey everybody, I'm wondering if it's possible to combine wildcards and phrase queries, for example "term1 term*". I know that the documentation says Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries), but maybe someone has had the same problem and found a solution. Thanks for your help, Jens Burkhardt
Swapping between indexes
This is my situation. I have an index, which has a lot of search requests coming into it. I use just a single instance of IndexSearcher to process these requests. At the same time, this index is also getting updated by an IndexWriter. And I want these new changes to be reflected _only_ at certain intervals. I have thought of a few ways of doing this. Each has its share of problems and pluses. I would be glad if someone can help me in figuring out the right approach, especially from the performance point of view, as the number of documents that will get indexed is pretty large.

Approach 1: Have just one copy of the index for both Search & Index. At time T, when I need to see the new changes reflected, I close the Searcher, and open it again.
- The re-open of the Searcher might be a bit slow (which I could probably solve by using some warm-up threads).
- Update and Search on the index at the same time - will this affect the performance?
- If the server crashes before time T, the new Searcher would reflect the changes, which is not acceptable. I want the changes to be reflected only at time T. If the server crashes, the index should be the previous T-1 index.
- Possible problems while optimising the index (as Search is also happening).
+ Just one copy of the index being stored.

Approach 2: Keep 2 copies of the index - 1 for Search, 1 for Index. At time T, I just switch the Searcher to a copy of the index that is being updated.
- Before I do the switch to the new index, I need to make a copy of it so that the updates continue to happen on the other index. Is there a convenient way to make this copy? Is it efficient?
- Time taken to create a new Searcher will still be a problem (but this is a problem in the previous approach as well, and we can live with it).
+ Optimise can happen on an index that is not being read; as a result, its resource requirements would be lower, and probably even the speed of optimisation would be better.
+ Faster search as the index update is happening on a different index.

So, these are the 2 approaches I am contemplating. Any pointers on which would be the better approach? Thanks, Sridhar
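One possible shape for the Approach 1 swap, as a sketch only: the holder class, field names, warm-up query and close handling are all invented, and closing the old searcher immediately is unsafe if requests still hold it.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SearcherHolder {
        private final String indexPath;
        private volatile IndexSearcher current;

        public SearcherHolder(String indexPath) throws Exception {
            this.indexPath = indexPath;
            this.current = new IndexSearcher(indexPath);
        }

        public IndexSearcher get() {
            return current;
        }

        // Called at time T: open a fresh searcher, warm it, then swap the reference.
        public void refresh() throws Exception {
            IndexSearcher fresh = new IndexSearcher(indexPath);
            fresh.search(new TermQuery(new Term("contents", "warmup"))); // warm-up query
            IndexSearcher old = current;
            current = fresh;
            old.close(); // simplistic; production code must wait for in-flight searches
        }
    }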
Re: combine wildcard and phrase query
okay, another problem occurred. I have different fields with the same name. I can't separate them by naming them field1, field2 etc., because while indexing I don't know how many fields I will need. For example, a book has several signature numbers; I want to save them in a field signature, and when I search for such a number I want the search to hit every single field and not all fields together. Right now I separate the string using a unique separator (in this case just $$$) so I can split the string into the numbers, but I think this is kind of the worst way of doing it. JensBurkhardt wrote: [...]
Re: Swapping between indexes
A simple variant on Approach 1 would be to open your writer with autoCommit=false. This way no reader will ever see the changes until you successfully close the writer. If the machine crashes, the index is still in the starting state as of when the writer was first opened. Also, the re-open of Approach 1 should be a bit (not a lot, though there is work underway to make it a lot) faster than the wholly new open required in Approach 2. There should not be problems optimizing while searching. Yes, you use more disk space, but no more (in fact, less) than Approach 2 requires. I think Approach 2 is only possibly better if the indexing would be done on a different computer / IO system. Mike

Sridhar Raman wrote: [...]
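A sketch of what Mike describes, assuming the Lucene 2.3 constructor that takes an autoCommit flag; the path and field are invented:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BatchedUpdate {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.getDirectory("/indexes/main");
            // autoCommit=false: readers see none of these changes until close() succeeds.
            IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer());
            Document doc = new Document();
            doc.add(new Field("contents", "some text", Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            // ... more updates ...
            writer.close(); // the single commit point; a crash before this leaves the old index intact
        }
    }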
Re: Swapping between indexes
Sridhar Raman wrote:
This way no reader will ever see the changes until you successfully close the writer. If the machine crashes the index is still in the starting state as of when the writer was first opened.
Ok, I have a slight doubt in this. Say I have gone ahead with Approach 1. If I have opened the writer with autoCommit=false, and the system crashes, does it mean that the changes made to IdxSrch are lost? If that is the case, that might be a problem. What I actually want is something like this. When the system crashes in between, the search continues to happen on the index at T0. But the updates that were done since T0 also need to be preserved. Would that happen if I set autoCommit to false? I realise that I want to have the cake and eat it too. But that's the problem we face if we keep just a single copy of the index.

Alas, you are right: all changes not committed are lost. I.e., on coming back up after the crash, you would have to re-index everything again. Lucene is actually not that far from doing what you're asking for here. I think the only thing missing is the ability to open a reader on a prior commit, rather than the latest one. If we added that, then you could make a custom deletion policy that would keep your T0 commit, as well as commits being done by your writer, and only remove them when you decide to switch your readers to the current commit. But realize that even with such a change to Lucene, you would still lose everything since the last commit when the machine crashes. Mike
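For reference, the deletion-policy half already exists (since 2.2); a policy that never deletes old commit points is tiny. A sketch; whether keeping every commit around is wise is a separate question:

    import java.util.List;
    import org.apache.lucene.index.IndexDeletionPolicy;

    // Keeps every commit point: both callbacks simply decline to delete anything.
    public class KeepAllDeletionPolicy implements IndexDeletionPolicy {
        public void onInit(List commits) {
            // keep all commits found when the writer opens
        }
        public void onCommit(List commits) {
            // keep every new commit as well
        }
    }

It would be passed to the IndexWriter constructor that accepts an IndexDeletionPolicy; what is still missing, as Mike says, is a way to open a reader on one of those older commits.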
Re: Swapping between indexes
This way no reader will ever see the changes until you successfully close the writer. If the machine crashes the index is still in the starting state as of when the writer was first opened.

Ok, I have a slight doubt in this. Say I have gone ahead with Approach 1. If I have opened the writer with autoCommit=false, and the system crashes, does it mean that the changes made to IdxSrch are lost? If that is the case, that might be a problem. What I actually want is something like this. When the system crashes in between, the search continues to happen on the index at T0. But the updates that were done since T0 also need to be preserved. Would that happen if I set autoCommit to false? I realise that I want to have the cake and eat it too. But that's the problem we face if we keep just a single copy of the index.

On Thu, Mar 6, 2008 at 4:58 PM, Michael McCandless [EMAIL PROTECTED] wrote: [...]
Re: combine wildcard and phrase query
No, as far as I know you can't combine wildcards in phrases. This would get extraordinarily ugly extraordinarily quickly. The way Lucene handles wildcards (conceptually) is to expand all the possible terms into a large OR clause. Say my index contains term1, term2, and term3. The search for term* really expands into term1 OR term2 OR term3. Now imagine the complexity of a phrase like dog* cat* hors*. Say your index contained 10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000 ORed phrase queries. And this is a tiny example.

You can try various approximations, and depending upon your index size they may or may not work. For instance, you could index all the successively shorter forms with position increments of 0 (see the synonym analyzer), i.e. index horse, hors$, hor$, ho$, h$ all in the same position. Then searching for hor* becomes searching for hor$ and it all just works. Of course this makes your index bigger.

About your second issue: I'm not clear what you're trying to accomplish. It's no problem to add the same field multiple times for a document. That is, you can call

    doc.add(new Field("field1", ...));
    doc.add(new Field("field1", ...));
    doc.add(new Field("field1", ...));
    doc.add(new Field("field1", ...));

as many times as you want before you add the document to the index. For retrieval you can call getFields("field1") and get an array of Fields back, one for each call to add above. You can also set the PositionIncrementGap while indexing to separate the term position of the first term of successive add() calls by, say, 100 (or whatever) if you need to worry about SpanNear or some such. This may be way off base. If so, could you give a concrete example of what your inputs are and how you want to search them? Best, Erick

On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt [EMAIL PROTECTED] wrote: [...]
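Erick's prefix trick can be packaged as a TokenFilter. A sketch against the Lucene 2.3-era TokenStream API; the $ terminator and the class name are arbitrary choices:

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // For "horse", also emits "hors$", "hor$", "ho$", "h$" at the same position,
    // so that the wildcard query hor* can be answered by the single term "hor$".
    public class PrefixExpansionFilter extends TokenFilter {
        private final LinkedList<Token> pending = new LinkedList<Token>();

        public PrefixExpansionFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (!pending.isEmpty()) {
                return pending.removeFirst();
            }
            Token token = input.next();
            if (token == null) {
                return null;
            }
            String text = token.termText();
            for (int len = text.length() - 1; len >= 1; len--) {
                Token prefix = new Token(text.substring(0, len) + "$",
                                         token.startOffset(), token.endOffset());
                prefix.setPositionIncrement(0); // same position as the full term
                pending.add(prefix);
            }
            return token;
        }
    }

At query time, a trailing wildcard like hor* would then be rewritten by your own code into the plain term hor$ before building the PhraseQuery.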
Re: What about the index writing efficiency of a large index?
On Thu, Mar 6, 2008 at 3:57 AM, Eric Th [EMAIL PROTECTED] wrote: Hi All, Has anyone done a benchmark to verify the index writing efficiency of Lucene? When the index size is larger than 10G, will it be much slower than smaller ones? Actually I did some work on this issue, and I found that if I build small indexes first and then merge them all, the time taken declines significantly.

You can increase the merge factor to get fewer merges, or you can use setMaxMergeDocs or setMaxMergeMB to prevent merging of any segments above a certain size and then call optimize at the end.

Hi Yonik, how does Solr deal with this issue, does it just leave this problem to Lucene?

Solr leaves it to Lucene. -Yonik
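Yonik's suggestions, sketched against the Lucene 2.3 API; the sizes and the path are placeholder values:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class TunedWriter {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/indexes/big"), new StandardAnalyzer(), true);
            LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
            policy.setMergeFactor(30);   // higher merge factor: fewer, larger merges
            policy.setMaxMergeMB(512.0); // segments already above this size are left alone
            writer.setMergePolicy(policy);
            // ... addDocument calls ...
            writer.optimize();           // one big merge at the very end
            writer.close();
        }
    }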
Re: Swapping between indexes
On Thu, Mar 6, 2008 at 8:02 AM, Sridhar Raman [EMAIL PROTECTED] wrote: If I have opened the writer with autoCommit=false, and the system crashes, does it mean that the changes made to IdxSrch are lost?

Since Lucene buffers in memory, you will always have the risk of losing recently added documents that haven't been flushed yet. Committing on every document would be too slow to be practical. -Yonik
MultiSearcher to overcome the Integer.MAX_VALUE limit
Hey Guys, just a quick question to confirm an assumption I have. Is it correct that I can have around 100 indexes, each at its Integer.MAX_VALUE limit of documents, but can happily search them all with a MultiSearcher as long as the combined returned hits don't add up to Integer.MAX_VALUE themselves? Kind regards, Ray.
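For concreteness, the wiring being asked about looks roughly like this; the paths, field name and query term are invented:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = new Searchable[] {
                    new IndexSearcher("/indexes/shard-0"),
                    new IndexSearcher("/indexes/shard-1")
                    // ... more shards ...
            };
            MultiSearcher searcher = new MultiSearcher(shards);
            Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
            System.out.println(hits.length()); // an int, which is where the MAX_VALUE question comes from
            searcher.close();
        }
    }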
Re: combine wildcard and phrase query
okay, thanks. The first thing was what I expected :-). Well, about my second issue, I was totally wrong. Just forget what I said! I had in mind that if I have several fields with the same name, these fields are concatenated into one big string. Now as I read your message I remember that this behavior is just with boost values ;-) and not with field values. hell no ... thanks for your answers :-). You saved me a lot of time ;-) best regards Jens

Erick Erickson wrote: [...]
RE: Swapping between indexes
Since Lucene buffers in memory, you will always have the risk of losing recently added documents that haven't been flushed yet. Committing on every document would be too slow to be practical.

Well, it is not sooo slow... I have indexed 10,000 docs, resulting in a 14 MB index. The index has 2 stored fields and the tokenized content field. With a commit after every add: 30 min. With a commit after every 100 adds: 23 min. Only one commit: 20 min. (Including time to get the document from the archive.) I use Lucene 2.3, so a commit is a combination of closing and re-creating the writer. 2.4/3.0 has a commit method which may be faster. Before this test I thought it would be much slower than 30 min... So one has to decide if correctness is more important than performance. I use a batch size of 100, first committing Lucene, then committing the database which holds the status of whether a document is already indexed or not. If the db commit fails it is no problem, because my app does not care about multiply indexed documents. But until now neither the Lucene nor the db commit has ever failed...
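The batch-of-100 loop described above, sketched for Lucene 2.3 where a commit means closing and reopening the writer; the database call is a stand-in and the index is assumed to exist already:

    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class BatchIndexer {
        public static void indexAll(Directory dir, List<Document> docs) throws Exception {
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            int count = 0;
            for (Document doc : docs) {
                writer.addDocument(doc);
                if (++count % 100 == 0) {
                    writer.close();          // the Lucene "commit" first...
                    markBatchIndexed(count); // ...then record progress in the database
                    writer = new IndexWriter(dir, new StandardAnalyzer(), false);
                }
            }
            writer.close();
        }

        private static void markBatchIndexed(int upTo) {
            // stand-in for the database status update described in the post
        }
    }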
RE: Swapping between indexes
With a commit after every add: 30 min. With a commit after every 100 adds: 23 min. Only one commit: 20 min.

All of these times look pretty slow... perhaps lucene is not the bottleneck here?

Therefore I wrote: (including time to get the document from the archive). Not the absolute times are important; the differences are. They only occur due to the different batch sizes. I think it is a real-world scenario, because one always has to read the docs from somewhere and often has to store the index state somewhere else. A test with docs created in memory and no state in a database would of course have completely different results.
RE: Boolean Query search performance
Thanks for all the replies. Today when I printed out the query that's generated, it does not have the extra parens. And query.rewrite(reader).toString() now gives the same result as query.toString(). All I can figure is I must have changed something between starting the email and sending it out. The other oddity is that the performance degradation is not as apparent. I'm wondering if part of the problem is generating consistent data for comparing search performance. I do a warmup before actually running the test, but maybe it's not a good enough way to test. One additional thing - from an earlier suggestion - is it possible to add multiple terms per BooleanClause? I tried using TermQuery.combine() to add an array of them into one query and making a clause from that, but there was no difference in performance. Brian
Help with Fuzzy Queries
Hi, I am new to Lucene. I don't understand how Lucene works in some cases. For example: If I have an index with the following three entries:
- ATUAÇÃO FALHA DE DISJUNTOR
- RESET DE FALHA DE DISJUNTOR
- FALHA DE COMANDO
When I try to look for something similar to FALHA DE DISJUNTOR, I get the following results:
Result | score
FALHA DE COMANDO | 0.9277342
ATUAÇÃO FALHA DE DISJUNTOR | 0.8880876
RESET DE FALHA DE DISJUNTOR | 0.5709133
If you pay attention, FALHA DE COMANDO is coming before ATUAÇÃO FALHA DE DISJUNTOR, but that is not what I would like. For my client, the first result should be ATUAÇÃO FALHA DE DISJUNTOR. Another example happens when I try to look for FALHA DISJUNTOR. To my surprise, the unique result is FALHA DE COMANDO. Like the other example, I was expecting ATUAÇÃO FALHA DE DISJUNTOR as the first or unique result. What should I do in order to have ATUAÇÃO FALHA DE DISJUNTOR as the first (or unique) result in the explained cases? I am attaching the code at the end of this mail. Thanks in advance, Eloi

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    public class TestLucene {

        private static final String FILENAME = "test.idx";
        private static final String FIELD = "FIELD";

        private static List<Document> getDocs() {
            List<Document> docs = new ArrayList<Document>();
            docs.add(toDoc("Atuação Falha de Disjuntor"));
            docs.add(toDoc("Reset de Falha de Disjuntor"));
            docs.add(toDoc("Falha de Comando"));
            return docs;
        }

        private static Document toDoc(String text) {
            Document doc = new Document();
            doc.add(new Field(FIELD, text.toUpperCase(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }

        private static void createIndex(Collection docs) throws IOException {
            IndexWriter writer = new IndexWriter(FILENAME, new StandardAnalyzer(), true);
            Iterator itDocs = docs.iterator();
            while (itDocs.hasNext()) {
                Document doc = (Document) itDocs.next();
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
        }

        public static void main(String[] args) throws IOException {
            createIndex(getDocs());
            IndexSearcher indexSearcher = new IndexSearcher(FILENAME);
            Hits hits = indexSearcher.search(
                    new FuzzyQuery(new Term(FIELD, "Falha de Disjuntor".toUpperCase()), 0.4f));
            System.out.println(hits.length());
            Iterator it = hits.iterator();
            while (it.hasNext()) {
                Hit hit = (Hit) it.next();
                System.out.println(hit.get(FIELD) + " " + hit.getScore());
            }
        }
    }
Re: Swapping between indexes
Sridhar, we have been using Approach 2 in our production system with good results. We have separate processes for indexing and searching. The main issue that came up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of our production problems occur during indexing, and we are able to fix these without having to interrupt searching at all. This has been a real benefit. Peter

On Thu, Mar 6, 2008 at 5:30 AM, Sridhar Raman [EMAIL PROTECTED] wrote: [...]
Re: MultiSearcher to overcome the Integer.MAX_VALUE limit
Well, I'm not sure. But any index, even one split amongst many nodes, is going to have some interesting performance characteristics if you have over 2 billion documents. So I'm not sure it matters <G>... What problem are you really trying to solve? You'll probably get more meaningful answers if you tell us what that is. Best, Erick

On Thu, Mar 6, 2008 at 10:23 AM, Ray [EMAIL PROTECTED] wrote: [...]
Re: MultiSearcher to overcome the Integer.MAX_VALUE limit
Thanks for your answer. Well, I want to search around 6 billion documents. Most of them are very small, but I am confident I will be hitting that number in the long run. I am currently running a small random-text indexer at 400 docs/second. It will reach 2 billion in around 45 days. I really hope all of you who are saying 2 billion docs will bring Lucene to its knees are wrong... Ray.

----- Original Message ----- From: Erick Erickson [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, March 06, 2008 10:40 PM Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit [...]
How to create segments files?
Ladies and Gentlemen: Below is an exception and the source code that generates it:

ERROR opening the Index - contact sysadmin! Error message: no segments* file found in org.apache.lucene.store.FSDirectory@/home/hdiwan/public_html/Q4D: files: Stack trace follows...
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:587)
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
org.apache.jsp.results_jsp._jspService(results_jsp.java:130)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:337)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
java.lang.Thread.run(Thread.java:619)

-- Source code follows --

    import java.io.File;
    import java.io.FilenameFilter;
    import java.io.IOException;
    import java.net.URLDecoder;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Date;
    import java.util.HashSet;
    import java.util.Vector;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.store.FSDirectory;

    public class Parser implements Runnable, Comparator {
        String path;

        public Parser(String string) {
            path = string;
        }

        public void run() {
            IndexWriter writer = null;
            Directory directory = null;
            try {
                directory = FSDirectory.getDirectory(this.path);
            } catch (IOException e) {
                System.err.println(e.getStackTrace());
            }
            try {
                writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (LockObtainFailedException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            Document doc = new Document();
            File image = null;
            File file = null;
            for (File f : this.listFiles(new File(path))) {
                if (f.getAbsolutePath().endsWith("xml") || f.getAbsolutePath().endsWith("q4d")) {
                    System.err.println("Q4D file found!");
                    file = f;
                } else {
                    image = f;
                }
                if ((file != null) && (image != null))
                    break;
            }
            Date lastModified = new Date(file.lastModified());
            System.err.println("Found a file and its corresponding image!");
            String imageName = image.getName();
            String filename = file.getName();
            String lastModifiedDownToSecond = DateTools.dateToString(lastModified, DateTools.Resolution.SECOND);
            System.err.println("the time the file was last modified was " + lastModifiedDownToSecond);
            String author = System.getProperty("author");
            String source = System.getProperty("source");