Lucene 1.3 final to 1.4 final problem
Hey Dev Guys,

Apologies, I have a quick problem: the number of hits on a set of documents indexed with 1.3-final is not the same as on 1.4-final. [The only modification done to the source is that I upgraded my CustomAnalyzer, which is based on the StopAnalyzer available in 1.4.] Does doing this affect the results? Somebody please explain.

with regards
Karthik

-Original Message-
From: Alex Aw Seat Kiong [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 9:50 AM
To: Lucene Users List
Subject: upgrade from Lucene 1.3 final to 1.4rc3 problem

Hi!

I'm using Lucene 1.3 final currently and everything was working fine. But after I upgraded from Lucene 1.3 final to 1.4rc3 (simply replacing the old jar with lucene-1.4-rc3.jar and recompiling), everything compiles successfully, but when we try to index a document it gives the error below:

java.lang.NullPointerException
  at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:146)
  at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:126)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:173)

What's wrong? Please help. Thanks.

Regards,
Alex
Lucene 1.3 final to 1.4 final problem
Hey Dev Guys,

Apologies, can somebody explain to me why StopAnalyzer.java returns the following for these input words:

TA ==> [ta] instead of [TA]
$125.96 ==> [125.96] instead of [$125.96]

Is there something I have been missing?

with regards
Karthik

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 11:59 AM
To: Lucene Users List
Subject: Lucene 1.3 final to 1.4 final problem

[snip]
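[Editor's aside: StopAnalyzer builds on LowerCaseTokenizer, which lowercases and keeps only letters, so the lowercase [ta] is expected; digits and "$" are discarded entirely by the stock StopAnalyzer, so if [125.96] is coming through, that is presumably the CustomAnalyzer's tokenizer at work. A minimal sketch for dumping what an analyzer emits, against the 1.4 API (the field name "f" is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    // Run StopAnalyzer over some input and print each token it emits.
    TokenStream ts = new StopAnalyzer().tokenStream("f", new StringReader("TA $125.96"));
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println("[" + t.termText() + "]");
    }
  }
}

Swapping in your CustomAnalyzer in place of StopAnalyzer shows exactly where the 1.3-vs-1.4 hit counts diverge.]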
Re: boolean operators and score
If I do it by sorting the input before sending it to Lucene, it could become unmanageable and could also produce unexpected results for the user. For example, if I type:

winston churchill and world war and germany

I could split the string on "and" and get the sorted string (churchill winston) and (germany) and (war world). This would obviously make the hits score throw up unexpected results. Isn't there another solution that comes from Lucene itself? I am using 1.4 final.

Regards,
Niraj
Re: boolean operators and score
Niraj Alok wrote:
> Hi Guys,
>
> Finally I have sorted out the problem of hits scores, thanks to the great help of Franck. I have hit another problem with the boolean operators now. When I search for "winston and churchill" I get a set of perfectly acceptable results. But when I change the order to "churchill and winston", the results are the same but the order of the results changes.
>
> Is it possible to have the same order (hits score) irrespective of which term is given before or after?
>
> Regards,
> Niraj

I don't think it is interpreted as the same request. As you may know, the terms of a boolean query have a 'required' flag. As I read it, your request 'winston and churchill' is interpreted as 'winston (not required)' and 'churchill (required)', whereas your request 'churchill and winston' is interpreted as 'churchill (not required)' and 'winston (required)'. I think you'd rather search for '+winston +churchill' (which should be the same as '+churchill +winston') to have both terms required.

Franck

--
Franck Brisbart
R&D
http://www.kelkoo.com
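[Editor's aside: if the query is built programmatically, both terms can be marked required regardless of the order they appear in. A minimal sketch against the 1.4 BooleanQuery API (the "contents" field name is assumed for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class RequiredTerms {
  public static void main(String[] args) {
    // Both clauses required (+winston +churchill); clause order does not
    // change which documents match or how they score.
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("contents", "winston")), true, false);
    bq.add(new TermQuery(new Term("contents", "churchill")), true, false);
    System.out.println(bq.toString("contents")); // prints: +winston +churchill
  }
}

The boolean arguments to add() are (required, prohibited), which is the 'required' flag Franck describes.]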
Re: indexing help
Hi John,

The source code is available from CVS; make it non-final and do what you need to do. Of course, you may have a hard time finding help later if you aren't using something everyone else is and your solution doesn't work... :-)

If I understand correctly what you are trying to do, you already know all of the answers for indexing; you just want Lucene to do the retrieval side of the coin, correct? I suppose a crazy idea might be to write a program that took your info and output it in the Lucene file format, but that seems a bit like overkill.

-Grant

[EMAIL PROTECTED] 07/07/04 07:37PM

Hi Doug:

Thanks for the response!

The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances.

The reason I was looking at deriving from the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final...

Any other suggestions?

Thanks
-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote:
> While lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is to create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene", and pass it to lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code, but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution. For example:

public class VectorTokenStream extends TokenStream {
  private String[] terms;
  private int[] freqs;
  private int term = -1;  // start before the first term
  private int freq = 0;

  public VectorTokenStream(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }

  public Token next() {
    if (freq == 0) {            // current term exhausted; advance
      term++;
      if (term >= terms.length)
        return null;
      freq = freqs[term];
    }
    freq--;
    return new Token(terms[term], 0, 0);
  }
}

Document doc = new Document();
doc.add(Field.Text("content", ""));
indexWriter.addDocument(doc, new Analyzer() {
  public TokenStream tokenStream(String field, Reader reader) {
    return new VectorTokenStream(new String[] {"java", "lucene"},
                                 new int[] {5, 6});
  }
});

> Too bad the Field class is final, otherwise I could derive from it and do something along those lines...

Extending Field would not help. That's why it's final.

Doug
Re: boolean operators and score
What could actually be done is perhaps to sort the search results by document id. Of course your relevancy will be all shot, but at least you would have control over the sort order.

At 09:05 AM 07/07/2004, you wrote:
> Hi Guys,
>
> Finally I have sorted out the problem of hits scores thanks to the great help of Franck. I have hit another problem with the boolean operators now. When I search for "winston and churchill" I get a set of perfectly acceptable results. But when I change the order to "churchill and winston", the results are the same but the order of the results changes.
>
> Is it possible to have the same order (hits score) irrespective of which term is given before or after?
>
> Regards,
> Niraj

Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com
RE: Problem with match on a non tokenized field.
Thanks a lot for your help. I have one more question: how would you handle a query consisting of two fields combined with a boolean operator, where one field is only indexed and stored (a Keyword) and the other is tokenized, indexed and stored? Is it possible to have parts of the same query analyzed with different analyzers?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: July 7, 2004 4:38 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper. Here is how I use it:

PerFieldAnalyzerWrapper analyzer =
    new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
try {
    query = QueryParser.parse(searchQuery, "contents", analyzer);

-Original Message-
From: Polina Litvak [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 4:20 PM
To: [EMAIL PROTECTED]
Subject: Problem with match on a non tokenized field.

I have a Lucene Document with a field named Code which is stored and indexed but not tokenized. The value of the field is ABC5-LB. The only way I can match the field when searching is by entering Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried using breaks my query into Code:ABC5 -Code:LB. I need to be able to match this field by doing something like Code:ABC5-L*, therefore always using quotes is not an option. How would I go about writing my own analyzer that will not tokenize the query?

Thanks,
Polina
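[Editor's aside: NullAnalyzer above is not a stock Lucene class; the poster evidently wrote their own. A minimal sketch of what such a pass-through analyzer might look like under the 1.4 analysis API — it returns the entire field value as one token, so a term like ABC5-LB survives analysis intact:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class NullAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, final Reader reader) {
    return new TokenStream() {
      private boolean done = false;
      public Token next() throws IOException {
        if (done) return null;   // emit exactly one token, then stop
        done = true;
        // Slurp the whole input and return it as a single untokenized term.
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[256];
        for (int n = reader.read(buf); n != -1; n = reader.read(buf))
          sb.append(buf, 0, n);
        return new Token(sb.toString(), 0, sb.length());
      }
    };
  }
}

Wrapped in a PerFieldAnalyzerWrapper as shown above, this answers the follow-up question too: the wrapper is exactly the mechanism for analyzing different parts of one query with different analyzers.]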
Re: indexing help
Hi Grant:

Thanks for the options. How likely is it that the Lucene file formats will change? Are there really no more options? :(...

Thanks
-John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:
[snip]
Re: indexing help
Hi Grant:

I have something that extracts only the important words from a document, along with their importance; furthermore, these important words may not be physically in the document: they could be synonyms of some of the words in the document. So the output of the process for a document is a list of word/importance pairs, and I want to be able to query on the document using only these words. I don't think Lucene has such a capability. Can you suggest what I can do with the analyzer process to accomplish this without replicating words/tokens?

Thanks
-John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll [EMAIL PROTECTED] wrote:

Hey John,

Those are just options, didn't say they were good ones! :-)

I guess the real question is: what is the background of what you are trying to do? Presumably you have some other program that is generating frequencies for you; do you really need that in its current form? Can't the Lucene indexing engine act as a stand-in for this process, since your end result _should_ be the same? The Lucene Analyzer process is quite flexible; I bet you could even find a way to hook your existing tools into the Analyzer process.

-Grant

[EMAIL PROTECTED] 07/08/04 10:42AM
[snip]
Re: Way to repair an index broken during 1/2 optimize?
You might try merging the existing index into a new index located on a RAM disk. Once it is done, you can move the directory from the RAM disk back to your hard disk. I think this will work as long as the old index did not finish merging. You might run the strings command on the segments file to make sure the new (merged) segment is not in there, and if there's a deletable file, make sure there are no segments from the old index listed therein.

- Original Message -
From: Kevin A. Burton [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 2:02 PM
Subject: Way to repair an index broken during 1/2 optimize?

So.. the other day I sent an email about building an index with 14M documents. That went well, but the optimize() was taking FOREVER. It took 7 hours to generate the whole index, and when complete as of 10AM it was still optimizing (6 hours later) and I needed the box back.

So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this via ls -alt.

Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours.

Kevin
problem running lucene 1.4 demo on a solaris machine (permission denied)
Hello,

I have downloaded Lucene 1.4 to a Windows machine, and it all works fine. When I try to move this to a Solaris machine I get the following error:

/opt/tomcat/common/lib/lucene-1.4-final.jar: cannot execute

If I then change the permissions (777) on the above file, I get the following errors:

/opt/tomcat/common/lib/lucene-1.4-final.jar: PK^C^D: not found
/opt/tomcat/common/lib/lucene-1.4-final.jar: \304U\3410: not found
/opt/tomcat/common/lib/lucene-1.4-final.jar: syntax error at line 3: `(' unexpected

Any ideas how to solve this, or what causes the error? I am running in the following environment:

java version 1.2.2
Solaris VM (build Solaris_JDK_1.2.2_10, native threads, sunwjit)

but I have tried with Java version 1.4.2 (I believe it was), with the same error. When I copied the Lucene jar file to the Solaris machine from the Windows machine I used an FTP program.

Any help is much appreciated.

Best regards,
Mats Lindberg
Re: Way to repair an index broken during 1/2 optimize?
Kevin A. Burton wrote:
> So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this via ls -alt.

Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment.

Also, to elaborate on my previous comment: a mergeFactor of 5000 not only delays the work until the end, it also makes the disk workload more seek-dominated, which is not optimal. So I suspect a smaller merge factor, together with a larger minMergeDocs, will be much faster overall, including the final optimize(). Please tell us how it goes.

Doug
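[Editor's aside: a minimal sketch of that tuning with the 1.4-era IndexWriter, whose knobs are public fields. The path and values here are illustrative, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
    writer.mergeFactor = 10;      // small: merge incrementally instead of deferring all work to optimize()
    writer.minMergeDocs = 1000;   // larger: buffer more documents in RAM before flushing a segment
    writer.infoStream = System.out; // log merge activity; useful for diagnosing a slow optimize()
    // ... writer.addDocument(...) calls go here ...
    writer.optimize();
    writer.close();
  }
}

The infoStream line also answers Doug's later question about logging merges.]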
Re: problem running lucene 1.4 demo on a solaris machine (permission denied)
MATL (Mats Lindberg) wrote:
> When I copied the Lucene jar file to the Solaris machine from the Windows machine I used an FTP program.

FTP probably mangled the file. You need to use FTP's binary mode.

Doug
Re: Way to repair an index broken during 1/2 optimize?
Peter M Cipollone wrote:
> You might try merging the existing index into a new index located on a RAM disk. Once it is done, you can move the directory from the RAM disk back to your hard disk. I think this will work as long as the old index did not finish merging.

It's a HUGE index. It won't fit in memory ;) Right now it's at 8G...

Thanks though! :)

Kevin
Re: Way to repair an index broken during 1/2 optimize?
Doug Cutting wrote:
> Kevin A. Burton wrote:
> > Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours.
>
> Was this the index with the mergeFactor of 5000? If so, that's why it's so slow: you've delayed all of the work until the end. Indexing on a ramfs will make things faster in general, however, if you have enough RAM...

No... I changed the mergeFactor back to 10 as you suggested.

Kevin
Re: Way to repair an index broken during 1/2 optimize?
Doug Cutting wrote:
> Kevin A. Burton wrote:
> > So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this via ls -alt.
>
> Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment.

I'm worried about duplicate or missing content from the original index. I'd rather rebuild the index and waste another 6 hours (I've probably blown 100 hours of CPU time on this already) and have a correct index :)

During an optimize, I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them?

> Also, to elaborate on my previous comment, a mergeFactor of 5000 not only delays the work until the end, but it also makes the disk workload more seek-dominated, which is not optimal.

The only settings I use are:

targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;

The resulting index has 230k files in it :-/ I assume this is contributing to all the disk seeks.

> So I suspect a smaller merge factor, together with a larger minMergeDocs, will be much faster overall, including the final optimize(). Please tell us how it goes.

This is what I did for this last round, but then I ended up with the highly fragmented index. Hm...

Thanks for all the help btw!

Kevin
Re: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote:
> Hi,
>
> A couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went smoothly, but we are experiencing some problems with the new constant limit maxClauseCount=1024, which leads to exceptions of type org.apache.lucene.search.BooleanQuery$TooManyClauses when certain RangeQueries are executed (in fact, we get this exception when we execute certain wildcard queries, too). Although we are working with a fairly small index of about 35,000 documents, we encounter this exception when we search on the property modificationDate. For example:
>
> modificationDate:[00 TO 0dwc970kw]

We talked about this the other day:

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Find out what type of precision you need and use that. If you only need days or hours or minutes, then use that. Millis is just too small. We're only using days, and our queries go back just the last 7 days at most, so this really works out well...

Kevin
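[Editor's aside: the day-precision idea in a minimal sketch. The field name and format are illustrative; the point is that a range over day strings like "20040708" expands to a handful of terms, where a range over millisecond-precision values can blow past the 1024-clause limit:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DayPrecisionDates {
  public static void main(String[] args) {
    SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
    Document doc = new Document();
    // Index the modification date at day precision, e.g. "20040708".
    // A query like modificationDate:[20040701 TO 20040708] then expands
    // to at most a few terms instead of thousands.
    doc.add(Field.Keyword("modificationDate", df.format(new Date())));
    System.out.println(doc);
  }
}

Alternatively, the static BooleanQuery.setMaxClauseCount() raises the limit, at the cost of the extra memory the expanded query consumes.]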
Re: indexing help
Thanks Doug. I will do just that.

Just for my education, can you maybe elaborate on the "implement an IndexReader that delivers a synthetic index" approach?

Thanks in advance
-John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote:

John Wang wrote:
> The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary.

That's easy to fix. We just need to reuse the token:

public class VectorTokenStream extends TokenStream {
  private String[] terms;
  private int[] freqs;
  private int term = -1;
  private int freq = 0;
  private Token token;

  public VectorTokenStream(String[] terms, int[] freqs) {
    this.terms = terms;
    this.freqs = freqs;
  }

  public Token next() {
    if (freq == 0) {
      term++;
      if (term >= terms.length)
        return null;
      token = new Token(terms[term], 0, 0);  // one Token per distinct term, reused
      freq = freqs[term];
    }
    freq--;
    return token;
  }
}

Then only two tokens are created, as you desire.

If you for some reason don't want to create a dummy document stream, then you could instead implement an IndexReader that delivers a synthetic index for a single document, then use IndexWriter.addIndexes() to turn this into a real, FSDirectory-based index. However, that would be a lot more work and only very marginally faster, so I'd stick with the approach I've outlined above.

(Note: this code has not been compiled or run. It may have bugs.)

Doug
Re: Way to repair an index broken during 1/2 optimize?
Kevin A. Burton wrote:
> No... I changed the mergeFactor back to 10 as you suggested.

Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last entry.

Doug
Re: Lucene shouldn't use java.io.tmpdir
Otis Gospodnetic wrote:
> Hey Kevin,
>
> Not sure if you're aware of it, but you can specify the lock dir, so in your example both JVMs could use the exact same lock dir, as long as you invoke the VMs with the same params. Most people won't do this or won't even understand WHY they need to do this :-/. You shouldn't be writing to the same index with more than one IndexWriter, though (not sure if this was just a bad example or a real scenario).

Yes... I realize that you shouldn't use more than one IndexWriter. That was the point. The locks are there to prevent this from happening. If one were to accidentally do this, the locks would be in different directories and our IndexWriter would corrupt the index.

This is why I think it makes more sense to use our own java.io.tmpdir, to be on the safe side.

Kevin
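[Editor's aside: a minimal sketch of pinning the lock directory explicitly, so that every JVM touching the index agrees on it. The "org.apache.lucene.lockDir" property name is my recollection of the 1.4-era FSDirectory and should be verified against your release; the path is illustrative:

public class SharedLockDir {
  public static void main(String[] args) {
    // Assumed property name -- check FSDirectory's source in your Lucene version.
    // Set this identically (or pass -Dorg.apache.lucene.lockDir=... on the
    // command line) in every JVM that opens the index, so they all see each
    // other's lock files.
    System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");
    // ... then open IndexWriter / IndexReader as usual ...
  }
}
]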
Re: Way to repair an index broken during 1/2 optimize?
Doug Cutting wrote:
> Kevin A. Burton wrote:
> > No... I changed the mergeFactor back to 10 as you suggested.
>
> Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last entry.

No, I didn't actually... If I run it again I'll be sure to do this.

Kevin
Re: Lucene shouldn't use java.io.tmpdir
Kevin A. Burton wrote:
> This is why I think it makes more sense to use our own java.io.tmpdir, to be on the safe side.

I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on Unix, c:\windows\tmp on Windows, etc. Tomcat breaks that.

So must Lucene have its own way of finding the platform-specific temporary directory that everyone can write to? Perhaps, but it seems a shame, since Java already has a standard mechanism for this, which Tomcat abuses...

Doug
Re: indexing help
John Wang wrote:
> Just for my education, can you maybe elaborate on the "implement an IndexReader that delivers a synthetic index" approach?

IndexReader is an abstract class. It has few data fields, and few non-static methods that are not implemented in terms of abstract methods. So, in effect, it is an interface.

When Lucene indexes a token stream it creates a single-document index that is then merged with other single- and multi-document indexes to form an index that is searched. You could bypass the first step of this (indexing a token stream) by instead directly implementing all of IndexReader's abstract methods to return the same thing as the single-document index that Lucene would create. This would be marginally faster, as no Token objects would be created at all. But, since IndexReader has a lot of abstract methods, it would be a lot of work. I didn't really mean it as a practical suggestion.

Doug
Re: Lucene shouldn't use java.io.tmpdir
Doug Cutting wrote:
[snip]

I've seen this done in other places as well. I think Weblogic did/does it. I'm wondering what some of these big EJB containers use, which is why I brought this up. I'm not sure the problem is just with Tomcat.

Kevin
Where's the search(Query query, Sort sort) method of Searcher
I'm trying to do a search and sort the results using a Sort object. The 1.4-final API says that Searcher has the following method:

Hits search(Query query, Sort sort)

However, when I try to use it in the code below:

IndexSearcher is = new IndexSearcher(fsDir);
Query query = QueryParser.parse("Nuggets", "creator", new StandardAnalyzer());
Hits hits = is.search(query, new Sort("created"));

I get the following compile error:

[javac] Compiling 18 source files to /Users/bill/Nuggets/classes
[javac] /Users/bill/Nuggets/src/com/otherwise/nuggets/MySearcher.java:44: cannot resolve symbol
[javac] symbol  : method search (org.apache.lucene.search.Query,org.apache.lucene.search.Sort)
[javac] location: class org.apache.lucene.search.IndexSearcher
[javac]         hits = is.search(query, new Sort("created"));
[javac]                  ^

If I do the same call without the Sort object, it compiles just fine. This seems to indicate the search(Query, Sort) method is not in the jar file. Either the API docs are in error (doubtful) or I'm doing something really stupid (likely). Can someone explain which it is?

--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com
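[Editor's aside: a common cause of exactly this symptom is an older Lucene jar sitting earlier on the compile-time classpath, since search(Query, Sort) first appeared in 1.4. A minimal diagnostic sketch (plain Java, no Lucene-API assumptions) that prints where the class is actually loaded from at runtime:

public class WhichJar {
  public static void main(String[] args) {
    // Prints the jar (or directory) IndexSearcher came from; if this is not
    // lucene-1.4-final.jar, an older copy is shadowing it on the classpath.
    System.out.println(org.apache.lucene.search.IndexSearcher.class
        .getProtectionDomain().getCodeSource().getLocation());
  }
}
]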
Re: Way to repair an index broken during 1/2 optimize?
Kevin A. Burton wrote:
> During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them?

That's correct.

> The only settings I use are:
>
> targetIndex.mergeFactor=10;
> targetIndex.minMergeDocs=1000;
>
> The resulting index has 230k files in it :-/

Something sounds very wrong for there to be that many files. The maximum number of files should be around:

(7 + numIndexedFields) * (mergeFactor - 1) * log_mergeFactor(numDocs / minMergeDocs)

With 14M documents, log_10(14M / 1000) is 4, which gives, for you:

(7 + numIndexedFields) * 36 = 230k
7 * 36 + numIndexedFields * 36 = 230k
numIndexedFields = (230k - 7 * 36) / 36 =~ 6k

So you'd have to have around 6k unique field names to get 230k files. Or something else must be wrong. Are you running on win32, where file deletion can be difficult? With the typical handful of fields, one should never see more than hundreds of files.

Doug
Re: Way to repair an index broken during 1/2 optimize?
Doug Cutting wrote:
[snip]

We only have 13 fields... Though to be honest, I'm worried that even if I COULD do the optimize, it would run out of file handles. This is very strange...

I'm going to increase minMergeDocs to 1 and then run the full conversion on one box and then try to do an optimize (of the corrupt index) on another box. See which one finishes first. I assume the speed of optimize() can be increased the same way that indexing is increased...

Kevin
Browse by Letter within a Category
I would like to implement the following functionality:

- Search a specific field (category), limit the search to where the title field begins with a given letter, and return the results sorted in alphabetical order by title.

Both the category and title fields are tokenized, indexed and stored in the index (type Field.Text). How should I construct the search and sort? I tried the following, but the titles are not being displayed in alphabetical order:

Searcher.search("category:\"Products\" AND title:\"A*\"", new Sort("title"));

I want to display all results in the Products category whose title begins with the letter A, sorted in alphabetical order by title. I'm using the Lucene 1.4 final release.

Thanks,
Tom
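[Editor's aside: one thing worth checking here — Lucene 1.4's Sort expects the sort field to contain a single term per document, which a tokenized Field.Text does not guarantee. A minimal sketch of a common workaround, indexing the title a second time untokenized just for sorting (the "titleSort" field name is illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SortableTitle {
  public static void main(String[] args) {
    Document doc = new Document();
    // Tokenized copy: what users search against.
    doc.add(Field.Text("title", "Acme Widget Pro"));
    // Untokenized, lowercased copy: exactly one term per document, safe to sort on.
    doc.add(Field.Keyword("titleSort", "acme widget pro"));
    System.out.println(doc);
  }
}

The search stays the same; only the sort changes, to new Sort("titleSort").]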
Re: boolean operators and score
Hi Don,

After months of struggling with Lucene and finally achieving the complex relevancy we wanted, the client would kill me if I now threw that relevancy away. I am trying to do it the way Franck suggested, by sorting the words the user has entered, but otherwise, isn't this a bug in Lucene?

Regards,
Niraj

- Original Message -
From: Don Vaillancourt [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 7:15 PM
Subject: Re: boolean operators and score

[snip]