Re: how to estimate how much memory is required to support the large index search
BTW, upcoming changes in Lucene for flexible indexing should improve the RAM usage of the terms index substantially: https://issues.apache.org/jira/browse/LUCENE-1458 In the current (first) iteration of that patch, TermInfo is no longer used at all when loading the index. I think for a typical index this will likely cut the RAM used by the terms index in half. But... this won't be available for some time (it's still a work in progress). Mike

Chris Lu wrote: So it looks like you are not really doing much sorting? This index divisor affects reader.terms(), but not too much with sorting. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

On Mon, Nov 17, 2008 at 6:21 PM, Zhibin Mai [EMAIL PROTECTED] wrote: It is a cache tuning setting in IndexReader. It can be set via the method setTermInfosIndexDivisor(int). Thanks, Zhibin

From: Chris Lu [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Monday, November 17, 2008 7:07:21 PM Subject: Re: how to estimate how much memory is required to support the large index search The calculation looks right. But what's the index divisor that you mentioned?

On Mon, Nov 17, 2008 at 5:00 PM, Zhibin Mai [EMAIL PROTECTED] wrote: Aleksander, I figured out that most of the heap was consumed by the Term cache.
In our case, the index has 233 million terms, and 6.4 million of them were loaded into the cache when we did the search. I did a rough calculation of how much memory each term needs: about 16 bytes for the Term object + 32 bytes for the TermInfo object + 24 bytes for the String object holding the term text + 2 * length(char[]) for the term text itself. In our case, the average length of the term text is 25 characters, which means each term needs at least 122 bytes. The cache for 6.4 million terms needs 6.4M * 122 bytes = 780MB. Plus 200MB for caching norms, the RAM for the cache is larger than 980MB. We work around the cache issue for Terms by setting the index divisor of the IndexReader to a higher value. Actually, search performance is good even with an index divisor of 4. Thanks, Zhibin

From: Aleksander M. Stensby [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Monday, November 17, 2008 2:31:04 AM Subject: Re: how to estimate how much memory is required to support the large index search One major factor that may result in heap space problems is if you are doing any form of sorting when searching. Do you have any form of default sort in your application? Also, the type of field used for sorting is important with regard to memory consumption. This issue has been discussed before on the list. (You can search the archive for sorting and memory consumption.) - Aleksander

On Sun, 16 Nov 2008 14:36:36 +0100, Zhibin Mai [EMAIL PROTECTED] wrote: Hello, I am a beginner at using Lucene. We developed an application to create and search an index using Lucene 2.3.1. We would like to know how to estimate how much memory is required to support searching a given index. Recently, the size of the index has reached about 200GB, with 197M documents and 223M terms. Our application has started having intermittent OutOfMemoryError: Java heap space when we use it to search the index.
We used JProfiler to get the following memory allocation when we do one keyword search:

  char[]                               332MB
  org.apache.lucene.index.TermInfo     194MB
  java.lang.String                     146MB
  org.apache.lucene.index.Term       99,823KB
  org.apache.lucene.index.Term       24,956KB
  org.apache.lucene.index.TermInfo[] 24,956KB
  byte[]                               188MB
  long[]                             49,912KB

The memory allocation for the first six types of objects does not change when we change the search criteria. Could you please give me some advice on which major factors affect the memory allocation, and precisely how those factors affect memory usage during search? Is it possible to reduce memory usage on search? Thank you, Zhibin --Aleksander M. Stensby Senior software
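Zhibin's per-term arithmetic in the thread above can be sanity-checked with a small sketch. The per-object byte counts below are his estimates for Lucene 2.3 (not measured values), and the class and method names are just illustrative:

```java
public class TermCacheEstimate {
    // Per-term footprint using the byte counts from Zhibin's estimate.
    static long bytesPerTerm(int avgTermLength) {
        int termObject = 16;               // Term object
        int termInfoObject = 32;           // TermInfo object
        int stringObject = 24;             // String object for the term text
        int textChars = 2 * avgTermLength; // UTF-16 chars in the char[]
        return termObject + termInfoObject + stringObject + textChars;
    }

    public static void main(String[] args) {
        long perTerm = bytesPerTerm(25);   // average term length of 25 chars
        long cachedTerms = 6_400_000L;     // terms loaded into the cache
        long totalMB = perTerm * cachedTerms / 1_000_000L;
        System.out.println(perTerm + " bytes/term, ~" + totalMB + "MB for the term cache");
        // prints: 122 bytes/term, ~780MB for the term cache
    }
}
```

Raising the term infos index divisor to N makes the reader load only every Nth indexed term, dividing this figure roughly by N, which is the workaround the thread settles on.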
Re: Lucene 2.4 Token Stream error
Can you post the code fragment in AccentFilter.java that's setting the Token? In 2.4 we added that check (throwing IllegalArgumentException) to ensure you don't setTermLength to something longer than the current term buffer. You should call resizeTermBuffer() first, then fill in the char[] for the token, then call setTermLength. Mike

bhupesh bansal wrote: Hey folks, I saw this error in my code base after upgrading to Lucene 2.4 from Lucene 2.3. Have folks seen this before? Any ideas? Is it related to the fix in https://issues.apache.org/jira/browse/LUCENE-1333 ?

java.lang.IllegalArgumentException: length 11 exceeds the size of the termBuffer (10)
    at org.apache.lucene.analysis.Token.setTermLength(Token.java:526)
    at com.linkedin.search.pub.stemming.impl.filter.AccentFilter.next(AccentFilter.java:42)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:34)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
    at com.linkedin.search.pub.stemming.impl.filter.PushbackFilter.next(PushbackFilter.java:52)
    at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:58)
    at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:70)
    at com.linkedin.search.pub.stemming.impl.filter.rewrite.RewriteFilter.next(RewriteFilter.java:39)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
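The ordering Mike describes (grow the buffer first, copy the characters in, then record the new length) can be illustrated in plain Java. The helper below is a stand-in for Lucene's Token.resizeTermBuffer(), not the real API:

```java
public class TermBufferDemo {
    // Stand-in for Token.resizeTermBuffer(): grows the buffer, preserving contents.
    static char[] resizeTermBuffer(char[] buffer, int needed) {
        if (buffer.length >= needed) return buffer;
        char[] bigger = new char[Math.max(needed, buffer.length * 2)];
        System.arraycopy(buffer, 0, bigger, 0, buffer.length);
        return bigger;
    }

    public static void main(String[] args) {
        char[] termBuffer = new char[10];
        String newTerm = "fantastisch"; // 11 chars, longer than the 10-char buffer
        // Resize first; recording length 11 against a 10-char buffer is what
        // triggers "length 11 exceeds the size of the termBuffer (10)" in 2.4.
        termBuffer = resizeTermBuffer(termBuffer, newTerm.length());
        newTerm.getChars(0, newTerm.length(), termBuffer, 0);
        int termLength = newTerm.length(); // analogous to setTermLength(11)
        System.out.println(new String(termBuffer, 0, termLength));
        // prints: fantastisch
    }
}
```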
Re: Reopen IndexReader
Did you create your IndexSearcher using a String or File (not a Directory)? If so, it sounds like you are hitting this issue (just fixed this morning on 2.9-dev (trunk)): https://issues.apache.org/jira/browse/LUCENE-1453 The workaround is to use the Directory ctor of IndexSearcher. Mike

Ganesh wrote: Hello all, I am using version 2.4. The following code throws AlreadyClosedException:

IndexReader reader = searcher.getIndexReader();
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    boolean isCurrent = newReader.isCurrent(); // throws exception
}

Full exception:

org.apache.lucene.store.AlreadyClosedException: this Directory is closed
    at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
    at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
    at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
    at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
    at MailIndexer.IndexSearcherEx.reOpenDB(IndexSearcherEx.java:102)

Please correct me if I am wrong. Regards Ganesh Send instant messages to your online friends http://in.messenger.yahoo.com
Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Hi, what is the best way to transform the German umlauts ö, ä, ü, ß into oe, ae, ue, ss during the process of analyzing? Thanks, Sascha Fahl Softwareentwicklung evenity GmbH Zu den Mühlen 19 D-35390 Gießen Mail: [EMAIL PROTECTED]
AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Use ISOLatin1AccentFilter, although it is not perfect... So I made an ISOLatin2AccentFilter for myself and changed this method. We use our own analysers, so you would use something like this:

result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result);
result = new org.apache.lucene.analysis.LowerCaseFilter(result);

/**
 * To replace accented characters in a String by unaccented equivalents.
 */
public final static String removeAccents(String input) {
    final StringBuffer output = new StringBuffer();
    for (int i = 0; i < input.length(); i++) {
        switch (input.charAt(i)) {
            case '\u00C0': // À
            case '\u00C1': // Á
            case '\u00C2': // Â
            case '\u00C3': // Ã
            case '\u00C5': // Å
                output.append("A"); break;
            case '\u00C4': // Ä
            case '\u00C6': // Æ
                output.append("AE"); break;
            case '\u00C7': // Ç
                output.append("C"); break;
            case '\u00C8': // È
            case '\u00C9': // É
            case '\u00CA': // Ê
            case '\u00CB': // Ë
                output.append("E"); break;
            case '\u00CC': // Ì
            case '\u00CD': // Í
            case '\u00CE': // Î
            case '\u00CF': // Ï
                output.append("I"); break;
            case '\u00D0': // Ð
                output.append("D"); break;
            case '\u00D1': // Ñ
                output.append("N"); break;
            case '\u00D2': // Ò
            case '\u00D3': // Ó
            case '\u00D4': // Ô
            case '\u00D5': // Õ
            case '\u00D8': // Ø
                output.append("O"); break;
            case '\u00D6': // Ö
            case '\u0152': // Œ
                output.append("OE"); break;
            case '\u00DE': // Þ
                output.append("TH"); break;
            case '\u00D9': // Ù
            case '\u00DA': // Ú
            case '\u00DB': // Û
                output.append("U"); break;
            case '\u00DC': // Ü
                output.append("UE"); break;
            case '\u00DD': // Ý
            case '\u0178': // Ÿ
                output.append("Y"); break;
            case '\u00E0': // à
            case '\u00E1': // á
            case '\u00E2': // â
            case '\u00E3': // ã
            case '\u00E5': // å
                output.append("a"); break;
            case '\u00E4': // ä
            case '\u00E6': // æ
                output.append("ae"); break;
            case '\u00E7': // ç
                output.append("c"); break;
            case '\u00E8': // è
            case '\u00E9': // é
            case '\u00EA': // ê
            case '\u00EB': // ë
                output.append("e"); break;
            case '\u00EC': // ì
            case '\u00ED': // í
Re: Reopen IndexReader
Well... we certainly do our best to make each release stable, but we do make mistakes, so you'll have to use your own judgement on when to upgrade. However, it's only through users like yourself upgrading that we then find and fix any uncaught issues in each new release. Mike

Ganesh wrote: I am creating the IndexSearcher using a String; this was working fine with version 2.3.2. I tried replacing it with the Directory ctor of IndexSearcher and it is working fine with v2.4. I have recently upgraded from v2.3.2 to 2.4. Is v2.4 stable, so I can move forward with it, or shall I revert to 2.3.2? Regards Ganesh

- Original Message - From: Michael McCandless [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, November 18, 2008 4:59 PM Subject: Re: Reopen IndexReader Did you create your IndexSearcher using a String or File (not a Directory)? If so, it sounds like you are hitting this issue (just fixed this morning on 2.9-dev (trunk)): https://issues.apache.org/jira/browse/LUCENE-1453 The workaround is to use the Directory ctor of IndexSearcher. Mike
Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Uwe Goetzke wrote: Use ISOLatin1AccentFilter, although it is not perfect... So I made ISOLatin2AccentFilter for me and changed this method. Or use the CharFilter library. It is for Solr as of now, though. See: https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG https://issues.apache.org/jira/browse/SOLR-822 Koji
Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Where do I get the CharFilter library? I'm using Lucene, not Solr. Thanks, Sascha

On 18.11.2008 at 14:11, Koji Sekiguchi wrote: Or use the CharFilter library. It is for Solr as of now, though. See: https://issues.apache.org/jira/browse/SOLR-822 Koji

Sascha Fahl Softwareentwicklung evenity GmbH Zu den Mühlen 19 D-35390 Gießen Mail: [EMAIL PROTECTED]
Re: how to estimate how much memory is required to support the large index search
You are right. Cheers, Zhibin

From: Chris Lu [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Monday, November 17, 2008 11:13:44 PM Subject: Re: how to estimate how much memory is required to support the large index search So looks like you are not really doing much sorting? This index divisor affects reader.terms(), but not too much with sorting. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application

--Aleksander M. Stensby Senior software developer Integrasco A/S www.integrasco.no
Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Sascha Fahl wrote: Where do I get the CharFilter library? I'm using Lucene, not Solr. Thanks, Sascha CharFilter is included in recent Solr nightly builds. It is not an out-of-the-box solution for Lucene right now, sorry. If I have time, I will make it for Lucene this weekend. Koji
RE: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss
Naming this class to include Latin2 may be misleading: Latin2 means the ISO-8859-2 character set. http://en.wikipedia.org/wiki/ISO_8859-2

From: Uwe Goetzke [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 18, 2008 7:26 AM To: java-user@lucene.apache.org Cc: [EMAIL PROTECTED] Subject: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss Use ISOLatin1AccentFilter, although it is not perfect... So I made ISOLatin2AccentFilter for me and changed this method.
Special characters prevent entity being indexed
Hi! I'm having problems with entities including special characters (Spanish language) not getting indexed. I haven't been able to find the reason why some entities get indexed while some don't. I have 3 fields that (currently) hold the same value. The value for the fields is, for example, "¡Fantástico!- blaaba". When I change ONE of the three values to "¡Fantástico! - blaaba", the entity gets indexed. So changing only one field makes it index. But the bigger problem is that I have an almost identical entity (the other fields are almost similar, and I don't think they cause the problem) with exactly the same three "¡Fantástico!- blaaba" fields, and it gets indexed normally, even though the critical fields are exactly the same. Also, all entities where the three fields start with an upside-down ?-mark don't get indexed. I'm really confused by this problem because I can't find any logic to some entities not being indexed even though they are similar to others, while changing only one value of the three makes it index. Sorry for a really messy message, but I just can't explain it more clearly now. Thanks in advance, pn
Re: Special characters prevent entity being indexed
What analyzer are you using at index and search time? Typical problems include: using an analyzer that doesn't understand accented chars (StandardAnalyzer, for instance), or using a different analyzer during search than during indexing. Search the user list for "accent" and you'll find this kind of problem discussed; if that doesn't help, we need to know what analyzers you are using and what behavior you really want. Typically, for instance, *requiring* a user to type the upside-down exclamation point to get a match on this field would be considered incorrect. Also, you'd be helped a lot by getting a copy of Luke and examining your index to see exactly what's been indexed; it'll reveal a lot. Best Erick

On Tue, Nov 18, 2008 at 10:05 AM, Pekka Nykyri [EMAIL PROTECTED] wrote: Hi! I'm having problems with entities including special characters (Spanish language) not getting indexed.
compare scores across queries
Hi all, I am wondering if the raw scores obtained from a HitCollector can be used to compare the relevance of documents to different queries. E.g., two phrase queries are issued (PQ1: "Barack Obama" and PQ2: "John McCain"). If a document (doc1) belongs to the result sets of both queries and has a raw score of 5 for PQ1 and 3 for PQ2, can I say that doc1 is more relevant to "Barack Obama" than to "John McCain"? There have been some previous discussions about this at [1,2]. On the other hand, the javadoc of the Similarity class [3] says "queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable." Please advise. Thanks. Ng.

[1] http://thread.gmane.org/gmane.comp.jakarta.lucene.user/10760/focus=10810
[2] http://www.gossamer-threads.com/lists/lucene/java-user/35051?search_string=compare%20score%20across%20queries;#35051
[3] http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
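To make the quoted javadoc concrete: in Lucene's DefaultSimilarity the factor is queryNorm(q) = 1 / sqrt(sumOfSquaredWeights). A small plain-Java sketch; the weight values are made up for illustration, not from any real index:

```java
public class QueryNormDemo {
    // queryNorm as defined by Lucene's DefaultSimilarity:
    // 1 / sqrt(sum of squared term weights in the query).
    static double queryNorm(double[] termWeights) {
        double sumOfSquaredWeights = 0.0;
        for (double w : termWeights) {
            sumOfSquaredWeights += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Two hypothetical queries whose terms have different idf*boost weights.
        double normA = queryNorm(new double[] {3.0, 4.0}); // 1/sqrt(25) = 0.2
        double normB = queryNorm(new double[] {1.0});      // 1/sqrt(1)  = 1.0
        // The same raw dot-product contribution gets scaled very differently
        // per query, which is why raw scores from different queries are not
        // directly comparable without this normalization.
        System.out.println(normA + " " + normB);
        // prints: 0.2 1.0
    }
}
```

Note the javadoc's own caveat still applies: queryNorm only attempts to make scores comparable; it does not turn them into calibrated relevance values.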
can I set Boost to the term while indexing?
I would like to store a set of keywords in a single field of a document. For example, I have three keywords, One, Two and Three, and I am going to add them to a document. First, is this code correct?

String[] keyword = new String[]{"One", "Two", "Three"};
for (int i = 0; i < keyword.length; i++) {
    Field f = new Field(field_name, keyword[i], Field.Store.NO,
                        Field.Index.UN_TOKENIZED, TermVector.YES);
    doc.add(f);
}
indexWriter.addDocument(doc);

When searching, we can set a Boost for a query term. The question is: can I set a Boost for every keyword/term while indexing? From the example above, I may want to index those keywords, i.e. One, Two and Three, with different Weight/Boost/Relevance, and the same term may have a different Weight/Boost/Relevance in different documents. Can I do this? Thanks. :-)
Searching repeating fields
Hello, I am designing an index in which one url corresponds to one document. Each document also contains multiple parallel repeating fields. For example:

Document 1:
  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title: news
  page_title: cnn news
  page_title: homepage
  username: ajax
  username: paris
  username: daniel

In this contrived example, user 'ajax' has saved the URL with the page title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has saved it with 'homepage'. What I need to be able to do is perform a search for a particular user and a particular title, but they must occur together. For example, +user:ajax +page_title:news would return this document, but +user:ajax +page_title:homepage would not. I am open to changing the design of the document (i.e. using repeating fields isn't required), but I do need to have one document per url. I am looking for suggestions for a strategy to implement this requirement. Thanks, Mark Ferguson
Re: Searching repeating fields
How about using variable field names?

  url: http://www.cnn.com/
  page_description: cnn breaking news
  page_title_ajax: news
  page_title_paris: cnn news
  page_title_daniel: homepage
  username: ajax
  username: paris
  username: daniel

and search for +user:ajax +page_title_ajax:news, or maybe just page_title_ajax:news. You might not even need to store user. -- Ian.

On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson [EMAIL PROTECTED] wrote: Hello, I am designing an index in which one url corresponds to one document. Each document also contains multiple parallel repeating fields.
Re: constructing a mini-index with just the number of hits for a term
Flexible indexing (LUCENE-1458) should make this possible. I.e., you could use your own codec which discards doc/freq/prox/payload data during indexing (for this one field) and simply stores the term frequency in the terms dict. However, one problem will be deletions (in case it matters to your app): in order to properly update the terms dict counts, SegmentMerger walks through the docIDs for the term and skips the deleted ones. But it will be some time before this is real, though there's an initial patch on LUCENE-1458. Mike

Grant Ingersoll wrote: Can you share what the actual problem is that you are trying to solve? It might help put things in context for me. I'm guessing you are doing some type of co-occurrence analysis, but... More below. On Nov 13, 2008, at 11:08 AM, Sven wrote: First - I apologize for the double post of my earlier email. The first time I sent it I received an error message from [EMAIL PROTECTED] saying that I should instead send email to [EMAIL PROTECTED], so I thought it did not go through. My question is this: is there a way to use the Lucene/Solr infrastructure to create a mini-index that simply contains a lookup table of terms and the number of times they have appeared? This could be possible. I think I would create documents with Index.ANALYZED and Store.NO. Then you just need to use the TermEnum and TermDocs to access the information that you need. In a sense, you are just creating the term dictionary. You could also turn off storing of norms, which will save space too. I do not need to record which documents contain them, nor do I need to know where in the documents they appear. There could be (and probably will be) more than 2^32 terms, however. 2^32 unique terms or 2^32 total terms? I'm not sure if that makes a difference to the Lucene backend, but thought it might be relevant. This question coincides with my earlier question about counting the times a given term is associated with another term.
I figure that this would be more easily accomplished by making the mini-index described above alongside the normal index when a document is indexed. For example, when scanning "Bravely bold Sir Robin, brought forth from Camelot. He was not afraid to die! Oh, brave Sir Robin!", in addition to the normal indexing function of Lucene, I would like to write something on the backend to also index:

  bravely|bold bravely|sir bravely|robin bravely|brought bravely|forth
  bold|sir bold|robin bold|brought bold|forth bold|camelot ("from" being a stop word)
  ...and so on

I only need to keep a running total of each bravely|bold term, however; since the number of terms will be quite large, keeping track of the document/termpositions would translate to a lot of wasted HD space. For this, I think you will have to hook into the Analyzer process. The other thing to do is just try keeping the document/term positions; it may not actually be as bad as you think in terms of space. If such a thing is not already in place, could someone let me know if there are tutorials, documentation, or presentations that describe the inner workings of Lucene and the theories/implementation at work for the actual file formats, structures, data manipulations, etc.? (The javadocs don't go into this kind of detail.) I'm sure I can sift through the code and eventually make sense of it, but if there is documentation out there, I'd prefer to peruse that first. My thought being that I can simply generate my own kind of hash for each combined term and write it out to a custom file structure similar to Lucene's - but the specifics of how to (optimally) do so are not plain to me yet. Thanks again!
-Sven -- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
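Sven's pair generation can be sketched in plain Java. This is not hooked into Lucene's Analyzer; it just shows the combined-term emission over a token window, where the window size of 5 and the stop list are assumptions chosen to reproduce his example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PairTerms {
    // Emit "a|b" for each token paired with the next `window` following
    // tokens, after stop words have been dropped entirely (so "camelot"
    // still pairs with "bold" across the stop word "from").
    static List<String> pairs(List<String> tokens, Set<String> stopWords, int window) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) kept.add(t);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < kept.size(); i++) {
            for (int j = i + 1; j < kept.size() && j <= i + window; j++) {
                out.add(kept.get(i) + "|" + kept.get(j));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("bravely", "bold", "sir", "robin",
                "brought", "forth", "from", "camelot");
        List<String> result = pairs(tokens, Set.of("from"), 5);
        System.out.println(result.subList(0, 10));
        // first ten entries match the message: bravely|bold ... bold|camelot
    }
}
```

Each emitted string would then be fed to the mini-index as a single term whose frequency count is the running total Sven describes.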
Re: Searching repeating fields
Thanks for the suggestion, but I think I will need a more robust solution, because this will only work with pairs of fields. I should have specified that the example I gave was somewhat contrived, but in practice there could be more than two parallel fields. I'm trying to find a general solution that I can apply to any number of parallel fields holding any kind of data. I was thinking of trying something along the lines of a multi-value field. So for example, I could have: page_user_title: ajax|news (where | is a field separator) The problem is I don't know how to formulate the query that would be equivalent to +username:ajax +page_title:news, or if it's even possible. (I should also mention that I am creating the queries programmatically, not using the query parser, so anything goes). Any other ideas? Mark Ferguson On Tue, Nov 18, 2008 at 1:06 PM, Ian Lea [EMAIL PROTECTED] wrote: How about using variable field names? url: http://www.cnn.com/ page_description: cnn breaking news page_title_ajax: news page_title_paris: cnn news page_title_daniel: homepage username: ajax username: paris username: daniel and search for +user:ajax +page_title_ajax:news or maybe just page_title_ajax:news. Might not even need to store user. -- Ian. On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson [EMAIL PROTECTED] wrote: Hello, I am designing an index in which one url corresponds to one document. Each document also contains multiple parallel repeating fields. For example: Document 1: url: http://www.cnn.com/ page_description: cnn breaking news page_title: news page_title: cnn news page_titel: homepage username: ajax username: paris username: daniel In this contrived example, user 'ajax' have saved the URL with the page title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has saved it with 'homepage'. What I need to be able to do is perform a search for a particular user and a particular title, but they must occur together. 
For example, +user:ajax +page_title:news would return this document, but +user:ajax +page_title:homepage would not. I am open to changing the design of the document (i.e. using repeating fields isn't required), but I do need to have one document per url. I am looking for suggestions for a strategy on implementing this requirement. Thanks, Mark Ferguson
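The combined-token idea Mark floats (page_user_title: ajax|news) can work if the combined field is indexed untokenized, so each "user|title" value survives as a single term. A minimal sketch of the pairing, with hypothetical field and helper names (not anything from the thread's actual code):

```java
// Sketch only: combine each username with that user's title into one token
// of a multi-valued, untokenized field. The field name page_user_title and
// this helper are hypothetical.
public class CombinedToken {

    // Token stored in the page_user_title field, e.g. "ajax|news".
    // '|' is the separator and must not occur in either value.
    public static String combine(String user, String title) {
        return user + "|" + title;
    }

    public static void main(String[] args) {
        // Document 1 would get three page_user_title values:
        System.out.println(combine("ajax", "news"));
        System.out.println(combine("paris", "cnn news"));
        System.out.println(combine("daniel", "homepage"));
        // The query equivalent to +username:ajax +page_title:news is then one
        // exact-term query on the combined token, e.g. in Lucene:
        // new TermQuery(new Term("page_user_title", combine("ajax", "news")))
    }
}
```

Because the pairing happens at index time, the two constraints can only match together; the cost is that each queryable combination of parallel fields needs its own combined field.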
Re: Searching repeating fields
I'll provide a better example, perhaps it will help in formulating a solution. Suppose I am designing an index that stores invoices. One document corresponds to one invoice, which has a unique id. Any number of employees can make comments on the invoices, and comments have different classifications (request_for_approval, redirection, approval, miscellaneous). Each comment is timestamped. An invoice also contains a long description that is indexed and is stored. So an example document may look like this: invoice_id: 1234 invoice_description:(some text) employee_id: 5 employee_id: 8 employee_id: 12 comment_type: request_for_approval comment_type: redirection comment_type: approval comment: please approve invoice comment: sending invoice to sales comment: invoice approved ts:200811181012 ts:200811181015 ts:200811181340 I want to be able to search by any number of these fields. For example, I may want all of employee 5's requests for approvals from today. It may seem like it would be simpler to just have two separate indexes: a comments index and an invoice index. But I also want to be able to search the invoice description along with the comments. I could set the granularity of the index to the comments level, but then I am duplicating a lot of text in the invoice description. Also, I only care about returning the invoice, so I will have to merge results if the granularity is set to the comments level, which will ruin Lucene's scoring (?). This is a made-up example, but I think it describes pretty thoroughly the problem I'm trying to solve. In my real world problem, I'm storing the full-text of web pages, and I really don't want to be duplicating that much text to set the granularity lower. Mark Ferguson On Tue, Nov 18, 2008 at 2:29 PM, Mark Ferguson [EMAIL PROTECTED]wrote: Thanks for the suggestion, but I think I will need a more robust solution, because this will only work with pairs of fields. 
I should have specified that the example I gave was somewhat contrived, but in practice there could be more than two parallel fields. I'm trying to find a general solution that I can apply to any number of parallel fields holding any kind of data. I was thinking of trying something along the lines of a multi-value field. So for example, I could have: page_user_title: ajax|news (where | is a field separator) The problem is I don't know how to formulate the query that would be equivalent to +username:ajax +page_title:news, or if it's even possible. (I should also mention that I am creating the queries programmatically, not using the query parser, so anything goes). Any other ideas? Mark Ferguson On Tue, Nov 18, 2008 at 1:06 PM, Ian Lea [EMAIL PROTECTED] wrote: How about using variable field names? url: http://www.cnn.com/ page_description: cnn breaking news page_title_ajax: news page_title_paris: cnn news page_title_daniel: homepage username: ajax username: paris username: daniel and search for +user:ajax +page_title_ajax:news or maybe just page_title_ajax:news. Might not even need to store user. -- Ian. On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson [EMAIL PROTECTED] wrote: Hello, I am designing an index in which one url corresponds to one document. Each document also contains multiple parallel repeating fields. For example: Document 1: url: http://www.cnn.com/ page_description: cnn breaking news page_title: news page_title: cnn news page_title: homepage username: ajax username: paris username: daniel In this contrived example, user 'ajax' has saved the URL with the page title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has saved it with 'homepage'. What I need to be able to do is perform a search for a particular user and a particular title, but they must occur together. For example, +user:ajax +page_title:news would return this document, but +user:ajax +page_title:homepage would not. I am open to changing the design of the document (i.e.
using repeating fields isn't required), but I do need to have one document per url. I am looking for suggestions for a strategy on implementing this requirement. Thanks, Mark Ferguson
Re: Searching repeating fields
There has been discussion in the past about how PhraseQuery artificially requires that the Terms you add to it must be in the same field ... you could theoretically modify PhraseQuery to have a type of query that required terms in one field to be within (slop) N positions of a term in a parallel field ... with N=0 you would get something like what you're describing... http://www.nabble.com/Re%3A-One-item%2C-multiple-fields%2C-and-range-queries-p8377712.html (that thread goes on to discuss the complexities of trying to make something like this work if one of the query clauses you want in your phrase is non-trivial, like a RangeQuery) -Hoss
Re: Term numbering and range filtering
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index: Firstly without an age constraint as a baseline: Query +name:tim startup: 0 Hits: 15089 first query: 1004 100 queries: 132 (1.32 msec per query) Now with a cached filter. This is ideal from a speed standpoint but there are too many possible start/end combinations to cache all the filters. Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached RangeFilter) startup: 3 Hits: 11156 first query: 1830 100 queries: 287 (2.87 msec per query) Now with an uncached filter. This is awful. Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery) startup: 3 Hits: 11156 first query: 1665 100 queries: 51862 (yes, 518 msec per query, 200x slower) A RangeQuery is slightly better but still bad (and has a different result set) Query +name:tim age:[18 TO 35] (uncached RangeQuery) startup: 0 Hits: 10147 first query: 1517 100 queries: 27157 (271 msec is 100x slower than the filter) Now with the prebuilt column stride filter: Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt column stride filter) startup: 2811 Hits: 11156 first query: 1395 100 queries: 441 (back down to 4.41msec per query) This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. Tim On 11/11/08 1:27 PM, Michael McCandless [EMAIL PROTECTED] wrote: Paul Elschot wrote: Op Tuesday 11 November 2008 21:55:45 schreef Michael McCandless: Also, one nice optimization we could do with the term number column- stride array is do bit packing (borrowing from the PFOR code) dynamically. 
I.e. since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the right number of bits. Enumerated fields with not many unique values (eg, country, state) would take relatively little RAM. With LUCENE-1231, where the fields are stored column stride on disk, we could do this packing during indexing such that loading at search time is very fast. Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410. I think what you're referring to is PDICT, which has frame exceptions for values that occur infrequently. Yes, let's move the discussion to Jira. Actually I was referring to simple bit-packing. For encoding an array of compact enum terms (eg city, state, color, zip) I'm guessing the exceptions logic won't buy us much and would hurt the seeking needed for column-stride fields. But we should certainly test it. Mike
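The bit-packing Mike describes can be sketched in plain Java (class and method names are mine, not the patch's): with X unique terms in a segment, each entry of the docID-to-term-number array needs only ceil(log2(X)) bits instead of 32, and the entries can be packed back to back into a long[].

```java
// Sketch of dynamic bit-packing for a docID -> term-number array.
public class PackedTermNumbers {

    // Bits needed to represent values 0 .. uniqueTerms-1.
    public static int bitsRequired(int uniqueTerms) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(uniqueTerms - 1));
    }

    // Pack the term numbers into a long[] at 'bits' bits per value.
    public static long[] pack(int[] values, int bits) {
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6, shift = bitPos & 63;
            packed[word] |= (long) values[i] << shift;
            if (shift + bits > 64) { // value straddles two words
                packed[word + 1] |= (long) values[i] >>> (64 - shift);
            }
        }
        return packed;
    }

    // Read back the i-th packed value.
    public static int get(long[] packed, int bits, int i) {
        int bitPos = i * bits;
        int word = bitPos >>> 6, shift = bitPos & 63;
        long v = packed[word] >>> shift;
        if (shift + bits > 64) {
            v |= packed[word + 1] << (64 - shift);
        }
        return (int) (v & ((1L << bits) - 1));
    }

    public static void main(String[] args) {
        // The ~223M terms mentioned earlier in the thread need 28 bits/doc
        // instead of 32; a small enum like 50 states needs only 6.
        System.out.println(bitsRequired(223000000)); // 28
        System.out.println(bitsRequired(50));        // 6
    }
}
```

This is the plain-packing variant; the PFOR/PDICT exception logic Paul mentions would sit on top of the same layout.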
Re: Term numbering and range filtering
Op Wednesday 19 November 2008 00:43:56 schreef Tim Sturge: I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index: Firstly without an age constraint as a baseline: Query +name:tim startup: 0 Hits: 15089 first query: 1004 100 queries: 132 (1.32 msec per query) Now with a cached filter. This is ideal from a speed standpoint but there are too many possible start/end combinations to cache all the filters. Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached RangeFilter) startup: 3 Hits: 11156 first query: 1830 100 queries: 287 (2.87 msec per query) Now with an uncached filter. This is awful. Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery) startup: 3 Hits: 11156 first query: 1665 100 queries: 51862 (yes, 518 msec per query, 200x slower) A RangeQuery is slightly better but still bad (and has a different result set) Query +name:tim age:[18 TO 35] (uncached RangeQuery) startup: 0 Hits: 10147 first query: 1517 100 queries: 27157 (271 msec is 100x slower than the filter) Now with the prebuilt column stride filter: Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt column stride filter) With Allow Filter as clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 one could even skip the ConstantScoreQuery with this. Unfortunately 1345 is unfinished for now. startup: 2811 Hits: 11156 first query: 1395 100 queries: 441 (back down to 4.41msec per query) This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible... 
Regards, Paul Elschot.
Re: Term numbering and range filtering
With Allow Filter as clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 one could even skip the ConstantScoreQuery with this. Unfortunately 1345 is unfinished for now. That would be interesting; I'd like to see how much performance improves. startup: 2811 Hits: 11156 first query: 1395 100 queries: 441 (back down to 4.41msec per query) This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible... I think this is generally useful for range and set queries on non-text based fields (dates, location data, prices, general enumerations). These all have the required property that there is only one value (term) per document. I've opened LUCENE-1461. Tim Regards, Paul Elschot.
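For readers wanting the shape of the trick: the filter is fast precisely because the field has exactly one value per document, so a startup pass can load an int per doc and range filtering degenerates into an array comparison. A toy model (not the LUCENE-1461 code; the class name is mine):

```java
import java.util.BitSet;

// Toy model of a column stride range filter: one int per document, loaded
// once at process start, then each range query is a linear scan with one
// comparison per doc and no term enumeration at all.
public class ColumnStrideRangeFilter {

    private final int[] values; // values[docId] = that doc's single field value

    public ColumnStrideRangeFilter(int[] values) {
        this.values = values;
    }

    // Bit docId is set iff lower <= value <= upper; this plays the role of
    // the DocIdSet handed to the ConstantScoreQuery in Tim's numbers.
    public BitSet docIdSet(int lower, int upper) {
        BitSet bits = new BitSet(values.length);
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] >= lower && values[doc] <= upper) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] ages = {17, 18, 35, 36, 25};
        ColumnStrideRangeFilter f = new ColumnStrideRangeFilter(ages);
        System.out.println(f.docIdSet(18, 35)); // docs 1, 2 and 4 match
    }
}
```

The startup cost (2811 ms in Tim's run) is this one pass over the index; after that, any start/end combination is cheap, which is what makes it beat the cached-RangeFilter approach when the set of possible ranges is large.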
Re: InstantiatedIndex help + first impression
The actual performance depends on how much you load to the index. Can you tell us how many documents and how large these documents are that you have in your index? Compared with RAMDirectory I've seen performance boosts of up to 100x in a small index that contains (1-20) Wikipedia sized documents, an index I used to apply user search agents on as new data arrived to the primary index. up to 25x when placing massive amounts of span queries on the apriori index in LUCENE-626. This index contained tens of thousands of documents with only a few (5-20) terms each. up to 15x in a relatively large ngram index for classifications using LUCENE-626. This is pure skipTo operations. Regarding the fuzzy query, try to see how much time was spent rewriting the query and then how much time was spent querying. I'm almost certain you'll notice that the time spent rewriting the query (comparing edit distance between the terms of the index and the query term) is overwhelming compared to the time spent searching for the rewritten query. I.e. this is probably as much a store related expense as it is a Levenshtein calculation expense. karl (this is my second reply, the first one seems to be lost in space?) On Mon, Nov 17, 2008 at 1:37 AM, Darren Govoni [EMAIL PROTECTED] wrote: After I switched to InstantiatedIndex from RAMDirectory (but using the reader from my RAMDirectory to create the InstantiatedIndex), I see a less than 25% (.25) improvement in speed. Nowhere near the 100x (100.00) speed mentioned in the documentation. Probably I am doing something wrong. I am also using a fuzzy query, e.g. word:house~0.80, but I'd expect the improvement to be because of physical representation (memory graph) and mostly unaffected by the query. no? Could there be some lazy loading going on in RAMDirectory that prevents InstantiatedIndex from building out its graph and getting the expected speed? thanks to anyone who can verify this. On Sun, 2008-11-16 at 12:37 -0500, Darren Govoni wrote: Yeah.
That makes sense. It's not too hard to wrap those extra steps so I can end up with something simpler too. Like: iindex = InstantiatedIndex(path/to/my/index) I'm lazy so the intermediate hoops to jump through clutter my code. Hehe. :) Darren On Sun, 2008-11-16 at 11:46 -0500, Mark Miller wrote: Can you start with an empty index? Then how about: // Adding these iindex = InstantiatedIndex() ireader = iindex.indexReaderFactory() isearcher = IndexSearcher(ireader) If you want a copy from another IndexReader though, you have to get that reader from somewhere right? - Mark Darren Govoni wrote: Hi Mark, Thanks for the tips. Here's what I will try (pseudo-code) endirectory = RAMDirectory(index/dictionary.en) ensearcher = IndexSearcher(endirectory) // Adding these reader = ensearcher.getIndexReader() iindex = InstantiatedIndex(reader) ireader = iindex.indexReaderFactory() isearcher = IndexSearcher(ireader) Kind of a roundabout way to get an InstantiatedIndex I guess, but maybe there's a briefer way? Thank you. Darren On Sun, 2008-11-16 at 10:50 -0500, Mark Miller wrote: Check out the docs at: http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/index.html There is a performance graph there to check out. The code should be fairly straightforward - you can make an InstantiatedIndex that's empty, or seed it with an IndexReader. Then you can make an InstantiatedReader or Writer, which take the InstantiatedIndex as a constructor arg. You should be able to just wrap that InstantiatedReader in a regular Searcher. Darren Govoni wrote: Hi gang, I am trying to trace the 2.4 API to create an InstantiatedIndex, but it's rather difficult to connect directory, reader, search, index etc. just reading the javadocs. I have a (POI - plain old index) directory already and want to create a faster InstantiatedIndex and IndexSearcher to query it like before. What's the proper order to do this?
Also, if anyone has any empirical data on the performance or reliability of InstantiatedIndex, I'd be curious. Thanks for the tips! Darren
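Karl's point about rewrite cost, sketched: FuzzyQuery's rewrite walks the term dictionary and computes an edit distance against each term, so with many terms the rewrite dominates the search. The per-term work is the classic O(m*n) Levenshtein dynamic program (this is an illustration, not Lucene's FuzzyTermEnum code, and the similarity formula is only roughly what the 2.x-era matching uses):

```java
// Per-term work done during a fuzzy query rewrite: one edit-distance
// computation per dictionary term.
public class EditDistance {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Roughly the similarity a fuzzy match derives from the distance.
    public static double similarity(String query, String term) {
        return 1.0 - (double) distance(query, term)
                     / Math.min(query.length(), term.length());
    }

    public static void main(String[] args) {
        // word:house~0.80 keeps terms scoring >= 0.80; "mouse" scores 0.8.
        System.out.println(similarity("house", "mouse"));
    }
}
```

Doing this once per dictionary term is the cost Karl describes; the rewritten query that results is cheap by comparison, regardless of whether the store is RAMDirectory or InstantiatedIndex.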
2.4 Performance
On an index of around 20 gigs I've been seeing a performance drop of around 35% after upgrading to 2.4 (measured on ~1 requests identical requests, executed in parallel against a threaded lucene / apache setup, after a roughly 1 query warmup). The principal changes I've made so far are just to switch to NIOFSDirectories and use read-only index readers. Our design is roughly as follows: we have some pre-query filters, queries typically involving around 25 clauses, and some post-processing of hits. We collect counts and filter post query using a hit collector, which uses the (now deprecated) bits() method of Filters. I looked at converting us to use the new DocIdSet infrastructure (to gain the supposed 30% speed bump), but this seems to be somewhat problematic as there is no guarantee for whether we will get back a set we can do binary operations on (for example, if we get back a SortedVIntList, we're pretty much out of luck - the cardinality of the set is large (as it's a SortedVIntList), so we can't coerce it into another type, and it doesn't have the set operations we need to use it directly. Has anyone else seen this? Is there anything else we should be changing in the upgrade to 2.4? Thanks, -Matt
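One workaround for the set-operations problem, assuming nothing about the concrete set class: whatever DocIdSet comes back (SortedVIntList included), its iterator can be drained once into a plain bitset, which does support the binary AND/OR a hit collector needs. A simplified model of that conversion (the interface below mimics, but is not, Lucene's DocIdSetIterator):

```java
import java.util.BitSet;

// Drain any doc-id iterator into a bitset so set operations become possible
// again, at the cost of one extra pass per filter per query.
public class DrainToBitSet {

    interface DocIdIterator {
        int next(); // next matching docID, or -1 when exhausted
    }

    public static BitSet toBitSet(DocIdIterator it, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc = it.next(); doc != -1; doc = it.next()) {
            bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        final int[] docs = {2, 5, 9}; // pretend this came from a SortedVIntList
        DocIdIterator it = new DocIdIterator() {
            int i = 0;
            public int next() { return i < docs.length ? docs[i++] : -1; }
        };
        System.out.println(toBitSet(it, 16)); // {2, 5, 9}
    }
}
```

The resulting bitset can then be and()-ed / or()-ed like the old bits() result; whether the extra pass eats the hoped-for 30% speedup would need measuring on the actual 20 GB index.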
Re: InstantiatedIndex help + first impression
On Wed, Nov 19, 2008 at 3:27 AM, karl wettin [EMAIL PROTECTED] wrote: rewritten query. I.e. this is probably as much a store related expense as it is a Levenshtein calculation expense. this is probably *not* as much a store related.. that is. karl
Re: Reopen IndexReader
I had the same kind of problem and I somehow managed to find a workaround by initializing IndexSearcher from the new reader. try { IndexReader newReader = reader.reopen(); if (newReader != reader) { // reader was reopened reader.close(); reader = null; } reader = newReader; searcher = new IndexSearcher(newReader); } catch (Exception e) { e.printStackTrace(); } --- On Tue, 11/18/08, Michael McCandless [EMAIL PROTECTED] wrote: From: Michael McCandless [EMAIL PROTECTED] Subject: Re: Reopen IndexReader To: java-user@lucene.apache.org Date: Tuesday, November 18, 2008, 7:52 AM Well... we certainly do our best to have each release be stable, but we do make mistakes, so you'll have to use your own judgement on when to upgrade. However, it's only through users like yourself upgrading that we then find and fix any uncaught issues in each new release. Mike Ganesh wrote: I am creating IndexSearcher using String, this is working fine with version 2.3.2. I tried the Directory ctor of IndexSearcher instead and it is working fine with v2.4. I have recently upgraded from v2.3.2 to 2.4. Is v2.4 stable so I can move forward with it, or shall I revert back to 2.3.2? Regards Ganesh - Original Message - From: Michael McCandless [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, November 18, 2008 4:59 PM Subject: Re: Reopen IndexReader Did you create your IndexSearcher using a String or File (not Directory)? If so, it sounds like you are hitting this issue (just fixed this morning, on 2.9-dev (trunk)): https://issues.apache.org/jira/browse/LUCENE-1453 The workaround is to use the Directory ctor of IndexSearcher. Mike Ganesh wrote: Hello all, I am using version 2.4.
The following code throws AlreadyClosedException:

IndexReader reader = searcher.getIndexReader();
IndexReader newReader = reader.reopen();
if (reader != newReader) {
    reader.close();
    boolean isCurrent = newReader.isCurrent(); // throws the exception
}

Full exception:

org.apache.lucene.store.AlreadyClosedException: this Directory is closed
        at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
        at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
        at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
        at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
        at MailIndexer.IndexSearcherEx.reOpenDB(IndexSearcherEx.java:102)

Please correct me if I am wrong. Regards Ganesh
Reg two versions of lucene on the same machine
Hi, I am trying to upgrade the version of Lucene from 1.2 to 2.4. Can we do this directly? Is it possible to have two versions of Lucene on the same machine? Shireesha This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful.
Re: Reg two versions of lucene on the same machine
Hi Shireesha, I'm not sure what it is that you have been using, but I'm kinda sure that you'd have to check for deprecated things as well as improved ones while upgrading. 1.2 to 2.4 is a huge jump certainly, with the compound index structure etc. coming into place. You would have to try it and check if your code works the same (I doubt it would, though). About having two versions of Lucene on the same machine: of course, yes, it is as good as having 2 (or more) Java jars. I am presuming that you place your lucene core jar in the project library directory and not in the jre/lib/ext directory, in which case you would have issues placing the 2 jars. It would be better if you completely remove lucene jars from the implicitly included library dir, and place them in a different folder (and include that in your classpath). Hope that solves a bit of your doubt (at least)! -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw On Wed, Nov 19, 2008 at 11:57 AM, [EMAIL PROTECTED] wrote: Hi, I am trying to upgrade the version of Lucene from 1.2 to 2.4. Can we do this directly? Is it possible to have two versions of Lucene on the same machine? Shireesha