Problem with sorting on NumericFields
I got stuck on a problem using NumericFields with Lucene 2.9.3. I add values to the document by

doc.add(new NumericField("minprice").setDoubleValue(net_price));

If I search with a sort on this field, I get this error:

java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
at org.apache.lucene.util.NumericUtils.prefixCodedToInt(NumericUtils.java:233)
at org.apache.lucene.search.FieldCache$8.parseFloat(FieldCache.java:256)
at org.apache.lucene.search.FieldCacheImpl$FloatCache.createValue(FieldCacheImpl.java:514)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getFloats(FieldCacheImpl.java:487)
at org.apache.lucene.search.FieldCacheImpl$FloatCache.createValue(FieldCacheImpl.java:504)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getFloats(FieldCacheImpl.java:487)
at org.apache.lucene.search.FieldComparator$FloatComparator.setNextReader(FieldComparator.java:269)
at org.apache.lucene.search.TopFieldCollector$MultiComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:435)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:257)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:240)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:181)
at org.apache.lucene.search.Searcher.search(Searcher.java:90)

The sort field as seen by the debugger:

* sort_fields = {org.apache.lucene.search.sortfield...@9010}
* [0] = {org.apache.lucene.search.sortfi...@9011}
* field = {java.lang.str...@8642} "minprice"
* type = 5
* locale = null
* reverse = true
* factory = null
* parser = null
* comparatorSource = null
* useLegacy = false

I have run out of ideas about what might go wrong. I looked at the index with Luke and do not see anything special. As this happens with the same code on other servers too, it looks like some kind of programming error. Any hints?

Thx
Uwe
RE: Problem with sorting on NumericFields
Thx Uwe, after sleeping on the problem... the solution just hit me ;) I index a double for the NumericField, but my SortField was set up as a float. (Maybe this is something for a NumericField FAQ.)

Thx
Uwe

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Tuesday, October 26, 2010 09:30
To: java-user@lucene.apache.org
Subject: RE: Problem with sorting on NumericFields

This happens if the field still contains other value types, maybe from deleted documents. The problem is that even if no document contains the old field encoding anymore, there can still be leftover terms in the term index. The FieldCache code loads those terms (even if no documents are attached to them any longer) and tries to parse them. So if the field previously held a different type, such as a conventional plain-text-encoded numeric, parsing those old terms fails. You should reindex everything, or at least optimize the index to get rid of the deleted documents and their terms.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
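A minimal sketch of the fix described above, assuming the "minprice" field from the thread and a pre-existing searcher, query and net_price. In Lucene 2.9, the type = 5 seen in the debugger dump is SortField.FLOAT, while the field was indexed as a trie-encoded double:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Indexing: the value is trie-encoded as a DOUBLE.
Document doc = new Document();
doc.add(new NumericField("minprice").setDoubleValue(net_price));

// Searching: the SortField type must match the indexed encoding.
// SortField.FLOAT (type 5, as in the debugger dump above) makes the
// FieldCache parse the double-encoded terms as floats and throws the
// NumberFormatException shown in the stack trace; DOUBLE works.
Sort sort = new Sort(new SortField("minprice", SortField.DOUBLE, true)); // true = reverse
TopDocs hits = searcher.search(query, null, 20, sort);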
RE: How can I merge .cfx and .cfs into a single .cfs file?
Index everything into one directory and determine the size of all files in it. From http://lucene.apache.org/java/3_0_1/fileformats.html:

"Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When the compound file format is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx."

And on the compound file itself: Compound File (.cfs) -- an optional "virtual" file consisting of all the other index files, for systems that frequently run out of file handles.

Uwe

-Original Message-
From: 张志田 [mailto:zhitian.zh...@dianping.com]
Sent: Wednesday, May 5, 2010 08:24
To: java-user@lucene.apache.org
Subject: How can I merge .cfx and .cfs into a single .cfs file?

Hi all,

I have an index task which indexes thousands of records with Lucene 3.0.1. My confusion is that Lucene always creates a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is this by design? If yes, is there any way or configuration to merge all of the index files into a single one?

By the way, I have a check that validates the index data: if the size of the .cfs increases dramatically compared to the file generated last time, something may be wrong and a warning message is thrown. This is the reason I want to generate a single .cfs file. Any other suggestion for the index validation?

Can anybody give me a hand? Thanks in advance.

Garry
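If the goal is the size check rather than literally one file, a sketch of the "measure the whole directory" advice above, against the Lucene 3.0.1 Directory API (the index path is hypothetical):

import java.io.File;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical location
long totalBytes = 0;
for (String name : dir.listAll()) {     // segments_N, .cfs, .cfx, ...
    totalBytes += dir.fileLength(name); // sum everything the index consists of
}
dir.close();

Comparing totalBytes between runs gives the same warning signal without depending on how Lucene splits the segment files.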
RE: Relevancy Practices
Regarding part 3, data quality: for our search domain (catalog products) we very often face the problem that the search data is full of acronyms and abbreviations like

cable,nym-j,pvc,3x2.5mm²
or
dvd-/cd-/usb-carradio,4x50W,divx,bl

We solved this with a combination of normalization for better data quality (fewer variations) and a tolerant sloppy phrase search, where a search token only needs to partly match an indexed token. We use a dictionary lookup into the indexed tokens of some fields and expand the user's query with a well-weighted set of search terms. It took us some iterations to get this right and fast enough to search several million products. The next step on our list is facets.

Uwe

-Original Message-
From: mbennett.idea...@gmail.com [mailto:mbennett.idea...@gmail.com] On Behalf Of Mark Bennett
Sent: Thursday, April 29, 2010 16:59
To: java-user@lucene.apache.org
Subject: Re: Relevancy Practices

Hi Grant,

You're welcome to use any of my slides (Dave's got them), with attribution of course. BUT... have you considered a section on something like "why the hell do you think relevancy tweaking is gonna save you!?!?" Basically: as a corpus grows exponentially, so do result list sizes, so ALL relevancy tweaks will eventually fail, and FACETS (or other navigators) are the answer. I've got slides on that as well. Of course relevancy matters, but it's only ONE prong of perhaps a three-pronged approach:

1: Organic relevancy and top query suggestions
2: Result list navigators, the best the system can support
3: Data quality (spidering, metadata quality, source weighting, etc.)

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Thu, Apr 29, 2010 at 7:14 AM, Grant Ingersoll gsing...@apache.org wrote:

I'm putting on a talk at Lucene Eurocon (http://lucene-eurocon.org/sessions-track1-day2.html#1) on "Practical Relevance", and I'm curious what people put into practice for testing and improving relevance. I have my own inclinations, but I don't want to muddy the water just yet. So, if you have a few moments, I'd love to hear responses to the following questions:

What worked?
What didn't work?
What didn't you understand about it?
What tools did you use?
What tools did you wish you had, either for debugging relevance or fixing it?
How much time did you spend on it?
How did you avoid over/under tuning?
At what stage of development/testing/production did you decide to do relevance tuning? Was that timing planned or not?

Thanks,
Grant
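A minimal sketch of the weighted query expansion described at the top of this thread (the field name, terms and boosts are made up, and the dictionary lookup that produces the variants is assumed to exist):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery expanded = new BooleanQuery();
// The token the user typed gets full weight...
expanded.add(new TermQuery(new Term("description", "nym-j")), BooleanClause.Occur.SHOULD);
// ...and each variant from the dictionary lookup gets a smaller, hand-tuned boost.
TermQuery variant = new TermQuery(new Term("description", "nym"));
variant.setBoost(0.4f);
expanded.add(variant, BooleanClause.Occur.SHOULD);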
MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
We have an IndexWriter.optimize running on a 4-processor Xeon machine, Java 1.5, Win2003. We get a repeatable FileNotFoundException because the path to the file is wrong:

D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs

instead of

D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140\_0.cfs

I have no idea what is different here, because we use the same code successfully on other machines (even multi-core).

1. 2009.08.28 13:10:30 : [B:60043][N:org.apache.lucene.index.MergePolicy$MergeException]
org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs (The system cannot find the file specified)
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.FileNotFoundException: D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs (The system cannot find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:70)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:321)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:260)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4220)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3884)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

2. 2009.08.28 13:10:31 : [B:60043][N:java.io.IOException]
java.io.IOException: background merge hit exception: _0:c71339-_0 _1:c36232-_0 _2:c37691-_0 _3:c29335-_0 _4:c29954-_0 _5:c33617-_0 _6:c37092-_0 _7:c35483-_0 _8:c25244-_0 _9:c31566-_0 _a:c4891-_0 into _b [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2273)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2218)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2198)

I have looked through the code of FSDirectory:

// Inherit javadoc
public IndexInput openInput(String name, int bufferSize) throws IOException {
    ensureOpen();
    return new FSIndexInput(new File(directory, name), bufferSize);
}

Checking further, one would assume that in Win32FileSystem the following is not set:

slash = ((String) AccessController.doPrivileged(
    new GetPropertyAction("file.separator"))).charAt(0);

Which sounds more than strange to me... Any idea?

Regards
Uwe Goetzke
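As a sanity check of the java.io.File behavior questioned above: File(parent, child) inserts the platform separator itself, so the missing backslash is unlikely to come from FSDirectory's openInput. A tiny test (the path is the one from the log):

import java.io.File;

public class SeparatorCheck {
    public static void main(String[] args) {
        File dir = new File("D:\\data0\\impact\\ordering\\prod\\work\\search_index\\s_index1251456210140");
        // java.io.File adds the separator between parent and child itself;
        // this prints ...\s_index1251456210140\_0.cfs
        System.out.println(new File(dir, "_0.cfs").getPath());
    }
}

If that prints the correct path, the bad path was probably already present in the directory string that FSDirectory was opened with.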
RE: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
Oops, sorry: 2.4.1.

Thx
Uwe Goetzke

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Monday, August 31, 2009 17:42
To: java-user@lucene.apache.org
Subject: RE: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file

Which Lucene version? The RC2 of 2.9?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: Uwe Goetzke [mailto:uwe.goet...@healy-hudson.com]
Sent: Monday, August 31, 2009 5:40 PM
To: java-user@lucene.apache.org
Subject: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
RE: Most frequently indexed term
Hello Ganesh,

What about making a separate index for each day, running your analysis on it, and merging that index afterwards? I am not sure, but I think this might work. Use MultiSearcher for the search.

Regards
Uwe Goetzke

-Original Message-
From: Ganesh [mailto:emailg...@yahoo.co.in]
Sent: Monday, June 8, 2009 12:31
To: java-user@lucene.apache.org
Subject: Re: Most frequently indexed term

Thanks. This works well. The logic is:

1. Do the search; for every document get the list of terms and their frequencies.
2. Use SortedTermVectorMapper to generate a list of unique terms and their frequencies.
3. Sort them to get the top N most frequently indexed terms in a given date range (or any other criteria).

My question is: I need to get the top 20 most frequently indexed terms in a day, and one million documents could be indexed in a day. I would need to traverse the one million records and store the unique terms and their frequencies, which may consume a huge amount of memory. Is there any other way out? Without using term vectors I can get the most frequently indexed terms of a whole database; similarly, is there any way to get the most frequently indexed terms in a date range or a subset of the database?

Regards
Ganesh

- Original Message -
From: Preetham Kajekar preet...@cisco.com
To: java-user@lucene.apache.org
Sent: Tuesday, May 26, 2009 11:08 PM
Subject: Re: Most frequently indexed term

Have a look at http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index (I have not tried the above out)

Ganesh wrote:

Hello All, I need to build some stats. I need to know the top 5 most frequently indexed terms in a date range (in a day or a month). Any idea how to achieve this?

Regards
Ganesh
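If the index-per-day route is taken, the "top 20 of one day" step can be a plain TermEnum scan over that day's index with a bounded priority queue; no term vectors and no per-document traversal needed. Note that docFreq() counts documents containing a term, not total occurrences. A sketch (the field name "contents" and the dayDir variable are assumptions; contrib also ships org.apache.lucene.misc.HighFreqTerms, which does much the same):

import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

IndexReader reader = IndexReader.open(dayDir); // the one-day index suggested above
PriorityQueue<Object[]> top = new PriorityQueue<Object[]>(20, new Comparator<Object[]>() {
    public int compare(Object[] a, Object[] b) { // smallest docFreq at the head, so it can be evicted
        return ((Integer) a[0]).compareTo((Integer) b[0]);
    }
});
TermEnum terms = reader.terms();
while (terms.next()) {
    Term t = terms.term();
    if (!"contents".equals(t.field())) continue;  // only count one field
    top.add(new Object[] { Integer.valueOf(terms.docFreq()), t.text() });
    if (top.size() > 20) top.poll();              // keep only the 20 most frequent
}
terms.close();
reader.close();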
RE: Transforming German umlauts like ö, ä, ü, ß into oe, ae, ue, ss
public class UmlautFolder {

    // Note: the original post begins mid-switch, so the method head, the class
    // and method names, and the cases before '\u00EE' are reconstructed here.
    public static String foldUmlauts(String input) {
        StringBuilder output = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            switch (input.charAt(i)) {
                // ... (cases for the earlier characters, e.g. ä -> ae, are missing from the post) ...
                case '\u00EE': // î
                case '\u00EF': // ï
                    output.append("i");
                    break;
                case '\u00F0': // ð
                    output.append("d");
                    break;
                case '\u00F1': // ñ
                    output.append("n");
                    break;
                case '\u00F2': // ò
                case '\u00F3': // ó
                case '\u00F4': // ô
                case '\u00F5': // õ
                case '\u00F8': // ø
                    output.append("o");
                    break;
                case '\u00F6': // ö
                case '\u0153': // œ
                    output.append("oe");
                    break;
                case '\u00DF': // ß
                    output.append("ss");
                    break;
                case '\u00FE': // þ
                    output.append("th");
                    break;
                case '\u00F9': // ù
                case '\u00FA': // ú
                case '\u00FB': // û
                    output.append("u");
                    break;
                case '\u00FC': // ü
                    output.append("ue");
                    break;
                case '\u00FD': // ý
                case '\u00FF': // ÿ
                    output.append("y");
                    break;
                default:
                    output.append(input.charAt(i));
                    break;
            }
        }
        return output.toString();
    }
}

Regards
Uwe Goetzke
Head of Product Development
Healy Hudson GmbH Procurement Retail Solutions

-Original Message-
From: Sascha Fahl [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 13:07
To: java-user@lucene.apache.org
Subject: Transforming German umlauts like ö, ä, ü, ß into oe, ae, ue, ss

Hi,

what is the best way to transform the German umlauts ö, ä, ü, ß into oe, ae, ue, ss during analysis?

Thanks,
Sascha Fahl
Softwareentwicklung

evenity GmbH
Zu den Mühlen 19
D-35390 Gießen
Mail: [EMAIL PROTECTED]
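If writing a custom TokenFilter is overkill, one hedged alternative is to run such a mapping over the raw text before it reaches the analyzer, and over query strings as well, so both sides see the same folded form (foldUmlauts is the reconstructed method above; the field name and the Lucene 2.3-era Field API are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// Fold before analysis; apply the same foldUmlauts() to query input so
// "Gießen" and "Giessen" produce identical tokens.
doc.add(new Field("name", foldUmlauts(rawText), Field.Store.YES, Field.Index.TOKENIZED));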
RE: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Hi Jay,

Sorry for the confusion. I wrote NgramStemFilter in an early stage of the project; it is essentially the same as Otis's NGramTokenFilter, with the addition that I add begin and end token markers (e.g. "word" becomes "_word_", which yields the bigrams _w wo or rd d_). As I modified a lot of our Lucene code, developed since Lucene 1.2, to move to a 2.x version, I did not notice the existence of NGramTokenFilter. Stemming is not useful for our problem domain (product catalogs) anyway.

We chained a WhitespaceTokenizer with a modified version of ISOLatin1AccentFilter to normalize some character-based language aspects (e.g. ß -> ss, ö -> oe), then lowercase the tokens before taking the bigrams. The real advantage for us is the TolerantPhraseQuery (see my other post "RE: Implement a relaxed PhraseQuery?"), which gives us a first step towards less language-dependent searching.

Regards
Uwe

-Original Message-
From: yu [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 26, 2008 05:26
To: java-user@lucene.apache.org
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Sorry for my ignorance, I am looking for NgramStemFilter specifically. Are you suggesting that it's the same as NGramTokenFilter? Does it have stemming in it?

Thanks again.
Jay

Otis Gospodnetic wrote:

Sorry, I wrote this stuff, but forgot the naming. Look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: yu [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Otis,

I checked that contrib before and could not find NgramStemFilter. Am I missing another contrib? Thanks for the link!
Jay

Otis Gospodnetic wrote:

Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*. Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Jay [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 6:15:54 PM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Sorry, I could not find the filter in the 2.3 API class list (core + contrib + test). I am not aware of a Lucene config file either. Could you please tell me where it is in the 2.3 release?
Thanks!
Jay

Otis Gospodnetic wrote:

Jay,

Have a look at Lucene config, it's all there, including tests. This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be a set of bigrams). You can specify the n-gram size, and even min and max n-gram sizes.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Jay [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter stemming and word n-gram identification?
Thanks!
Jay

Uwe Goetzke wrote:

Hi Ivan,
No, we do not use StandardAnalyser or StandardTokenizer.
Most data is processed by:

fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
result = new org.apache.lucene.analysis.LowerCaseFilter(result);
result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer

We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring in a doc the greatest bigram clusters covering the phrase tokens.

Best Regards
Uwe
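For comparison, roughly the same chain built only from stock classes: the contrib NGramTokenFilter instead of the custom NgramStemFilter (so no begin/end markers), and the unmodified ISOLatin1AccentFilter (so ö -> o rather than oe). A sketch against the Lucene 2.3-era API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public final class BigramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new ISOLatin1AccentFilter(result); // accent folding only
        result = new LowerCaseFilter(result);
        return new NGramTokenFilter(result, 2, 2);  // bigrams only
    }
}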
RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Jake,

With the bigram-based index we gave up the struggle to find a well-working language-based index. We had implemented soundex (or different sound-alikes) and hyphenation, but failed to deliver a search result that was explainable to users (why is this ranked higher, and so on...). One reason may be that product descriptions contain a lot of abbreviations.

The index size grew about 30%. Search performance seems a bit slower, but I have no concrete figures. The evaluation for one document is a bit more complex than a phrase query, one reason of course being that more terms are evaluated; nevertheless it is quite good. Search relevance improved tremendously: missing characters, switched letters and partial word fragments are no real problem any more (depending, of course, on the length of the search word). The search term "weekday" also finds "day of the week", and "disabigaute" finds "disambiguate". The algorithms I developed might not fit other domains, but for multi-language product catalogs they work quite well for us. So far...

Regards
Uwe

-Original Message-
From: Jake Mannix [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 25, 2008 17:13
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Uwe,

This is a little off thread-topic, but I was wondering how your search relevance and search performance have fared with this bigram-based index. Is it significantly better than before you used the NGramAnalyzer?

-jake

On 3/24/08, Uwe Goetzke [EMAIL PROTECTED] wrote:

Hi Ivan,
No, we do not use StandardAnalyser or StandardTokenizer.
RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Hi Ivan,

No, we do not use StandardAnalyser or StandardTokenizer. Most data is processed by:

fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
result = new org.apache.lucene.analysis.LowerCaseFilter(result);
result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer

We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring in a doc the greatest bigram clusters covering the phrase tokens.

Best Regards
Uwe

-Original Message-
From: Ivan Vasilev [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 21, 2008 16:25
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Uwe,

Could you tell what Analyzer you use when you measured such a big indexing speedup? If you use StandardAnalyzer (which uses StandardTokenizer), the reason may be in it. See the second-to-last report in the thread "Indexing Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl, which is generated by JFlex instead of JavaCC.

I am asking because I noticed a great speedup in adding documents to the index in our system; we have time control on this in debug mode. NOW THEY ARE ADDED 5 TIMES FASTER!!! But at the same time, the total indexing process in our case improved only about 8%. As our system is very big and complex, I am wondering whether the whole indexing process really is reduced so remarkably and our system causes this slowdown, or whether Lucene does some optimizations on the index, merges or something else, and that is the reason the total indexing process is not so much faster.

Best Regards,
Ivan

Uwe Goetzke wrote:

This week I switched the Lucene library version on one customer system. The indexing time went down from 46m32s to 16m20s for the complete task including optimisation. Great job! We index product catalogs from several suppliers; in this case around 56,000 product groups and 360,000 products including descriptions were indexed.

Regards
Uwe
RE: Implement a relaxed PhraseQuery?
Hi Cuong,

I have written a TolerantPhraseScorer, starting from the code of PhraseScorer, but I think I have modified it too much for it to be generally useful. We use it with bigram clusters, so it does not need the slop factor for scoring but has a tolerance factor (depending on the length of the phrase). Here are the most relevant code fragments to start with. The idea is to keep the queue ordered (by calling firstToLast2 and moveLast). I have not yet checked the code for optimisations; if you find one, I would be glad to hear about it... ;-)

protected TolerantPhrasePositions first, last, reallast;
// "last" points to the last tpp for the doc, varying from tolerance to phrase size ("reallast")
protected int tolerance;

/**
 * Similar to PhraseScorer, but with a tolerance factor.
 *
 * @see PhraseScorer
 */
TolerantPhraseScorer(Weight weight, TermPositions[] tps, int[] positions,
                     Similarity similarity, byte[] norms, int tolerance) {
    super(similarity);
    this.norms = norms;
    this.weight = weight;
    this.value = weight.getValue();
    this.tolerance = tolerance;
    termsize = 0;
    // convert tps to a list
    for (int i = 0; i < tps.length; i++) {
        if (tps[i] != null) {
            TolerantPhrasePositions pp = new TolerantPhrasePositions(tps[i], positions[i]);
            termsize++;
            if (reallast != null) { // add next to end of list
                reallast.next = pp;
                pp.previous = reallast;
            } else {
                first = pp;
            }
            reallast = pp;
            if ((termsize >= tolerance) && (last == null))
                last = pp;
        }
    }
    pq = new TolerantPhraseQueue(termsize); // construct empty pq
}

public boolean next() throws IOException {
    if (firstTime) {
        init();
        firstTime = false;
    } else if (more) {
        int doc = last.doc;
        while (doc == last.doc) {
            more = last.next(); // trigger further scanning
            moveLast();
        }
    }
    return doNext();
}

// next without initial increment
private boolean doNext() throws IOException {
    while (more) {
        while (more && first.doc < last.doc) { // find doc w/ all the terms
            more = first.skipTo(last.doc);     // skip first up to last
            firstToLast2();                    // and move it to the end
        }
        if (more) {
            // found a doc with all of the terms
            freq = phraseFreq(); // check for phrase
            if (freq == 0.0f) {  // no match
                int doc = last.doc;
                while (doc == last.doc) {
                    more = last.next(); // trigger further scanning
                    moveLast();
                }
            } else {
                return true; // found a match
            }
        }
    }
    return false; // no more matches
}

private void firstToLast2() {
    TolerantPhrasePositions newfirst = first.next;
    TolerantPhrasePositions test = last;
    TolerantPhrasePositions insertp = test;
    while ((test != null) && (first.doc >= test.doc)) {
        insertp = test;
        test = test.next;
    }
    if (insertp == null) { // last element; should not happen
        System.out.println("firstToLast2 - insertp == null");
    } else {
        first.previous = insertp; // link in
        first.next = insertp.next;
        if (first.next != null)
            first.next.previous = first;
        insertp.next = first;
feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
This week I switched the Lucene library version on one customer system. The indexing time went down from 46m32s to 16m20s for the complete task, including optimisation. Great job! We index product catalogs from several suppliers; in this case around 56,000 product groups and 360,000 products including descriptions were indexed.

Regards
Uwe
RE: Does Lucene support partition-by-keyword indexing?
Hi,

I do not yet fully understand what you want to achieve. You want to spread the index, split by keywords, to reduce the time to distribute indexes? And you want to distribute queries to the nodes based on the same split mechanism?

You have several nodes with different kinds of documents. You want to build one index for all nodes and split and distribute the index based on a set of keywords specific to each node, so that each query involves communicating with a constant number of nodes. Do documents at the nodes contain only such keywords? I doubt it. So you need a reference to where each indexed doc can be found anyway, and have to retrieve it from its node for display.

You could index at each node, merge all indexes from all nodes, and distribute the combined index. On what criteria can you split the queries? If you have a combined index, each node can distribute queries to other nodes based on statistical data found in the term distribution. You need to merge the results anyway. I doubt that this kind of overhead is worth the trouble, because you introduce a lot of single points of failure, and the scalability seems limited because you would need to recalibrate the whole network when adding a new node.

Why don't you distribute the complete index? (We do this after zipping it locally and unzipping it later on the receiver node; the size is less than one third for transferring.) Each node should have some activity indicator; distribute the complete query to the node with the smallest activity. That way you get redundancy and do not need to split queries and merge results. OK, one evil query can bring a node down, but the network is still working.

Do you have any results using Lucene on a single node for your approach? How many queries and how many documents do you expect?

Regards
Uwe

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ??
Sent: Sunday, March 2, 2008 03:05
To: java-user@lucene.apache.org
Subject: Re: Does Lucene support partition-by-keyword indexing?

Hi,

I agree with your point that it is easier to partition an index by document. But the partition-by-keyword approach has much greater scalability than the partition-by-document approach: each query involves communicating with a constant number of nodes, while partition-by-doc requires spreading the query across all or many of the nodes. I am actually doing some small research on this.

By the way, the documents to be indexed are not necessarily web pages; they are mostly files stored on each node's file system. Node failures are also handled by replicas: the index for each term will be replicated on multiple nodes whose nodeIDs are near each other. This mechanism is handled by the underlying DHT system.

So, any idea how I can partition an index by keyword in Lucene? Thanks.

On Sun, Mar 2, 2008 at 5:50 AM, Mathieu Lecarme [EMAIL PROTECTED] wrote:

The easiest way is to split the index by Document. In Lucene, an index contains Documents and an inverse index of Terms. If you want to put Terms in a different place, Documents will be duplicated on each index, each with only a part of its Terms. How will you manage node failure in your network? There were some attempts to build a big p2p search engine to compete with Google, but it will be easier to split by Document. If you have too many computers and want to see them working together, why not use Nutch with Hadoop?

M.

On 1 March 2008 at 19:16, Yin Qiu wrote:

Hi,

I'm planning to implement a search infrastructure on a P2P overlay.
To achieve this, I want to first distribute the indices to various nodes connected by this overlay. My approach is to partition the indices by keyword, that is, one node takes care of certain keywords (or terms). When a simple TermQuery is encountered, we just find the node associated with that term (via a distributed hash table) and get the result. And suppose a BooleanQuery is issued: we contact all the nodes involved in this query and finally merge the results. So my question is: does Lucene support partitioning the indices by keyword?

Thanks in advance.

--
Look before you leap
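A sketch of the "index at each node, merge, then distribute the combined index" alternative suggested above, against the Lucene 2.3-era API (the directory variables are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

IndexWriter writer = new IndexWriter(combinedDir, new StandardAnalyzer(), true); // create a fresh combined index
writer.addIndexes(new Directory[] { node1Dir, node2Dir }); // merge the per-node indexes
writer.optimize(); // one segment; zips well for shipping to the nodes
writer.close();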
Re: Chinese Segmentation with Phrase Query
Hi Cedric,

Although I have no idea of the Chinese language, I went a different route to overcome language-specific problems: instead of a language-specific segmentation, we use statistical segmentation with bigrams. E.g., given your sentence XYZABCDEF, suppose the segmentation is XY YZ ZA AB BC CD DE EF. A SpanNearQuery of (XY, BC, DE, EF) with a distance of 10 should then match this document. I am not sure this works in your case, because we index product information and descriptions, which are not language-friendly anyway because of the abbreviations.

Regards
Uwe Goetzke

-Original Message-
From: Cedric Ho [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 10, 2007 02:28
To: java-user@lucene.apache.org
Subject: Re: Chinese Segmentation with Phrase Query

On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote:

Hi Cedric,

On 11/08/2007, Cedric Ho wrote:

For a sentence containing characters ABC, it may be segmented into AB, C or A, BC. [snip] In these cases we would like to index both segmentations:

AB offset (0,1) position 0    A  offset (0,0) position 0
C  offset (2,2) position 1    BC offset (1,2) position 1

Now the problem is, when someone searches using a PhraseQuery (AC), it will find this line ABC, because it matches A (position 0) and C (position 1). Is there any way to search for an exact match using the offset information instead of the position information?

Since you are writing the tokenizer (the Lucene term for the module that performs the segmentation), you yourself can substitute the beginning offset for the position. But I think that without the end offset, it won't get you what you want. For example, if your above example were indexed with beginning offsets as positions, a phrase query for "AB C" would fail to match -- even though it should match -- because the segments' beginning offsets (0 and 2) are not contiguous.

The new Payloads feature could provide the basis for storing the beginning and ending offsets required to determine contiguity when matching phrases, but you would have to write matching and scoring for this representation, and that may not be the quickest route available to you.

Solution #1: Create multiple fields, one for each full alternative segmentation, and then query against all of them.

Solution #2: Store the alternative segmentations in the same field, but instead of interleaving the segments' positions, as in your example, make the position ranges of the alternatives non-contiguous. Recasting your example:

Alternative #1    Alternative #2     Alternative #3
--------------    ---------------    ---------------
AB position 0     A  position 100    A  position 200
C  position 1     BC position 101    B  position 201
                                     C  position 202

There is a problem with both of the above-described solutions: in my limited experience with Chinese segmentation, substantially less than half the text has alternative segmentations. As a result, the segments on which all of the alternatives agree (call them "uncontested" segments) will have higher term frequencies than those segments which differ among the alternatives ("contested" segments). This means that document scores will be influenced by the variable density of the contested segments they contain. However, if you were to use my above-described Solution #1 along with a DisjunctionMaxQuery[1] as a wrapper around one query per alternative-segmentation field, the term frequency problem would no longer be an issue.
From the API doc for DisjunctionMaxQuery:

"A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give)."

Unlike the use case mentioned above, where each field is boosted differently, you probably don't have any information about the relative probability of the alternative segmentations, so you'll want to use the same boost for each sub-query.

Steve

[1] http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Hi Steve,

We have actually thought about Solution #1, and in our case sorting by score is not a very important factor either. However, this would double the index size. A full index of our documents now would