Re: Throughput doesn't increase when using more concurrent threads
On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
> The index is non-compound format and optimized. Yes, I did try
> MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
>
> Peter

You could also give this a try:
http://issues.apache.org/jira/browse/LUCENE-283

Regards,
Paul Elschot
Getting the document number (with IndexReader)
I am attempting to prune an index by getting each document in turn and then checking/deleting it:

IndexReader ir = IndexReader.open(path);
for (int i = 0; i < ir.numDocs(); i++) {
    Document doc = ir.document(i);
    if (thisDocShouldBeDeleted(doc)) {
        ir.delete(docNum); // <- I need the docNum for doc.
    }
}

How do I get the docNum for the IndexReader.delete() function in the above case? Is there an API function I am missing? I am working with a merged index over different segments, so the docNum might not be in running sequence with the counter i. In general, is there a better way to do this sort of thing? Thanks!
Re: Getting the document number (with IndexReader)
On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
> I am attempting to prune an index by getting each document in turn and
> then checking/deleting it:
>
> IndexReader ir = IndexReader.open(path);
> for (int i = 0; i < ir.numDocs(); i++) {
>     Document doc = ir.document(i);
>     if (thisDocShouldBeDeleted(doc)) {
>         ir.delete(docNum); // <- I need the docNum for doc.
>     }
> }
>
> How do I get the docNum for the IndexReader.delete() function in the
> above case? Is there an API function I am missing?

The document number is the variable i in this case.

> I am working with a merged index over different segments, so the docNum
> might not be in running sequence with the counter i. In general, is
> there a better way to do this sort of thing?

This code:

Document doc = ir.document(i);

normally retrieves all the stored fields of the document, and that is quite costly. In case you know that the document(s) to be deleted match(es) a Term, it's better to use IndexReader.delete(Term).

Regards,
Paul Elschot
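As an aside, a minimal sketch of the delete-by-Term route Paul suggests; the index path, field name and value here are invented:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Deletes every document indexed with id:doc-42 and reports how many
    // were affected; no stored fields are loaded along the way.
    IndexReader ir = IndexReader.open("/path/to/index");
    int deleted = ir.delete(new Term("id", "doc-42"));
    ir.close(); // flushes the deletions to the index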
Re: Throughput doesn't increase when using more concurrent threads
Speaking of NioFSDirectory, I thought there was one posted a while ago. Is this something that can be used?
http://issues.apache.org/jira/browse/LUCENE-414

ray,

On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
> Jay Booth wrote:
> > I had a similar problem with threading, the problem turned out to be
> > that in the back end of the FSDirectory class I believe it was, there
> > was a synchronized block on the actual RandomAccessFile resource when
> > reading a block of data from it... high-concurrency situations caused
> > threads to stack up in front of this synchronized block and our CPU
> > time wound up being spent thrashing between blocked threads instead
> > of doing anything useful.
>
> This is correct. In Lucene, multiple streams per file are created by
> cloning, and all clones of an FSDirectory input stream share a
> RandomAccessFile and must synchronize input from it. MMapDirectory does
> not have this limitation. If your indexes are less than a few GB or you
> are using 64-bit hardware, then MMapDirectory should work well for you.
> Otherwise it would be simple to write an nio-based Directory that does
> not use mmap and is also unsynchronized. Such a contribution would be
> welcome.
>
> > Making multiple IndexSearchers and FSDirectories didn't help because
> > in the back end, lucene consults a singleton HashMap of some kind
> > (don't remember implementation) that maintained a single FSDirectory
> > for any given index being accessed from the JVM... multiple calls to
> > FSDirectory.getDirectory actually return the same FSDirectory object
> > with synchronization at the same point.
>
> This does not make sense to me. FSDirectory does keep a cache of
> FSDirectory instances, but i/o should not be synchronized on these. One
> should be able to open multiple input streams on the same file from an
> FSDirectory. But this would not be a great solution, since file handle
> limits would soon become a problem.
>
> Doug
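For readers wanting to try Doug's suggestion: FSDirectory.getDirectory() chooses its implementation from a system property, so, assuming a Lucene build that ships org.apache.lucene.store.MMapDirectory, something like the following sketch should select it. It must run before the FSDirectory class is first loaded:

    // Equivalent command-line form:
    //   java -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.MMapDirectory ...
    System.setProperty("org.apache.lucene.FSDirectory.class",
                       "org.apache.lucene.store.MMapDirectory");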
Re: Getting the document number (with IndexReader)
Hi, Thanks for the help, just a few more questions:

On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
> On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
> > I am attempting to prune an index by getting each document in turn
> > and then checking/deleting it:
> >
> > IndexReader ir = IndexReader.open(path);
> > for (int i = 0; i < ir.numDocs(); i++) {
> >     Document doc = ir.document(i);
> >     if (thisDocShouldBeDeleted(doc)) {
> >         ir.delete(docNum); // <- I need the docNum for doc.
> >     }
> > }
> >
> > How do I get the docNum for the IndexReader.delete() function in the
> > above case? Is there an API function I am missing?
>
> The document number is the variable i in this case.

If the document number is the variable i (enumerated up to numDocs()), what's the difference between numDocs() and maxDoc() in this case? I was previously under the impression that the internal docNum might be different from the counter.

> > I am working with a merged index over different segments, so the
> > docNum might not be in running sequence with the counter i. In
> > general, is there a better way to do this sort of thing?
>
> This code:
>
> Document doc = ir.document(i);
>
> normally retrieves all the stored fields of the document, and that is
> quite costly. In case you know that the document(s) to be deleted
> match(es) a Term, it's better to use IndexReader.delete(Term).

I'm doing something akin to a RangeQuery, where I delete documents within a certain range (in addition to other criteria). Is it better to do a query on the range, mark all the docNums by getting them with Hits.id(), and then retrieve docs and test for deletion according to that?

Thanks for the help
encoding
Hello, I have a problem with data I try to index with Lucene. I browse a directory and index text from different types of files through parsers. For text files, the data can be in different languages, and so in different encodings. If the data is in Turkish, for example, none of the special characters and accents are recognized in my Lucene index. Is there a way to resolve this problem? How do I work with the encoding? Thanks for your help

A.
Range number queries
For the recent questions about this, here are a couple of methods for encoding/decoding long values so that the encoded strings sort into numeric order for a range query:

public static String encodeLong(long num) {
    String hex = Long.toHexString(num < 0 ? Long.MAX_VALUE - (0xffffffffffffffffL ^ num) : num);
    hex = (num < 0 ? "N" : "P") + "0000000000000000".substring(0, 16 - hex.length()) + hex;
    return hex;
}

public static long decodeLong(String hex) {
    long num = Long.parseLong(hex.substring(1, 17), 16);
    return hex.charAt(0) == 'N' ? (Long.MAX_VALUE - num) ^ 0xffffffffffffffffL : num;
}

Hope this helps
Mike
www.ardentia.com the home of NetSearch
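A sketch of how these encoders might be used at index and search time, with the 1.4-era API; the field name and values are invented:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    // Index time: store the encoded long as a keyword (untokenized) field.
    Document doc = new Document();
    doc.add(Field.Keyword("price", encodeLong(1500L)));

    // Search time: lexicographic order on the encoded strings matches
    // numeric order on the longs, so a plain RangeQuery works.
    RangeQuery range = new RangeQuery(
        new Term("price", encodeLong(1000L)),
        new Term("price", encodeLong(2000L)),
        true); // inclusive bounds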
Re: Highlighter
Yes, that is correct... you need to rewrite the query. I was actually the main developer for the 1.5 .NET port, so if you come across any issues, please email me at my hotmail address, which I check more often than this one...

-Joe Langley

-----Original Message-----
From: Gwyn Carwardine [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tue, 24 Jan 2006 22:43:53
Subject: RE: Highlighter

Yes, I think you're right. On reading the Lucene in Action chapter on highlighting, I found it squirreled away in the middle of the text. I get the feeling that while I have so far found the query parser to be the primary method of building queries, this is not the primary method used by other people; otherwise I would have expected the first example in the book to use the query parser. So what I'm not quite sure about is how using direct queries came to be the norm.

It helped, thanks
-Gwyn

-----Original Message-----
From: Koji Sekiguchi [mailto:[EMAIL PROTECTED]]
Sent: 24 January 2006 22:23
To: java-user@lucene.apache.org
Subject: RE: Highlighter

I've never used the .NET port of Lucene and the highlighter, but I believe we have to call Query.rewrite() to expand the query expression when using PhraseQuery, WildcardQuery, RegexQuery and FuzzyQuery, then pass it to the highlighter.

Hope this helps,
Koji

-----Original Message-----
From: Gwyn Carwardine [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 25, 2006 6:28 AM
To: java-user@lucene.apache.org
Subject: Highlighter

I'm using the .NET port of the highlighter (1.5) and I notice it doesn't highlight range or prefix queries. Is this consistent with the Java version? Only I note my standard reference of www.lucenebook.com seems to support highlighting them... is this using that same highlighter version? (I couldn't find any version info on the Lucene Apache site.)

TIA
-Gwyn
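For reference, a sketch of the rewrite-before-highlighting pattern in the Java contrib highlighter; the .NET port should be analogous. The index path, field name and query are invented:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public static String highlight(String indexPath, String queryText,
                                   String documentText) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        Query query = QueryParser.parse(queryText, "contents", new StandardAnalyzer());
        // rewrite() expands prefix/wildcard/range queries into the concrete
        // term queries they match, which is what the highlighter needs to see.
        Query rewritten = query.rewrite(reader);
        Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
        String fragment = highlighter.getBestFragment(
            new StandardAnalyzer(), "contents", documentText);
        reader.close();
        return fragment;
    }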
RE : encoding
Hello and thanks for your answer. I do not find the ISOLatin1AccentFilter class in my Lucene jar, but I found one on Google, attached to this mail; could you tell me if it is the right one? I do not see anything in this class which can help me. This program will replace some accented characters, but my problem is: if I try to index a text file encoded in Western 1252, for example with the Turkish text düzenlediğimiz kampanyamıza, the Lucene index will contain re-encoded data such as #0;#17;k#0;#0;

Thanks, regards
A.

-----Original Message-----
From: John Haxby [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 26, 2006 03:01
To: java-user@lucene.apache.org
Subject: Re: encoding

arnaudbuffet wrote:
> For text files, the data can be in different languages, and so in
> different encodings. If the data is in Turkish, for example, none of
> the special characters and accents are recognized in my Lucene index.
> Is there a way to resolve this problem? How do I work with the
> encoding?

I've been looking at a similar problem recently. There's org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk which may be quite close to what you want.

I have a perl script here that I used to generate a downgrading table for a C program. I can let you have the perl script as is, but if there's enough interest(*) I'll use it to generate, say, CompoundAsciiFilter, since it converts compound characters like á, æ, ffi (ffi-ligature, in case it doesn't display) to a, ae and ffi. It's actually built from http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up having nearly 1200 entries. An earlier version converted all compound characters to their constituent parts, but this version just converts characters that are made up entirely of ASCII and modifiers.

jch

(*) Any interest, actually. Might be enough for me to be interested.
Re: Throughput doesn't increase when using more concurrent threads
Paul,

I tried this, but it ran out of memory trying to read the 500MB .fdt file. I tried various values for MAX_BBUF, but it still ran out of memory (I'm using -Xmx1600M, which is the JVM's maximum value (v1.5)). I'll give NioFSDirectory a try.

Thanks,
Peter

On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
> On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
> > The index is non-compound format and optimized. Yes, I did try
> > MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term
> > vectors)
> >
> > Peter
>
> You could also give this a try:
> http://issues.apache.org/jira/browse/LUCENE-283
>
> Regards,
> Paul Elschot
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The throughput is worse with NioFSDirectory than with FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels. The throughput with the patched FSDirectory was about the same as before the patch.

Thanks,
Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
> Speaking of NioFSDirectory, I thought there was one posted a while ago.
> Is this something that can be used?
> http://issues.apache.org/jira/browse/LUCENE-414
>
> ray,
>
> [...]
Re: Throughput doesn't increase when using more concurrent threads
Hmmm, can you run the 64-bit version of Windows (and hence a 64-bit JVM)? We're running with heap sizes up to 8GB (RH Linux 64-bit, Opterons, Sun Java 1.5).

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Paul,
>
> I tried this, but it ran out of memory trying to read the 500MB .fdt
> file. I tried various values for MAX_BBUF, but it still ran out of
> memory (I'm using -Xmx1600M, which is the JVM's maximum value (v1.5)).
> I'll give NioFSDirectory a try.
>
> Thanks,
> Peter
>
> [...]
Re: RE : encoding
On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote:
> I do not find the ISOLatin1AccentFilter class in my Lucene jar, but I
> found one on Google, attached to this mail; could you tell me if it is
> the right one?

This used to be in contrib/analyzers but has been moved into the core (Subversion only for now):
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/

> I do not see anything in this class which can help me. This program
> will replace some accented characters, but my problem is: if I try to
> index a text file encoded in Western 1252, for example with the Turkish
> text düzenlediğimiz kampanyamıza, the Lucene index will contain
> re-encoded data such as #0;#17;k#0;#0;

Reading encoded files is your application's responsibility. You need to be sure to read the files in using the proper encoding. Once the text is read properly into Java, all will be well as far as Lucene indexing the characters.

Erik
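A minimal sketch of Erik's point: decode the file with its actual charset before handing the text to Lucene. The helper name is invented, and the charset name at the end (windows-1254, the Turkish Windows code page) is an example; use whatever the file was actually written in:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public static String readFile(File file, String charset) throws IOException {
        // Decode the bytes with the file's actual charset; once in a Java
        // String the text is Unicode and Lucene indexes it correctly.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), charset));
        StringBuffer text = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            text.append(line).append('\n');
        }
        in.close();
        return text.toString();
    }

    // e.g. readFile(f, "windows-1254") for Turkish text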
Re: encoding
arnaudbuffet wrote:
> if I try to index a text file encoded in Western 1252, for example with
> the Turkish text düzenlediğimiz kampanyamıza, the Lucene index will
> contain re-encoded data such as #0;#17;k#0;#0;

ISOLatin1AccentFilter.removeAccents() converts that string to

    duzenlediğimiz kampanyamıza

The g-breve and the dotless-i are untouched. My AsciiDecomposeFilter.decompose() converts the string to

    duzenledigimiz kampanyamiza

However, since you're seeing those rather odd entities, it looks as though you're not actually indexing what you think you're indexing. As Erik says, you need to make sure that you're reading files with the proper encoding; removing accents and adding dots won't help.

jch
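For completeness, a sketch of how an accent-folding filter could be wired into an analyzer, assuming the trunk ISOLatin1AccentFilter discussed above; the class name is invented:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AccentFoldingAnalyzer extends Analyzer {
        private final Analyzer base = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Fold accented Latin-1 characters (é, ü, ...) to plain ASCII
            // after normal tokenization.
            return new ISOLatin1AccentFilter(base.tokenStream(fieldName, reader));
        }
    }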
Re: Throughput doesn't increase when using more concurrent threads
I'd love to try this, but I'm not aware of any 64-bit JVMs for Windows on Intel. If you know of any, please let me know. Linux may be an option, too.

btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which is pretty impressive. Another way around the concurrency limit is to run multiple JVMs. The throughput of each is less, but the aggregate throughput is higher.

Peter

On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
> Hmmm, can you run the 64-bit version of Windows (and hence a 64-bit
> JVM)? We're running with heap sizes up to 8GB (RH Linux 64-bit,
> Opterons, Sun Java 1.5).
>
> -Yonik
>
> [...]
Re: Throughput doesn't increase when using more concurrent threads
BEA JRockit supports both AMD64 and Intel's EM64T (basically renamed AMD64):
http://www.bea.com/framework.jsp?CNT=index.htm&FP=/content/products/jrockit/

and so does Sun's Java 1.5 for the Windows AMD64 Platform. They advertise AMD64, presumably because that's what their servers use, but it should work on Intel's x86_64 (EM64T) also. The release notes have the following:

    With the release, J2SE support for Windows 64-bit has progressed from
    release candidate to final release. This version runs on AMD64/EM64T
    64-bit mode machines with Windows Server 2003 x64 Editions.

Of course, if the platform is up to you, I'd choose Linux :-)

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
> I'd love to try this, but I'm not aware of any 64-bit JVMs for Windows
> on Intel. If you know of any, please let me know. Linux may be an
> option, too.
>
> [...]
Re: Getting the document number (with IndexReader)
: The document number is the variable i in this case.

: If the document number is the variable i (enumerated up to numDocs()),
: what's the difference between numDocs() and maxDoc() in this case? I
: was previously under the impression that the internal docNum might be
: different from the counter.

Iterating between 0 and maxDoc()-1 will give you the range of all possible doc ids, but some of those docs may have already been deleted. I believe that is what you want to do. ... you can check if a doc is deleted using IndexReader.isDeleted(i)

numDocs() is implemented as maxDoc() - deletedDocs.count(), so I don't think it ever makes sense to iterate up to numDocs().

: I'm doing something akin to a rangeQuery, where I delete documents
: within a certain range (in addition to other criteria). Is it better
: to do a query on the range, mark all the docNums getting them with
: Hits.id(), and then retrieve docs and test for deletion according to
: that?

Take a look at the way RangeFilter.bits() is implemented. If you cut/paste that code and replace the call to bits.set(termDocs.doc()) with reader.delete(termDocs.doc()), I think you'd have exactly what you want.

Or, since cutting/pasting code is A Bad Thing from a maintenance/bug-fixing standpoint, you could just call RangeFilter.bits(reader) yourself, and then iterate over the set bits and call delete on each one.

-Hoss
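A sketch of the first suggestion, walking the term index the way RangeFilter.bits() does but deleting instead of setting bits. The method name is invented, and the field name and bounds are whatever your range is; inclusive bounds assumed:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public static void deleteRange(IndexReader reader, String field,
                                   String lower, String upper) throws IOException {
        // terms() positions the enum at the first term >= lower in this field.
        TermEnum terms = reader.terms(new Term(field, lower));
        try {
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term term = terms.term();
                    if (term == null || !term.field().equals(field)
                            || term.text().compareTo(upper) > 0) {
                        break; // past the end of the range
                    }
                    termDocs.seek(term);
                    while (termDocs.next()) {
                        reader.delete(termDocs.doc());
                    }
                } while (terms.next());
            } finally {
                termDocs.close();
            }
        } finally {
            terms.close();
        }
    }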
Re: Getting the document number (with IndexReader)
On Thursday 26 January 2006 09:47, Chun Wei Ho wrote:
> If the document number is the variable i (enumerated up to numDocs()),
> what's the difference between numDocs() and maxDoc() in this case? I
> was previously under the impression that the internal docNum might be
> different from the counter.

IIRC, the difference between maxDoc() and numDocs() is the number of deleted documents. Check the javadocs to be sure.

> I'm doing something akin to a RangeQuery, where I delete documents
> within a certain range (in addition to other criteria). Is it better
> to do a query on the range, mark all the docNums by getting them with
> Hits.id(), and then retrieve docs and test for deletion according to
> that?

In that case it is faster to use the Terms generated inside the range query and then use these with IndexReader.delete(Term). To generate the terms, have a look at the source code of the rewrite() method of RangeQuery here:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/

Regards,
Paul Elschot
Re: Getting the document number (with IndexReader)
On Thursday 26 January 2006 19:44, Chris Hostetter wrote:
> [...]
>
> Take a look at the way RangeFilter.bits() is implemented. If you
> cut/paste that code and replace the call to bits.set(termDocs.doc())
> with reader.delete(termDocs.doc()), I think you'd have exactly what
> you want.
>
> Or, since cutting/pasting code is A Bad Thing from a maintenance/bug-
> fixing standpoint, you could just call RangeFilter.bits(reader)
> yourself, and then iterate over the set bits and call delete on each
> one.

Perhaps an extra rewrite method with a term visitor argument?

Regards,
Paul Elschot
Re: encoding
Hello,

On Jan 26, 2006, at 12:01, John Haxby wrote:
> I have a perl script here that I used to generate a downgrading table
> for a C program. I can let you have the perl script as is, but if
> there's enough interest(*) I'll use it to generate, say,
> CompoundAsciiFilter, since it converts compound characters like á, æ,
> ffi (ffi-ligature, in case it doesn't display) to a, ae and ffi. It's
> actually built from
> http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt so it winds up
> having nearly 1200 entries. An earlier version converted all compound
> characters to their constituent parts, but this version just converts
> characters that are made up entirely of ASCII and modifiers.

I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :))

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
Re: Throughput doesn't increase when using more concurrent threads
Doug Cutting wrote:
> A 64-bit JVM with NioDirectory would really be optimal for this.

Oops. I meant MMapDirectory, not NioDirectory.

Doug
Re: Throughput doesn't increase when using more concurrent threads
Dumb question: does the 64-bit compiler (javac) generate different code than the 32-bit version, or is it just the JVM that matters? My reported speedups were solely from using the 64-bit JVM with jar files from the 32-bit compiler.

Peter

On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
> Nice speedup! The extra registers in 64-bit mode may have helped a
> little too.
>
> -Yonik
>
> On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
> > Correction: make that 285 qps :)
Re: Throughput doesn't increase when using more concurrent threads
There is no difference in bytecode... the whole difference is just in the underlying JVM.

-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Dumb question: does the 64-bit compiler (javac) generate different code
> than the 32-bit version, or is it just the JVM that matters? My
> reported speedups were solely from using the 64-bit JVM with jar files
> from the 32-bit compiler.
>
> Peter
Re: Throughput doesn't increase when using more concurrent threads
Peter,

Wow, the speedup is impressive! But may I ask what you did to achieve 135 queries/sec prior to the JVM switch?

ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Correction: make that 285 qps :)
>
> On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
> > I tried the AMD64 JVM from Sun with MMapDirectory and I'm now getting
> > 250 queries/sec and excellent CPU utilization (equal concurrency on
> > all CPUs)!!
> >
> > Yonik, thanks for the pointer to the 64-bit JVM. I wasn't aware of
> > it. Thanks all very much.
> >
> > Peter
> >
> > On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
> > > Doug Cutting wrote:
> > > > A 64-bit JVM with NioDirectory would really be optimal for this.
> > >
> > > Oops. I meant MMapDirectory, not NioDirectory.
> > >
> > > Doug
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The short answer is that you can make Lucene blazingly fast by using the advice and design principles mentioned in this forum and, of course, by reading 'Lucene in Action'. For example: use a 'content' field for searching all fields (vs. a multi-field search) as in the sketch below, put all your stored data in one field, and understand the cost of numeric search and sorting. On the platform side, go multi-CPU and of course 64-bit if possible :)

Also, I would venture to guess that a lot of search bottlenecks have nothing to do with Lucene, but rather with the infrastructure around it. For example, how does your client interface to the search engine? My results use a plain socket interface between client and server (one connection for queries, another for results), using a simple query/results data format. Introducing other web infrastructures invites degradation in performance, too.

I've a bit of experience with search engines, but I'm obviously still learning thanks to this group.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
> Peter,
>
> Wow, the speedup is impressive! But may I ask what you did to achieve
> 135 queries/sec prior to the JVM switch?
>
> ray,
>
> [...]
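A hedged sketch of the "one search field, one stored field" idea using the 1.4-era Field helpers; the method and the 'data' field name are invented:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public static Document makeDoc(String title, String body, String packedData) {
        Document doc = new Document();
        // One catch-all indexed field to search, instead of multi-field queries:
        doc.add(Field.UnStored("content", title + " " + body));
        // All stored data packed into a single retrievable field:
        doc.add(Field.UnIndexed("data", packedData));
        return doc;
    }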
problem updating a document: no segments file?
Hello,

I have a couple of instances of Lucene. I just altered one implementation and now it's not keeping a segments file. While indexing occurs, there is a segments file, but once it's done, there isn't. All the other indexes have one. The problem comes when I try to update a document: it says the segments file is not found, and that stops it. This code was working fine on my development box, but now that I've gone to production it's not keeping that segments file. And it searches just fine. I can reindex over and over, and it keeps disappearing.

Any ideas?
Re: Throughput doesn't increase when using more concurrent threads
Peter,

Thanks for the advice! But for the 100+ queries/sec on a 32-bit platform, did you end up applying other patches, or using different FSDirectory implementations? Thanks!

ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
> Ray,
>
> The short answer is that you can make Lucene blazingly fast by using
> the advice and design principles mentioned in this forum and, of
> course, by reading 'Lucene in Action'.
>
> [...]
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The 135 qps rate was using the standard FSDirectory in 1.9.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
> Thanks for the advice! But for the 100+ queries/sec on a 32-bit
> platform, did you end up applying other patches, or using different
> FSDirectory implementations? Thanks!
>
> ray,
>
> [...]
Re: Two strange things in Lucene
> > Since I didn't find anything in the log from log4j, I did a kill -3
> > on the process and found two very interesting things: almost all
> > MultiSearcher threads were in this state:
> >
> > "MultiSearcher thread #1" daemon prio=10 tid=0x01900960 nid=0x81442c
> > waiting for monitor entry [0xfd7d269ff000..0xfd7d269ffb50]
> >     at java.util.Vector.size(Vector.java:270)
> >     - waiting to lock 0xfd7f0114ea28 (a java.util.Vector)
> >     at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:95)
>
> I don't know about this one, but guessing that it just happens to be a
> normal state of the system when you killed the process. *shrugs*

You probably missed the -3 parameter. This just dumps the state of the virtual machine; it doesn't actually kill the JVM. Thus I believe that this is not a normal state.

> > And, additionally, I found another stacktrace in the stdout log which
> > I find interesting:
> >
> > Exception in thread "MultiSearcher thread #1"
> > org.apache.lucene.search.BooleanQuery$TooManyClauses
>
> This is a typical occurrence when using Querys that expand, such as
> WildcardQuery, RangeQuery, FuzzyQuery, etc. If users are doing queries
> like a* and there are over 1024 terms that start with a, then you
> will, by default, blow up WildcardQuery's expansion into a
> BooleanQuery. You can up that limit on BooleanQuery, or disallow those
> types of queries perhaps.

Ok, I'll see what I can do. Thanks!
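For reference, raising the expansion limit is a one-line static call, typically done once at startup; the new bound here is an arbitrary example:

    import org.apache.lucene.search.BooleanQuery;

    // Raise the limit from the default of 1024 clauses; pick a bound
    // that fits your heap, since each expanded term costs memory.
    BooleanQuery.setMaxClauseCount(16 * 1024);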
How does the lucene normalize the score?
Hi,

I want to know how Lucene normalizes the score. I see the Hits class has a function to get each document's score, but I don't know how Lucene calculates the normalized score; in Lucene in Action it only says "normalized score of the nth top-scoring document".

--
Regards
Jiang Xing
Re: Performance tips?
I seem to say this a lot :), but, assuming your OS has a decent filesystem cache, try reducing your JVM heap size, using an FSDirectory instead of a RAMDirectory, and see if your filesystem cache does OK. If you have 12GB, then you should have enough RAM to hold both the old and new indexes during the switchover.

-chris

On 1/26/06, Daniel Pfeifer [EMAIL PROTECTED] wrote:
> Hi,
>
> Got more questions regarding Lucene, and this time it's about
> performance ;-)
>
> We are currently using RAMDirectories to read our indexes. This has
> now become a problem since our index has grown to approx. 5GB of RAM
> and the machine we are running on only has 12GB of RAM, and every time
> we refresh the RAMDirectories we of course keep the old Searchables so
> that there is no service interruption. This means we consume 10GB of
> RAM from time to time.
>
> One solution is of course to stop using RAM and read everything from
> disk, but I can imagine that the performance will decrease
> significantly. Is there any workaround you can think of? Perhaps a
> hybrid between FSDirectory and RAMDirectory, for example one where
> only frequently searched documents are cached and the others are read
> from disk?
>
> Well, I'd appreciate any ideas at all!
>
> Thanks
> /Daniel
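A minimal sketch of the disk-backed alternative; the method name and path handling are invented. The IndexSearcher(String) constructor opens an FSDirectory under the hood, so the OS cache does the work the RAMDirectory was doing:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    public static IndexSearcher openSearcher(String indexPath) throws IOException {
        // Search straight off disk; the filesystem cache keeps the hot
        // parts of the index in memory without doubling the JVM heap.
        return new IndexSearcher(indexPath);
    }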