Lucene or Nutch ?
Hi All, I have to develop a prototype of a search/indexing system with the following characteristics: 1) High volume of data to index, but only with add and delete functionality (approximately 10 PDF) = scalable architecture; HDFS seems good. 2) A specific analysis chain and a given set of metadata to index. 3) Language recognition. 4) No graphical interface for searching is needed and no crawling is needed; indexing and search are performed with HTTP requests to a servlet. What is the best starting choice for this: Lucene or Nutch? As far as I know, Lucene is a good choice for 2 and 4, and Nutch is a better choice for 1 and 3. Is Nutch as configurable as Lucene regarding the indexing and search process, and is it possible to write plug-ins for specific analysis? Bruno - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
FS lock on NFS mounted filesystem for indexing
Hi All, I am getting a strange problem while the indexer process runs on a Redhat ES4 Linux machine:

java.io.FileNotFoundException: /u01/export/index/books/_2s.fnm (No such file or directory)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
 at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
 at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
 at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
 at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
 at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
 at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:674)
 at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:646)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:453)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:436)

After looking through the mailing list, I think that when the indexer runs on an NFS-mounted filesystem, it is a filesystem-locking problem, because I run the indexer and also search the index in parallel. I am using Lucene 1.9.1 with a JDK 1.5 VM on Redhat ES4 64-bit Linux with a dual-core Opteron processor. Any information would be greatly appreciated. Thanks, supriya
Optimize completely in memory with a FSDirectory?
Hi all, I have a question about memory/file I/O settings and FSDirectory. The setMaxBufferedDocs and related parameters already help a lot to fully exploit my RAM when indexing, but since I'm running a fairly small index of around 4 docs and I'm optimizing it relatively often, I was wondering if there is any way to enforce complete in-memory optimization. The annoying thing is that even with a maxBufferedDocs of 5, it still writes lots of tiny files to disk (together almost 2-3 times the size of the index), and the disk I/O skyrockets for a few seconds. I have enough memory to hold the index many times over, so that really shouldn't be the problem, and it would be so much faster (I have to think). Any hints? Best regards, Max Pfingsthorn Hippo Oosteinde 11 1017WT Amsterdam The Netherlands Tel +31 (0)20 5224466 - [EMAIL PROTECTED] / www.hippo.nl
Which Analyzer to use when searching on Keyword fields
Hi, I am using Lucene 1.4.3. Some of my fields are indexed as Keywords. I have also subclassed Analyzer in order to add stemming etc. I am not sure if the input is tokenized when I am searching on keyword fields; I don't want it to be. Do I need a special case in the overridden method (Analyzer.tokenStream()) to handle keyword fields? I've noticed that there's a KeywordTokenizer now in the API, but it's not there in Lucene 1.4.3. If I were using 1.9, I could probably determine if the field was a keyword one and then return a KeywordTokenizer(Reader), but I am using 1.4.3. Any advice is appreciated. -Venu
Re: Which Analyzer to use when searching on Keyword fields
Venu, I presume you're asking about what Analyzer to use with QueryParser. QueryParser analyzes all term text, but you can fake it for Keyword (non-tokenized) fields by using PerFieldAnalyzerWrapper, specifying the KeywordAnalyzer for the fields you indexed as such. The KeywordAnalyzer code will work with 1.4.3, so just grab that class and put it into your project. A couple of variations of it are also included with the Lucene in Action code. Erik

On Apr 5, 2006, at 7:52 AM, Satuluri, Venu_Madhav wrote: [snip]
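A minimal sketch of Erik's suggestion. The field names are made up for illustration, and the static QueryParser.parse form is used because it exists in 1.4.3:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Default analyzer for tokenized fields; KeywordAnalyzer for fields
// that were indexed as Keyword (a single untokenized term).
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("id", new KeywordAnalyzer());       // hypothetical keyword field
analyzer.addAnalyzer("category", new KeywordAnalyzer()); // hypothetical keyword field

// QueryParser now leaves id/category terms untouched but analyzes everything else.
Query q = QueryParser.parse("category:Sports-News AND title:lucene", "title", analyzer);
```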
RE: Which Analyzer to use when searching on Keyword fields
You understood me right, Erik. Your solution is working well, thanks. Venu

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05, 2006 6:03 PM To: java-user@lucene.apache.org Subject: Re: Which Analyzer to use when searching on Keyword fields [snip]
searching offline
Hi, I have a large collection of text documents that I want to search using Lucene. Is there any command-line utility that will allow me to search this static collection of documents? Writing one is an option, but I want to know if anyone has already done this. Thanks in advance, Delip
RE: searching offline
Red Piranha: http://red-piranha.sourceforge.net/

-Original Message- From: Delip Rao [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05, 2006 6:53 PM To: java-user@lucene.apache.org Subject: searching offline [snip]
Re: searching offline
http://regain.sourceforge.net/ ?

- Original Message - From: Delip Rao [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, April 05, 2006 2:23 PM Subject: searching offline [snip]
Re: Re[4]: OutOfMemory with search(Query, Sort)
On 4/5/06, Artem Vasiliev [EMAIL PROTECTED] wrote: The int[] array here contains references to String[] and to populate it all the field values still need to be loaded and compared/sorted.

Terms are stored and iterated in sorted order, so no sorting needs to be done. It's still the case that all the terms for that field need to be iterated over, though. Another approach might be to store term vectors and retrieve the term only from documents matching a particular query. It might be slower per query, but wouldn't have the overhead of populating the int[]. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
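A rough sketch of the term-vector approach Yonik describes, assuming the sort field was indexed with term vectors enabled (the path, field name, and docId here are illustrative):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// For each hit, read the sort field's value from its term vector instead of
// populating FieldCache-style int[]/String[] arrays for the whole index up front.
IndexReader reader = IndexReader.open("/path/to/index");
int docId = 42; // a document id matching the query
TermFreqVector tfv = reader.getTermFreqVector(docId, "myField");
if (tfv != null) {
    // For an untokenized field there is a single term: the field's value.
    String value = tfv.getTerms()[0];
}
reader.close();
```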
WRITE_LOCK_TIMEOUT
Hi. Is it correct that in release 1.9.1 WRITE_LOCK_TIMEOUT is hardcoded and there is no way to set it from outside? I've seen a check-in in the CVS from a few days ago which added getters/setters for this, but ... there is no release containing this, right? So, my question is: is it safe to use a nightly build for production use? Thanks, cug
Re: Lucene or Nutch ?
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote: 1) High volume of data to index, but only with add and delete functionality (approximately 10 PDF) = scalable architecture; HDFS seems good. 2) A specific analysis chain and a given set of metadata to index. 3) Language recognition. 4) No graphical search interface is needed and no crawling is needed; indexing and search are performed with HTTP requests to a servlet. What is the best starting choice for this: Lucene or Nutch? As far as I know, Lucene is a good choice for 2 and 4, and Nutch is a better choice for 1 and 3.

Solr would also be good for 2 and 4. As for 1, what type of scalability requirements are we talking about? (# of documents, size of docs, etc.) -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: WRITE_LOCK_TIMEOUT
Is it correct that in release 1.9.1 WRITE_LOCK_TIMEOUT is hardcoded and there is no way to set it from outside? [snip]

Or, as I suggested a couple of days ago, a 1.9.2 release could be offered. Bill
Re: WRITE_LOCK_TIMEOUT
On 05.04.2006, at 17:15, Bill Janssen wrote: Or, as I suggested a couple of days ago, a 1.9.2 release could be offered.

That would be a good idea, because the current nightly builds have removed a lot of deprecated methods which were available in 1.9.1. A lot of work just for this ... :-( cug
Lucene Document order not being maintained?
I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope someone can help me with. My application counts on Lucene maintaining the order of the documents exactly the same as how I insert them. Lucene is supposed to maintain document order, even across index merges, correct? My indexing process works as follows (and some of this is a hold-over from the time before Lucene had a compound file format - so bear with me). I open up a file-based index, using a merge factor of 90 and, in my current test, the compound index format. When I have added 100,000 documents, I close this index and start on a new index. I continue this until I'm done with all of the documents. Then, as a last step, I open up a new empty index and call addIndexes(Directory[]) - and I pass in the directories in the same order that I created them. This allows me to use higher merge factors without running into file-handle issues, and without having to call optimize. The problem that I am seeing right now is that when I look into my large combined index with Luke, document number 899 is the 899th document that I added. However, document 900 is the 49860th document that I added. This continues until document 910, where it suddenly jumps to the 99720th document. Is this a bug, or am I misusing something in the API? Thanks, Dan -- Daniel Armbrust Biomedical Informatics Mayo Clinic Rochester daniel.armbrust(at)mayo.edu http://informatics.mayo.edu/
lucene sorting
Hi, I need to change the Lucene sorting to give just a bit more relevance to recent documents (but I don't want to sort by date). I'd like to mix the Lucene score with the date of the document. I'm following the example in Lucene in Action, chapter 6. I'm trying to extend SortComparatorSource, but I don't understand how to get the Lucene score of the document. Do you have any ideas about how to solve my problem? Or do you know where to get more examples of custom sorting? Thanks, Gian Marco
Re: Lucene or Nutch ?
Thanks for your answer, I was not aware of the SOLR project. There was a big typo here: I meant less than 10 Go of PDF files per day during one month, i.e. less than 300 Go of PDF files. I made some tests with PDF files: 100 Mo of native PDF is converted to 3 Mo of index in Lucene [the text was indexed but not stored]. Bruno

Yonik Seeley wrote: [snip]
Re: Lucene or Nutch ?
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote: Thanks for your answer, I was not aware of the SOLR project. There was a big typo here: I meant less than 10 Go of PDF files per day during one month, i.e. less than 300 Go of PDF files.

Sorry, I'm not sure what the Go abbreviation is... I assume it's gigabytes (GB or GiB)? If so, that's a lot. I'd probably go with Nutch. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Lucene Document order not being maintained?
: exactly the same as how I insert them. Lucene is supposed to maintain
: document order, even across index merges, correct?

Lucene definitely maintains index order for document additions -- but I don't know if any similar claim has been made about merging whole indexes.

: this until I'm done with all of the documents. Then, as a last step, I
: open up a new empty index, and I call addIndexes(Directory[]) - and I
: pass in the directories in the same order that I created them. ...
: The problem that I am seeing right now, is that when I look into my
: large combined index with Luke, Document number 899 is the 899th
: document that I added. However, Document 900 is the 49860th document
: that I added. This continues until Document 910, where it suddenly
: jumps to the 99720th document.

As I said, I'm not sure if it's a bug or undefined behavior, but can you post a self-contained JUnit test demonstrating this? -- that way people can look at exactly what is going on (if it is a bug). -Hoss
Re: lucene sorting
I don't know if there is any way for a custom sort to access the Lucene score -- but another approach that works very well is to use the FunctionQuery classes from Solr... http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html ...you can make a FunctionQuery object that scores things linearly (or reciprocally, or by any other function you implement in Java) based on the value of any field -- and then add that query to a BooleanQuery along with your original query, and use the boost to determine how much of an influence it has on your final score. -Hoss
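A rough sketch of what Hoss describes, assuming the Solr function classes named in the linked javadoc (the class names, constructor parameters, field name, and originalQuery are assumptions; verify against the package summary):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.ReciprocalFloatFunction;
import org.apache.solr.search.function.ReverseOrdFieldSource;

// Recency boost: ReverseOrdFieldSource gives newer dates smaller ordinals,
// and a/(m*x + b) maps that to a score near 1.0 for the newest documents.
Query recency = new FunctionQuery(
    new ReciprocalFloatFunction(new ReverseOrdFieldSource("date"), 0.001f, 1.0f, 1.0f));
recency.setBoost(0.5f); // controls how much recency influences the final score

BooleanQuery combined = new BooleanQuery();
combined.add(originalQuery, BooleanClause.Occur.MUST); // originalQuery: the user's query
combined.add(recency, BooleanClause.Occur.SHOULD);     // adds to the score, doesn't filter
```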
Re: QueryParser error + solution
Daniel, you are very clever! Your solution reminds me of this: No temptation has overtaken you but such as is common to man; and God is faithful, who will not allow you to be tempted beyond what you are able, but with the temptation will provide the way of escape also, so that you will be able to endure it. 1 Corinthians 10:13 (New American Standard Version) Well done, Erik!

Original Message Follows From: Daniel Noll [EMAIL PROTECTED] Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: QueryParser error + solution Date: Wed, 05 Apr 2006 14:26:20 +1000

miki sun wrote: Thanks Erik and Michael! I copied some code from demo.SearchFiles.java; I do not have a clearer tracing message. Now it works. But do you have a better way than this: [snip]

Something like this?

String str = "Really bad query string: lots of evil stuff!";
str = QueryParser.escape(str);

Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://www.nuix.com.au/ Fax: +61 2 9212 6902 This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this message or attachment is strictly prohibited.
Re: Optimize completely in memory with a FSDirectory?
On Wednesday, 05 April 2006 13:02, Max Pfingsthorn wrote: The setMaxBufferedDocs and related parameters already help a lot to fully exploit my RAM when indexing, but since I'm running a fairly small index of around 4 docs and I'm optimizing it relatively often, I was wondering if there is any way to enforce complete in-memory optimization.

Maybe you could use a RAMDirectory and write it to disk using IndexWriter.addIndexes() from time to time? Regards Daniel -- http://www.danielnaber.de
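A hedged sketch of Daniel's suggestion (the index path and analyzer are placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Build and optimize the small index entirely in memory...
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
// ... addDocument() calls go here ...
ramWriter.optimize(); // no disk I/O: all segment files live in RAM
ramWriter.close();

// ...then merge it into the on-disk index from time to time.
IndexWriter fsWriter = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();
```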
Re: Lucene Document order not being maintained?
Chris Hostetter wrote: Lucene definitely maintains index order for document additions -- but I don't know if any similar claim has been made about merging whole indexes. [snip] As I said, I'm not sure if it's a bug or undefined behavior, but can you post a self-contained JUnit test demonstrating this?

Well, I set out to write a JUnit test case to quickly show this... but I'm having a heck of a time doing it. With relatively small numbers of documents containing very few fields, I haven't been able to recreate the out-of-order problem. However, with my real process, with a ton more data, I can recreate it every single time I index (it even gets the same documents out of order, consistently). I'll continue to try to generate a test case that gets the docs out of order... but if someone in the know could answer authoritatively whether or not Lucene is supposed to maintain document order when you merge multiple indexes together, that would be great. Thanks, Dan
Re: Lucene Document order not being maintained?
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: I'll continue to try to generate a test case that gets the docs out of order... but if someone in the know could answer authoritatively whether [snip]

I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks like it should preserve order. The directories are added in order, and the segments for each directory are added in order. The merging code is shared, so it shouldn't do anything different than normal segment merges. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Lucene Document order not being maintained?
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: I haven't been able to recreate the out-of-order problem. However, with my real process, with a ton more data, I can recreate it every single time I index (it even gets the same documents out of order, consistently).

If you have enough file handles, you can test whether it's a Lucene problem or your app by opening a MultiReader over all the indexes and testing whether the documents are in the order you think they are *before* merging. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
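Something along these lines, assuming a stored counter field (here called "loadOrder", a made-up name) was added to each document, and `dirs` is the Directory[] that will later be passed to addIndexes():

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;

// Open all the per-batch indexes in creation order and view them as one index.
IndexReader[] readers = new IndexReader[dirs.length];
for (int i = 0; i < dirs.length; i++) {
    readers[i] = IndexReader.open(dirs[i]);
}
IndexReader all = new MultiReader(readers);

// If documents are already out of order here, the problem is upstream of addIndexes().
for (int docId = 0; docId < all.maxDoc(); docId++) {
    String counter = all.document(docId).get("loadOrder");
    if (!String.valueOf(docId).equals(counter)) {
        System.out.println("out of order at doc " + docId + ": " + counter);
    }
}
all.close();
```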
Re: Lucene Document order not being maintained?
Yonik Seeley wrote: I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks like it should preserve order. The directories are added in order, and the segments for each directory are added in order. [snip]

Thanks for checking, Yonik. I'm fairly certain that this is a Lucene bug then - I will try to come up with a reproducible test case. My load code is pretty simple: whenever I create a new document, I put in a field that contains a counter of the load order. When I look at the individual indexes, things are fine - but after it merges them, I get a significant percentage of documents which have been reordered. One other thing I can look into: I've been building these indexes on a 64-bit Linux machine, using a 64-bit JVM. I need to see if the same error happens on 32-bit Windows. Dan
Re: Lucene Document order not being maintained?
: Well, I set out to write a JUnit test case to quickly show this... but
: I'm having a heck of a time doing it. ...

It's very possible that the problem is specific to large numbers of documents/indexes, or that it's specific to FSDirectory - so if you can't reproduce with a handful of docs on a RAMDirectory, don't shy away from making a test case that creates 10 1GB indexes in ./test-doc-order-on-merge or something like that, if it's the only way to reproduce the problem. Just warn us if it's not obvious from the code that it does that :) -Hoss
Re: Lucene Document order not being maintained?
Dan Armbrust wrote: My indexing process works as follows (and some of this is a hold-over from the time before Lucene had a compound file format - so bear with me). [snip]

As others have noted, this should work correctly. I assume that your merge factor when calling addIndexes() is less than 90. If it's 90, then what you're doing is the same as what Lucene would automatically do. I think you could save yourself a lot of trouble if you simply lowered your merge factor substantially and then indexed everything in one pass. To make things go faster, set maxBufferedDocs=100 or larger. This should be as fast as what you're doing now and a lot simpler. Or is that the part where I was supposed to bear with you? Doug
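In code, Doug's single-pass suggestion boils down to something like this (the path, analyzer, and values are examples only):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.setMergeFactor(10);      // a modest merge factor keeps the open-file count low
writer.setMaxBufferedDocs(100); // buffer more docs in RAM before flushing a segment
// ... add every document in one pass; no separate addIndexes() step needed ...
writer.close();
```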
Re: Lucene Document order not being maintained?
On 4/5/06, Doug Cutting [EMAIL PROTECTED] wrote: As others have noted, this should work correctly.

One slight oddity I noticed with addIndexes(Dir[]) is that merging starts at one past the first new segment added (not the first new segment). It doesn't seem like that should hurt much, though. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Throughput doesn't increase when using more concurrent threads
Out of interest, does indexing time speed up much on 64-bit hardware?

I was able to speed up indexing on a 64-bit platform by taking advantage of the larger address space to parallelize the indexing process. One thread creates index segments with a set of RAMDirectories and another thread merges the segments to disk with addIndexes(). This resulted in a speed improvement of 27%. Peter

On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote: Peter Keegan wrote: I tried the AMD 64-bit JVM from Sun with MMapDirectory and I'm now getting 250 queries/sec and excellent CPU utilization (equal concurrency on all CPUs)!! Yonik, thanks for the pointer to the 64-bit JVM. I wasn't aware of it. Wow. That's fast. Out of interest, does indexing time speed up much on 64-bit hardware? I'm particularly interested in this side of things because for our own application any query response under half a second is good enough, but the indexing side could always be faster. :-) Daniel -- Daniel Noll Nuix Australia Pty Ltd [snip]
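The two-thread pipeline Peter describes might be sketched roughly as follows (batch size, queue depth, paths, and the moreDocuments() helper are all made up for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hand-off queue between the segment-building thread and the merging thread.
final BlockingQueue<RAMDirectory> queue = new ArrayBlockingQueue<RAMDirectory>(4);

// Builder thread: fills a RAMDirectory with one batch of documents, then hands it off.
Thread builder = new Thread() {
    public void run() {
        try {
            while (moreDocuments()) { // hypothetical "is there more input?" check
                RAMDirectory ramDir = new RAMDirectory();
                IndexWriter w = new IndexWriter(ramDir, new StandardAnalyzer(), true);
                // ... addDocument() calls for one batch ...
                w.close();
                queue.put(ramDir); // blocks if the merger falls behind
            }
        } catch (Exception e) { e.printStackTrace(); }
    }
};

// Merger thread: appends each finished in-memory segment to the on-disk index.
Thread merger = new Thread() {
    public void run() {
        try {
            IndexWriter disk =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            while (true) {
                RAMDirectory done = queue.take();
                disk.addIndexes(new Directory[] { done });
            }
        } catch (Exception e) { e.printStackTrace(); }
    }
};
```

The queue lets CPU-bound analysis and I/O-bound merging overlap, which is where the reported speedup would come from.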
Re: Lucene Document order not being maintained?
Yonik Seeley wrote: For your test case, try lowering numbers, such as maxBufferedDocs=2, mergeFactor=2 or 3, to create more segments more quickly and cause more merges with fewer documents.

Good suggestion. A merge factor of 2 made it happen much more quickly. Bug is filed: http://issues.apache.org/jira/browse/LUCENE-540 The JUnit test case is attached (although it may not be in the proper format for Lucene - but I think it's pretty straightforward). Dan
Re: Lucene Document order not being maintained?
Doug Cutting wrote: I think you could save yourself a lot of trouble if you simply lowered your merge factor substantially and then indexed everything in one pass. [snip]

Yep. This code was written when I had to index tons of stuff on Linux and was constantly running into file-handle issues (even with low merge factors). I ended up writing a wrapper for Lucene that handled it all for me, and I've just been reusing it. Then today, I ran into this issue. It may be time to rework some of the wrapper to take advantage of the Lucene updates :) Dan
Re: Lucene Document order not being maintained?
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: Yonik Seeley wrote: For your test case, try lowering numbers, such as maxBufferedDocs=2, mergeFactor=2 or 3 to create more segments more quickly and cause more merges with fewer documents. Good suggestion. A merge factor of 2 made it happen much more quickly. Bug is filed: http://issues.apache.org/jira/browse/LUCENE-540 Thanks Dan, I'll look into it tonight, as promised. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Lucene Document order not being maintained?
Ah Ha! I found the problem. SegmentInfos.read(Directory directory) reads the segment info in reverse order! I gotta go home now... I'll look into the right fix later (it depends on what else uses that method...) FYI, I managed to reproduce it with only 3 documents in each index. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Lucene Document order not being maintained?
Spoke too soon... the loop counter goes down to zero, but it looks like the segments are added in order.

for (int i = input.readInt(); i > 0; i--) {  // read segmentInfos
  SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
  addElement(si);
}

On 4/5/06, Yonik Seeley [EMAIL PROTECTED] wrote: Ah Ha! I found the problem. SegmentInfos.read(Directory directory) reads the segment info in reverse order! I gotta go home now... I'll look into the right fix later (it depends on what else uses that method...) FYI, I managed to reproduce it with only 3 documents in each index. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
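The observation above - that a countdown loop index does not by itself reverse the read order - can be shown with a minimal stand-alone snippet (plain Java, not the Lucene classes): each iteration still pulls the next item from the input and appends it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CountdownRead {
    // Same shape as the SegmentInfos.read loop: the index only counts
    // how many items remain; the read itself is sequential, and each
    // item is appended in the order it was read.
    public static List<String> read(List<String> stream) {
        Iterator<String> input = stream.iterator();
        List<String> infos = new ArrayList<>();
        for (int i = stream.size(); i > 0; i--) {
            infos.add(input.next());  // sequential read, in-order append
        }
        return infos;
    }

    public static void main(String[] args) {
        System.out.println(read(List.of("_0", "_1", "_2")));  // prints [_0, _1, _2]
    }
}
```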
Re: Lucene Document order not being maintained?
I realized what the real problem was during the drive home: merged segments are added after all the other segments, instead of in the spot where the original segments resided. I'll propose a patch soon... -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Lucene Document order not being maintained?
OK, the following patch seems to work for me! You might want to try it out on your larger test, Dan. The first part probably isn't necessary (the base=start instead of start+1), but the second part is. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

Index: org/apache/lucene/index/IndexWriter.java
===================================================================
--- org/apache/lucene/index/IndexWriter.java    (revision 391084)
+++ org/apache/lucene/index/IndexWriter.java    (working copy)
@@ -569,7 +569,7 @@
     // merge newly added segments in log(n) passes
     while (segmentInfos.size() > start+mergeFactor) {
-      for (int base = start+1; base < segmentInfos.size(); base++) {
+      for (int base = start; base < segmentInfos.size(); base++) {
         int end = Math.min(segmentInfos.size(), base+mergeFactor);
         if (end-base > 1)
           mergeSegments(base, end);
@@ -710,9 +710,9 @@
       infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
     }
-    for (int i = end-1; i >= minSegment; i--)  // remove old infos & add new
+    for (int i = end-1; i > minSegment; i--)   // remove old infos & add new
       segmentInfos.remove(i);
-    segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
+    segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
         directory));

     // close readers before we attempt to delete now-obsolete segments
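The effect of the second hunk can be illustrated with a plain list (hypothetical names, not Lucene code): the old code removed the whole merged range and appended the merged segment at the end, while the patched code overwrites the first slot of the range so the merged segment keeps its position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergePlacement {
    // Old behavior: remove the merged range [min, end) entirely, then
    // append the merged segment at the end -- segment order is disturbed.
    static List<String> appendAtEnd(List<String> segs, int min, int end, String merged) {
        List<String> out = new ArrayList<>(segs);
        for (int i = end - 1; i >= min; i--) out.remove(i);
        out.add(merged);
        return out;
    }

    // Patched behavior: remove all but the first slot of the range,
    // then overwrite that slot -- the merged segment stays in place.
    static List<String> setInPlace(List<String> segs, int min, int end, String merged) {
        List<String> out = new ArrayList<>(segs);
        for (int i = end - 1; i > min; i--) out.remove(i);
        out.set(min, merged);
        return out;
    }

    public static void main(String[] args) {
        List<String> segs = Arrays.asList("A", "B", "C", "D");
        System.out.println(appendAtEnd(segs, 1, 3, "BC"));  // prints [A, D, BC]
        System.out.println(setInPlace(segs, 1, 3, "BC"));   // prints [A, BC, D]
    }
}
```

The first output shows why documents appeared out of order: segment D now precedes the merged B+C segment.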
Re: Lucene Document order not being maintained?
addIndexes(Dir[]) was the only user of mergeSegments() that passed an endpoint that wasn't the end of the segment list, and hence the only caller to mergeSegments() that will see a change of behavior. Given that, I feel comfortable enough to commit this. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server On 4/5/06, Yonik Seeley [EMAIL PROTECTED] wrote: OK, the following patch seems to work for me! You might want to try it out on your larger test, Dan. The first part probably isn't necessary (the base=start instead of start+1), but the second part is. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

Index: org/apache/lucene/index/IndexWriter.java
===================================================================
--- org/apache/lucene/index/IndexWriter.java    (revision 391084)
+++ org/apache/lucene/index/IndexWriter.java    (working copy)
@@ -569,7 +569,7 @@
     // merge newly added segments in log(n) passes
     while (segmentInfos.size() > start+mergeFactor) {
-      for (int base = start+1; base < segmentInfos.size(); base++) {
+      for (int base = start; base < segmentInfos.size(); base++) {
         int end = Math.min(segmentInfos.size(), base+mergeFactor);
         if (end-base > 1)
           mergeSegments(base, end);
@@ -710,9 +710,9 @@
       infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
     }
-    for (int i = end-1; i >= minSegment; i--)  // remove old infos & add new
+    for (int i = end-1; i > minSegment; i--)   // remove old infos & add new
       segmentInfos.remove(i);
-    segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
+    segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
         directory));

     // close readers before we attempt to delete now-obsolete segments
Re: Lucene Document order not being maintained?
Thanks guys, as always... Lucene (and especially the people behind it) are top notch. Less than 6 hours from the time I figured out that the bug was in Lucene (and not my code, which is usually the case) - and it's already fixed (I'm going to assume - I'll test it tomorrow when I get to work). Amazing. Thanks again, Dan
Re: highlighting - fuzzy search
mark harwood wrote: Isn't that what Query.extractTerms is for? Isn't it implemented by all primitive Queries? As of last week, yes. I changed the SpanQueries to implement this method and then refactored the Highlighter package's QueryTermExtractor to make use of this (it radically simplified the code in there). This change to rely on extractTerms also means that the highlighter now works properly with classes like FilteredQuery. Very nice. Yet another point I can add onto the huge list of reasons our app should update Lucene. :-) Although I'd rather not rewrite the query first; it feels like it would use more memory than an extractTerms(IndexReader) method would. Maybe I'm wrong on this, though. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Fax: +61 2 9212 6902 Web: http://www.nuix.com.au/
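The extractTerms idea the thread refers to can be sketched as a simple composite pattern. This is a hypothetical miniature, not the actual Lucene classes: every query node reports its primitive terms, so a caller like the highlighter needs no per-type knowledge of query classes.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Each query node knows how to report the primitive terms it matches.
interface Query {
    void extractTerms(Set<String> terms);
}

// A leaf query contributes its single term.
class TermQuery implements Query {
    private final String term;
    TermQuery(String term) { this.term = term; }
    public void extractTerms(Set<String> terms) { terms.add(term); }
}

// A composite query just delegates to its children.
class BooleanQuery implements Query {
    private final List<Query> clauses;
    BooleanQuery(List<Query> clauses) { this.clauses = clauses; }
    public void extractTerms(Set<String> terms) {
        for (Query q : clauses) q.extractTerms(terms);
    }
}

public class ExtractTermsDemo {
    public static void main(String[] args) {
        Query q = new BooleanQuery(List.of(
            new TermQuery("lucene"),
            new BooleanQuery(List.of(new TermQuery("highlighter"), new TermQuery("fuzzy")))));
        Set<String> terms = new LinkedHashSet<>();
        q.extractTerms(terms);
        System.out.println(terms);  // prints [lucene, highlighter, fuzzy]
    }
}
```

This is why a term extractor written against the interface keeps working when new query types (SpanQuery, FilteredQuery) implement the method, as described above.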