Lucene or Nutch ?

2006-04-05 Thread Bruno Grilheres
Hi All, I have to develop a prototype of a search/indexation system with the following characteristics, 1) High volume of data indexation but only with add and delete functionality (approximately 10 PDF) = scalable architecture HDFS seems good. 2) Specific analysis chain and a given set of

FS lock on NFS mounted filesystem for indexing

2006-04-05 Thread Supriya Kumar Shyamal
Hi All, I got a strange problem during the indexer process running on Redhat ES4 Linux machine .. java.io.FileNotFoundException: /u01/export/index/books/_2s.fnm (No such file or directory) at java.io.RandomAccessFile.open(Native Method) at

Optimize completely in memory with a FSDirectory?

2006-04-05 Thread Max Pfingsthorn
Hi all, I have a question about memory/fileio settings and the FSDirectory. The setMaxBufferedDocs and related parameters help a lot already to fully exploit my RAM when indexing, but since I'm running a fairly small index of around 4 docs and I'm optimizing it relatively often, I was

Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Satuluri, Venu_Madhav
Hi, I am using lucene 1.4.3. Some of my fields are indexed as Keywords. I also have subclassed Analyzer in order to put stemming etc. I am not sure if the input is tokenized when I am searching on keyword fields; I don't want it to be. Do I need to have a special case in the overridden method

Re: Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Erik Hatcher
Venu, I presume you're asking about what Analyzer to use with QueryParser. QueryParser analyzes all term text, but you can fake it for Keyword (non-tokenized) fields by using PerFieldAnalyzerWrapper, specifying the KeywordAnalyzer for the fields you indexed as such. The KeywordAnalyzer
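
The per-field idea in this reply can be illustrated without Lucene on the classpath. The following is only a library-free sketch of the dispatch concept: the class PerFieldAnalysisSketch, its methods, and the field names "title" and "id" are hypothetical and are not Lucene's API. It assumes a default analyzer that lowercases and splits on whitespace, and a keyword analyzer that returns the whole input as one token.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of per-field analysis; NOT the Lucene classes themselves.
class PerFieldAnalysisSketch {
    // Stand-in for a tokenizing analyzer: lowercase, then split on whitespace.
    static final Function<String, List<String>> TOKENIZING =
        text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));

    // Stand-in for a keyword analyzer: the entire input is a single token.
    static final Function<String, List<String>> KEYWORD =
        text -> Collections.singletonList(text);

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a field-specific analyzer that overrides the default.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Use the field's own analyzer if one was registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper = new PerFieldAnalysisSketch(TOKENIZING);
        wrapper.addAnalyzer("id", KEYWORD);
        System.out.println(wrapper.analyze("title", "Lucene In Action")); // [lucene, in, action]
        System.out.println(wrapper.analyze("id", "ID 42-A"));             // [ID 42-A]
    }
}
```

The point of the design is that query text for a keyword field reaches the index unchanged, so it matches the untokenized value that was stored at indexing time.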

RE: Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Satuluri, Venu_Madhav
You understood me right, Erik. Your solution is working well, thanks. Venu -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05, 2006 6:03 PM To: java-user@lucene.apache.org Subject: Re: Which Analyzer to use when searching on Keyword fields Venu,

searching offline

2006-04-05 Thread Delip Rao
Hi, I have a large collection of text documents that I want to search using lucene. Is there any command line utility that will allow me to search this static collection of documents? Writing one is an option but I want to know if anyone has already done this. Thanks in advance, Delip

RE: searching offline

2006-04-05 Thread Satuluri, Venu_Madhav
Red Piranha: http://red-piranha.sourceforge.net/ -Original Message- From: Delip Rao [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05, 2006 6:53 PM To: java-user@lucene.apache.org Subject: searching offline Hi, I have a large collection of text documents that I want to search using

Re: searching offline

2006-04-05 Thread gekkokid
http://regain.sourceforge.net/ ? - Original Message - From: Delip Rao [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, April 05, 2006 2:23 PM Subject: searching offline Hi, I have a large collection of text documents that I want to search using lucene. Is there any

Re: Re[4]: OutOfMemory with search(Query, Sort)

2006-04-05 Thread Yonik Seeley
On 4/5/06, Artem Vasiliev [EMAIL PROTECTED] wrote: The int[] array here contains references to String[] and to populate it still all the field values need to be loaded and compared/sorted Terms are stored and iterated in sorted order, so no sorting needs to be done. It's still the case that all

WRITE_LOCK_TIMEOUT

2006-04-05 Thread Guido Neitzer
Hi. Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded and there is no way to set it from outside? I've seen a check-in in the CVS from a few days ago which added getters/setters for this, but ... there is no release containing this, right? So, my question is: Is it

Re: Lucene or Nutch ?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote: 1) High volume of data indexation but only with add and delete functionality (approximately 10 PDF) = scalable architecture HDFS seems good. 2) Specific analysis chain and a given set of meta-data indexation. 3) Language Recognition 4) No

Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Bill Janssen
Hi. Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded and there is no way to set it from outside? I've seen a check-in in the CVS from a few days ago which added getters/setters for this, but ... there is no release containing this, right? So, my question is:

Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Guido Neitzer
On 05.04.2006, at 17:15 Uhr, Bill Janssen wrote: Or, as I suggested a couple of days ago, a 1.9.2 release could be offered. Would be a good idea, because the current nightly builds have a lot of deprecated methods removed which were available in 1.9.1. Lot of work just for this ... :-(

Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope someone can help me with. My application counts on Lucene maintaining the order of the documents exactly the same as how I insert them. Lucene is supposed to maintain document order, even across index merges, correct? My

lucene sorting

2006-04-05 Thread Gian Marco Tagliani
Hi, I need to change the lucene sorting to give just a bit more relevance to the recent documents (but I don't want to sort by date). I'd like to mix the lucene score with the date of the document. I'm following the example in Lucene in Action, chapter 6. I'm trying to extend the

Re: Lucene or Nutch ?

2006-04-05 Thread Bruno Grilheres
Thanks for your answer, I was not aware of the SOLR project, There was a big typo here, I meant less than 10 Go of PDF files per day during one month = i.e. less than 300 Go of PDF files. I made some tests with PDF files, 100Mo or Native PDF are converted to 3Mo of index in lucene [The text

Re: Lucene or Nutch ?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote: Thanks for your answer, I was not aware of the SOLR project, There was a big typo here, I meant less than 10 Go of PDF files per day during one month = i.e. less than 300 Go of PDF files. Sorry, I'm not sure what the Go abbreviation is... I

Re: Lucene Document order not being maintained?

2006-04-05 Thread Chris Hostetter
: exactly the same as how I insert them. Lucene is supposed to maintain : document order, even across index merges, correct? Lucene definitely maintains index order for document additions -- but i don't know if any similar claim has been made about merging whole indexes. : this until I'm done

Re: lucene sorting

2006-04-05 Thread Chris Hostetter
I don't know if there is any way for a Custom Sort to access the lucene score -- but another approach that works very well is to use the FunctionQuery classes from Solr... http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html ...you can make a

Re: QueryParser error + solution

2006-04-05 Thread miki sun
Daniel you are very clever! Your solution reminds me of this: No temptation has overtaken you but such as is common to man; and God is faithful, who will not allow you to be tempted beyond what you are able, but with the temptation will provide the way of escape also, so that you will be able to

Re: Optimize completely in memory with a FSDirectory?

2006-04-05 Thread Daniel Naber
On Mittwoch 05 April 2006 13:02, Max Pfingsthorn wrote: The setMaxBufferedDocs and related parameters help a lot already to fully exploit my RAM when indexing, but since I'm running a fairly small index of around 4 docs and I'm optimizing it relatively often, I was wondering if there is

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Chris Hostetter wrote: : exactly the same as how I insert them. Lucene is supposed to maintain : document order, even across index merges, correct? Lucene definitely maintains index order for document additions -- but i don't know if any similar claim has been made about merging whole indexes.

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: I'll continue to try to generate a test case that gets the docs out of order... but if someone in the know could answer authoritatively whether I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks like it should preserve order. The

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: I haven't been able to recreate the out-of-order problem. However, with my real process, with a ton more data, I can recreate it every single time I index (it even gets the same documents out of order, consistently). If you have enough file

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Yonik Seeley wrote: On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: I'll continue to try to generate a test case that gets the docs out of order... but if someone in the know could answer authoritatively whether I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks like it

Re: Lucene Document order not being maintained?

2006-04-05 Thread Chris Hostetter
: Well, I set out to write JUnit test case to quickly show this... but : I'm having a heck of a time doing it. With relatively small numbers of : documents containing very few fields... I haven't been able to recreate : the out-of-order problem. However, with my real process, with a ton : more

Re: Lucene Document order not being maintained?

2006-04-05 Thread Doug Cutting
Dan Armbrust wrote: My indexing process works as follows (and some of this is hold-over from the time before lucene had a compound file format - so bear with me) I open up a File based index - using a merge factor of 90, and in my current test, the compound index format. When I have added

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Doug Cutting [EMAIL PROTECTED] wrote: As others have noted, this should work correctly. One slight oddity I noticed with addIndexes(Dir[]) is that merging starts at one past the first new segment added (not the first new segment). It doesn't seem like that should hurt much though.

Re: Throughput doesn't increase when using more concurrent threads

2006-04-05 Thread Peter Keegan
Out of interest, does indexing time speed up much on 64-bit hardware? I was able to speed up indexing on 64-bit platform by taking advantage of the larger address space to parallelize the indexing process. One thread creates index segments with a set of RAMDirectories and another thread merges

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Yonik Seeley wrote: For your test case, try lowering numbers, such as maxBufferedDocs=2, mergeFactor=2 or 3 to create more segments more quickly and cause more merges with fewer documents. Good suggestion. A merge factor of 2 made it happen much more quickly. Bug is filed:

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Doug Cutting wrote: I assume that your merge factor when calling addIndexes() is less than 90. If it's 90, then what you're doing is the same as Lucene would automatically do. I think you could save yourself a lot of trouble if you simply lowered your merge factor substantially and then

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote: Yonik Seeley wrote: For your test case, try lowering numbers, such as maxBufferedDocs=2, mergeFactor=2 or 3 to create more segments more quickly and cause more merges with fewer documents. Good suggestion. A merge factor of 2 made it

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
Ah Ha! I found the problem. SegmentInfos.read(Directory directory) reads the segment info in reverse order! I gotta go home now... I'll look into the right fix later (it depends on what else uses that method...) FYI, I managed to reproduce it with only 3 documents in each index. -Yonik

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
Spoke too soon... the loop counter goes down to zero, but it looks like the segments are added in order. for (int i = input.readInt(); i > 0; i--) { // read segmentInfos SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
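
The observation in this message, that a loop counting down to zero can still read entries in their stored order, can be checked with a small standalone sketch. This is an illustration under assumptions, not the Lucene code: java.io.DataInputStream stands in for the index input, readUTF stands in for readString, and the class CountdownReadOrder and the segment names "_a", "_b", "_c" are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for reading a segment list; not the Lucene implementation.
class CountdownReadOrder {
    // Write a count followed by the names, in the order given.
    static byte[] encode(String... names) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(names.length);
            for (String name : names) {
                out.writeUTF(name);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the names back with a loop whose counter runs down to zero.
    static List<String> readNames(byte[] data) {
        try {
            DataInputStream input = new DataInputStream(new ByteArrayInputStream(data));
            List<String> names = new ArrayList<>();
            // The counter only says how many entries remain; each entry is
            // read sequentially and appended, so stored order is preserved.
            for (int i = input.readInt(); i > 0; i--) {
                names.add(input.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readNames(encode("_a", "_b", "_c"))); // [_a, _b, _c]
    }
}
```

The counter is never used as an index into the list, which is why its direction does not affect the resulting order.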

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
I realized what the real problem was during the drive home. Merged segments are added after all other segments, instead of at the spot where the original segments resided. I'll propose a patch soon... -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
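
The effect described in this message can be modelled in a few lines. The following is a simplified sketch, not the Lucene source or the actual patch: segments are modelled as lists of document numbers, and the class SegmentMergeOrder with its methods mergeAppend and mergeInPlace is hypothetical. It assumes a merge of a contiguous range of segments that does not reach the end of the segment list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a segment list; each segment is a list of doc numbers.
class SegmentMergeOrder {
    // Concatenate the documents of the given segments in list order.
    static List<Integer> concat(List<List<Integer>> segments) {
        List<Integer> docs = new ArrayList<>();
        for (List<Integer> segment : segments) {
            docs.addAll(segment);
        }
        return docs;
    }

    // Behavior described as the problem: merge segments [start, end) and
    // append the merged segment after all remaining segments.
    static List<List<Integer>> mergeAppend(List<List<Integer>> segments, int start, int end) {
        List<Integer> merged = concat(segments.subList(start, end));
        List<List<Integer>> result = new ArrayList<>(segments.subList(0, start));
        result.addAll(segments.subList(end, segments.size()));
        result.add(merged);
        return result;
    }

    // Order-preserving alternative: put the merged segment back at the
    // position where the original segments were.
    static List<List<Integer>> mergeInPlace(List<List<Integer>> segments, int start, int end) {
        List<Integer> merged = concat(segments.subList(start, end));
        List<List<Integer>> result = new ArrayList<>(segments.subList(0, start));
        result.add(merged);
        result.addAll(segments.subList(end, segments.size()));
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> segments = Arrays.asList(
            Arrays.asList(0, 1), Arrays.asList(2, 3),
            Arrays.asList(4, 5), Arrays.asList(6, 7));
        System.out.println(concat(mergeAppend(segments, 1, 3)));  // [0, 1, 6, 7, 2, 3, 4, 5]
        System.out.println(concat(mergeInPlace(segments, 1, 3))); // [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

When the merged range ends at the end of the list, both variants give the same result, so in this model the difference only shows up for a merge that stops short of the last segment.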

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
OK, the following patch seems to work for me! You might want to try it out on your larger test Dan. The first part probably isn't necessary (the base=start instead of start+1), but the second part is. -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server Index:

Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
addIndexes(Dir[]) was the only user of mergeSegments() that passed an endpoint that wasn't the end of the segment list, and hence the only caller to mergeSegments() that will see a change of behavior. Given that, I feel comfortable enough to commit this. -Yonik http://incubator.apache.org/solr

Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Thanks guys as always... lucene (and especially the people behind it) are top notch. Less than 6 hours from the time I figured out that the bug was in Lucene (and not my code, which is usually the case) - and it's already fixed (I'm going to assume - I'll test it tomorrow when I get to work)

Re: highlighting - fuzzy search

2006-04-05 Thread Daniel Noll
mark harwood wrote: Isn't that what Query.extractTerms is for? Isn't it implemented by all primitive Queries?.. As of last week, yes. I changed the SpanQueries to implement this method and then refactored the Highlighter package's QueryTermExtractor to make use of this (it radically