Lucene or Nutch ?

2006-04-05 Thread Bruno Grilheres

 Hi All,

I have to develop a prototype of a search/indexing system with the 
following characteristics:
1) High volume of data indexation but only with add and delete 
functionality (approximatively 10 PDF) = scalable architecture HDFS 
seems good.

2) Specific analysis chain and a given set of meta-data indexation.
3) Language Recognition
4) No graphical interface for searching is needed, no crawling is 
needed, Indexation and Search are performed with HTTP Request to a Servlet


What is the best starting choice for this : Lucene or Nutch ?

As far as I know Lucene is a good choice for 2 and 4, Nutch is a better 
choice for 1 and 3.


Is Nutch as configurable as Lucene regarding the indexing and search 
process, and is it possible to write plug-ins for specific analysis?


Bruno






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FS lock on NFS mounted filesystem for indexing

2006-04-05 Thread Supriya Kumar Shyamal

Hi All,

I hit a strange problem during the indexer process running on a Red Hat ES4 
Linux machine:
java.io.FileNotFoundException: /u01/export/index/books/_2s.fnm (No such 
file or directory)

   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
   at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
   at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
   at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
   at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
   at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
   at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:674)
   at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
   at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:646)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:453)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:436)


After looking through the mailing list, I suspect that because the 
indexer runs on an NFS-mounted filesystem, this is a filesystem-locking 
problem, since I run the indexer and, in parallel, also search 
on the index.


I am using Lucene 1.9.1 with a JDK 1.5 VM on Red Hat ES4 64-bit 
Linux on a dual-core Opteron. Any information would be greatly appreciated.
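One workaround I am considering, based on older list posts (just a sketch, and only valid if the indexing and searching JVMs run on the same host, since both must see the same lock files): FSDirectory keeps its lock files under java.io.tmpdir by default, and the location can be moved off the NFS mount via the org.apache.lucene.lockDir system property, set before the first FSDirectory is opened. The path below is hypothetical.

```java
// Keep Lucene's commit/write lock files on a local (non-NFS) filesystem.
// Both JVMs must use the same setting, applied before any FSDirectory
// is opened. "/var/tmp/lucene-locks" is a made-up example path.
System.setProperty("org.apache.lucene.lockDir", "/var/tmp/lucene-locks");
```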


Thanks,
supriya




Optimize completely in memory with a FSDirectory?

2006-04-05 Thread Max Pfingsthorn
Hi all,

I have a question about memory/file-I/O settings and the FSDirectory.
The setMaxBufferedDocs and related parameters already help a lot to fully 
exploit my RAM when indexing, but since I'm running a fairly small index of 
around 4 docs and I'm optimizing it relatively often, I was wondering if 
there is any way to enforce complete in-memory optimization.
The annoying thing is that even with a maxBufferedDocs of 5, it still writes 
lots of tiny files to disk (together almost 2-3 times the size of the index), 
and the disk I/O skyrockets for a few seconds. I have enough memory to hold the 
index many times over, so that really shouldn't be the problem, and it would 
be so much faster (I have to think).

Any hints?

Best regards,

Max Pfingsthorn

Hippo  

Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-
[EMAIL PROTECTED] / www.hippo.nl
-




Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Satuluri, Venu_Madhav
Hi,

I am using Lucene 1.4.3. Some of my fields are indexed as Keywords. I
have also subclassed Analyzer in order to add stemming etc. I am not sure
whether the input is tokenized when I am searching on keyword fields; I don't
want it to be. Do I need to have a special case in the overridden method
(Analyzer.tokenStream()) to handle keyword fields? 

I've noticed that there's a KeywordTokenizer in the API now, but it's not
there in Lucene 1.4.3. If I were using 1.9, I could probably determine
whether the field was a keyword one and then return a
KeywordTokenizer(Reader), but I am using 1.4.3.

Any advice is appreciated.
-Venu




Re: Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Erik Hatcher

Venu,

I presume you're asking about what Analyzer to use with QueryParser.   
QueryParser analyzes all term text, but you can fake it for Keyword  
(non-tokenized) fields by using PerFieldAnalyzerWrapper, specifying  
the KeywordAnalyzer for the fields you indexed as such.


The KeywordAnalyzer code will work with 1.4.3, so just grab that  
class and put it into your project.  A couple of variations of it are  
also included with the Lucene in Action code.
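Roughly, in code (a sketch only; the field names "id" and "contents" are made up, imports are the Lucene 1.9 package locations, and on 1.4.3 you'd copy KeywordAnalyzer into your own project as described above):

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Tokenized fields go through the default analyzer;
// fields registered with KeywordAnalyzer are left as single terms.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("id", new KeywordAnalyzer());

QueryParser parser = new QueryParser("contents", wrapper);
// "id:DOC_42" stays one untokenized term; "stemming" is analyzed normally.
Query q = parser.parse("id:DOC_42 AND contents:stemming");
```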


Erik


On Apr 5, 2006, at 7:52 AM, Satuluri, Venu_Madhav wrote:


Hi,

I am using lucene 1.4.3. Some of my fields are indexed as Keywords. I
also have subclassed Analyzer inorder to put stemming etc. I am not  
sure
if the input is tokenized when I am searching on keyword fields; I  
don't
want it to be. Do I need to have a special case in the overridden  
method

(Analyzer.tokenStream() ) to handle keyword fields?

I've noticed that there's a KeywordTokenizer now in the API, but  
its not

there for lucene 1.4.3. If I was using 1.9, I could probably determine
if the field was a keyword one and then return a
KeywordTokenizer(Reader), but I am using 1.4.3.

Any advice is appreciated.
-Venu







RE: Which Analyzer to use when searching on Keyword fields

2006-04-05 Thread Satuluri, Venu_Madhav
You understood me right, Erik. Your solution is working well, thanks.

Venu

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 05, 2006 6:03 PM
To: java-user@lucene.apache.org
Subject: Re: Which Analyzer to use when searching on Keyword fields


Venu,

I presume you're asking about what Analyzer to use with QueryParser.   
QueryParser analyzes all term text, but you can fake it for Keyword  
(non-tokenized) fields by using PerFieldAnalyzerWrapper, specifying  
the KeywordAnalyzer for the fields you indexed as such.

The KeywordAnalyzer code will work with 1.4.3, so just grab that  
class and put it into your project.  A couple of variations of it are  
also included with the Lucene in Action code.

Erik


On Apr 5, 2006, at 7:52 AM, Satuluri, Venu_Madhav wrote:

 Hi,

 I am using lucene 1.4.3. Some of my fields are indexed as Keywords. I
 also have subclassed Analyzer inorder to put stemming etc. I am not  
 sure
 if the input is tokenized when I am searching on keyword fields; I  
 don't
 want it to be. Do I need to have a special case in the overridden  
 method
 (Analyzer.tokenStream() ) to handle keyword fields?

 I've noticed that there's a KeywordTokenizer now in the API, but  
 its not
 there for lucene 1.4.3. If I was using 1.9, I could probably determine
 if the field was a keyword one and then return a
 KeywordTokenizer(Reader), but I am using 1.4.3.

 Any advice is appreciated.
 -Venu








searching offline

2006-04-05 Thread Delip Rao
Hi,

I have a large collection of text documents that I want to search
using lucene. Is there any command line utility that will allow me to
search this static collection of documents?

Writing one is an option but I want to know if anyone has already done this.

Thanks in advance,
Delip




RE: searching offline

2006-04-05 Thread Satuluri, Venu_Madhav
Red Piranha: http://red-piranha.sourceforge.net/

-Original Message-
From: Delip Rao [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 05, 2006 6:53 PM
To: java-user@lucene.apache.org
Subject: searching offline


Hi,

I have a large collection of text documents that I want to search
using lucene. Is there any command line utility that will allow me to
search this static collection of documents?

Writing one is an option but I want to know if anyone has already done
this.

Thanks in advance,
Delip






Re: searching offline

2006-04-05 Thread gekkokid

http://regain.sourceforge.net/ ?

- Original Message - 
From: Delip Rao [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Wednesday, April 05, 2006 2:23 PM
Subject: searching offline


Hi,

I have a large collection of text documents that I want to search
using lucene. Is there any command line utility that will allow me to
search this static collection of documents?

Writing one is an option but I want to know if anyone has already done this.

Thanks in advance,
Delip







Re: Re[4]: OutOfMemory with search(Query, Sort)

2006-04-05 Thread Yonik Seeley
On 4/5/06, Artem Vasiliev [EMAIL PROTECTED] wrote:
 The int[] array here contains references to String[] and to populate
 it still all the field values need to be loaded and compared/sorted

Terms are stored and iterated in sorted order, so no sorting needs to be done.
It's still the case that all the terms for that field need to be
iterated over though.

Another approach might be to store term vectors and retrieve the term
only from documents matching a particular query.  It might be slower
per query, but it wouldn't have the overhead of populating the int[].
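Roughly, that idea looks like this (a sketch; 'reader' is an open IndexReader and 'hitDocId' a matching document, both assumed, and the "author" field would need to have been indexed with term vectors enabled):

```java
// Requires the field to be indexed with term vectors
// (Field.TermVector.YES in the 1.9 API).
TermFreqVector tfv = reader.getTermFreqVector(hitDocId, "author");
// A single-valued, keyword-style field yields exactly one term.
String sortValue = (tfv != null) ? tfv.getTerms()[0] : null;
```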

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




WRITE_LOCK_TIMEOUT

2006-04-05 Thread Guido Neitzer

Hi.

Is it correct that in release 1.9.1 the WRITE_LOCK_TIMEOUT is hardcoded  
and there is no way to set it from outside?


I've seen a check-in in CVS from a few days ago which added  
getters/setters for this, but ... there is no release containing  
this, right?


So, my question is: is it safe to use a nightly build for production  
use?


Thanks,
cug






Re: Lucene or Nutch ?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote:
 1) High volume of data indexation but only with add and delete
 functionality (approximatively 10 PDF) = scalable architecture HDFS
 seems good.
 2) Specific analysis chain and a given set of meta-data indexation.
 3) Language Recognition
 4) No graphical interface for searching is needed, no crawling is
 needed, Indexation and Search are performed with HTTP Request to a Servlet

 What is the best starting choice for this : Lucene or Nutch ?

 As far as I know Lucene is a good choice for 2 and 4, Nutch is a better
 choice for 1 and 3.

Solr would also be good for 2 and 4.
As far as 1 goes, what kind of scalability requirements are we talking about? (#
documents, size of docs, etc.)

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Bill Janssen
 Hi.
 
 Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded  
 and there is no way to set it from outside?
 
 I've seen a check-in in the CVS from a few days ago which added  
 getters/setters for this, but ... there is no release containing  
 this, right?
 
 So, my question is: Is it save to use a nightly build for production  
 use?
 
 Thanks,
 cug

Or, as I suggested a couple of days ago, a 1.9.2 release could be offered.

Bill




Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Guido Neitzer

On 05.04.2006, at 17:15 Uhr, Bill Janssen wrote:

Or, as I suggested a couple of days ago, a 1.9.2 release could be  
offered.


That would be a good idea, because the current nightly builds have a lot  
of deprecated methods removed which were available in 1.9.1.


A lot of work just for this ... :-(

cug






Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope 
someone can help me with.


My application counts on Lucene maintaining the order of the documents 
exactly the same as how I insert them.  Lucene is supposed to maintain 
document order, even across index merges, correct?


My indexing process works as follows (and some of this is hold-over from 
the time before lucene had a compound file format - so bear with me)


I open up a File based index - using a merge factor of 90, and in my 
current test, the compound index format.  When I have added 100,000 
documents, I close this index, and start on a new index.  I continue 
this until I'm done with all of the documents.  Then, as a last step, I 
open up a new empty index, and I call addIndexes(Directory[]) - and I 
pass in the directories in the same order that I created them.



This allows me to use higher merge factors without running into file 
handle issues, and without having to call optimize.


The problem that I am seeing right now, is that when I look into my 
large combined index with Luke, Document number 899 is the 899th 
document that I added.  However, Document 900 is the 49860th document 
that I added.  This continues until Document 910, where it suddenly 
jumps to the 99720th document.


Is this a bug, or am I misusing something in the API?

Thanks,

Dan


--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




lucene sorting

2006-04-05 Thread Gian Marco Tagliani

Hi,
I need to change the Lucene sorting to give just a bit more relevance to 
recent documents (but I don't want to sort by date). I'd like to mix 
the Lucene score with the date of the document.


I'm following the example in Lucene in Action, chapter 6. I'm trying 
to extend SortComparatorSource, but I don't understand how to get 
the Lucene score of the document.


Do you have some idea about how to solve my problem?
Or do you know where to find more examples of custom sorting?

Thanks,
Gian Marco






Re: Lucene or Nutch ?

2006-04-05 Thread Bruno Grilheres

Thanks for your answer; I was not aware of the Solr project.

There was a big typo here: I meant less than 10 Go of PDF files per day 
for one month, i.e. less than 300 Go of PDF files in total.
I ran some tests with PDF files: 100 Mo of native PDF is converted to about 
3 Mo of index in Lucene [the text was indexed but not stored].


Bruno

Yonik Seeley wrote:

On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote:
  

1) High volume of data indexation but only with add and delete
functionality (approximatively 10 PDF) = scalable architecture HDFS
seems good.
2) Specific analysis chain and a given set of meta-data indexation.
3) Language Recognition
4) No graphical interface for searching is needed, no crawling is
needed, Indexation and Search are performed with HTTP Request to a Servlet

What is the best starting choice for this : Lucene or Nutch ?

As far as I know Lucene is a good choice for 2 and 4, Nutch is a better
choice for 1 and 3.



Solr would also be good for 2 and 4
As far as 1, what type of scalability requirements are we talking? (#
documents, size of docs, etc)

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server





  











Re: Lucene or Nutch ?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Bruno Grilheres [EMAIL PROTECTED] wrote:
 Thanks for your answer, I was not aware of the SOLR project,

 There was a big typo here, I meant less than 10 Go of PDF files per day
 during one month = i.e. less than 300 Go of PDF files.

Sorry, I'm not sure what the Go abbreviation is... I assume it's
Gigabytes (GB or GiB)?
If so, that's a lot.  I'd probably go with Nutch.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Chris Hostetter

: exactly the same as how I insert them.  Lucene is supposed to maintain
: document order, even across index merges, correct?

Lucene definitely maintains index order for document additions -- but I
don't know if any similar claim has been made about merging whole indexes.

: this until I'm done with all of the documents.  Then, as a last step, I
: open up a new empty index, and I call addIndexes(Directory[]) - and I
: pass in the directories in the same order that I created them.
...
: The problem that I am seeing right now, is that when I look into my
: large combined index with Luke, Document number 899 is the 899th
: document that I added.  However, Document 900 is the 49860th document
: that I added.  This continues until Document 910, where it suddenly
: jumps to the 99720th document.

As I said, I'm not sure whether it's a bug or undefined behavior, but
can you post a self-contained JUnit test demonstrating this? That way
people can look at exactly what is going on (if it is a bug).




-Hoss





Re: lucene sorting

2006-04-05 Thread Chris Hostetter

I don't know if there is any way for a custom sort to access the Lucene
score -- but another approach that works very well is to use the
FunctionQuery classes from Solr...

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html

...you can make a FunctionQuery object that scores things linearly (or
reciprocally, or any other function you implement in Java) based on the
value of any field -- and then add that query to a BooleanQuery along with
your original query and use the boost to determine how much of an
influence it has on your final score.
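Alternatively, without pulling in Solr, a cruder sketch of the same idea in plain Lucene 1.9 (the field name "date", the yyyyMMdd window values, the 0.3f boost, and 'originalQuery' are all made-up assumptions): add an optional clause that only recent documents match, and let its boost control the influence.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;

BooleanQuery combined = new BooleanQuery();
// The user's query still decides which documents match at all.
combined.add(originalQuery, BooleanClause.Occur.MUST);

// Documents whose "date" field (indexed as a yyyyMMdd keyword) falls in
// the recent window get a score bump; older documents are unaffected.
Query recent = new RangeQuery(new Term("date", "20060101"),
                              new Term("date", "20061231"), true);
recent.setBoost(0.3f);  // smaller boost = gentler recency preference
combined.add(recent, BooleanClause.Occur.SHOULD);
```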


-Hoss





Re: QueryParser error + solution

2006-04-05 Thread miki sun

Daniel, you are very clever! Your solution reminds me of this:
No temptation has overtaken you but such as is common to man; and God is 
faithful, who will not allow you to be tempted beyond what you are able, but 
with the temptation will provide the way of escape also, so that you will be 
able to endure it.

1 Corinthians 10:13 (New American Standard Version)

Well done Erik!


Original Message Follows
From: Daniel Noll [EMAIL PROTECTED]
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: QueryParser error + solution
Date: Wed, 05 Apr 2006 14:26:20 +1000

miki sun wrote:

Thanks Erik and Michael!

I copied some code from demo.SearchFiles.java, I do not have a more clearer 
tracing message. Now it works.


But do you have a better way than this:


[snip]

Something like this?

  String str = "Really bad query string: lots of evil stuff!";
  str = QueryParser.escape(str);

Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://www.nuix.com.au/Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.







Re: Optimize completely in memory with a FSDirectory?

2006-04-05 Thread Daniel Naber
On Wednesday, 05 April 2006 13:02, Max Pfingsthorn wrote:

 The setMaxBufferedDocs and related parameters help a lot already to
 fully exploit my RAM when indexing, but since I'm running a fairly small
 index of around 4 docs and I'm optimizing it relatively often, I was
 wondering if there is any way to enforce complete in-memory
 optimization.

Maybe you could use a RAMDirectory and write it to disk using 
IndexWriter.addIndexes() from time to time?
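Roughly like this (an untested sketch against the Lucene 1.9 API; 'analyzer', 'batch', and 'fsDirectory' are assumed to exist, and error handling is omitted):

```java
// Build the current batch of documents entirely in RAM...
RAMDirectory ram = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ram, analyzer, true);
for (Iterator it = batch.iterator(); it.hasNext();) {
    ramWriter.addDocument((Document) it.next());
}
ramWriter.close();

// ...then fold it into the on-disk index in a single merge, so the
// many tiny intermediate segments never touch the disk.
IndexWriter diskWriter = new IndexWriter(fsDirectory, analyzer, false);
diskWriter.addIndexes(new Directory[] { ram });
diskWriter.close();
```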

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust

Chris Hostetter wrote:

: exactly the same as how I insert them.  Lucene is supposed to maintain
: document order, even across index merges, correct?

Lucene definitely maintains index order for document additions -- but i
don't know if any similar claim has been made about merging whole indexes.

: this until I'm done with all of the documents.  Then, as a last step, I
: open up a new empty index, and I call addIndexes(Directory[]) - and I
: pass in the directories in the same order that I created them.
...
: The problem that I am seeing right now, is that when I look into my
: large combined index with Luke, Document number 899 is the 899th
: document that I added.  However, Document 900 is the 49860th document
: that I added.  This continues until Document 910, where it suddenly
: jumps to the 99720th document.

As I said, i'm not sure if it's a bug or undefined behavior, but
can you post a self contained JUnit test demonstrating this? -- that way
people can look at exactly what is going on (if it is a bug).




-Hoss






Well, I set out to write a JUnit test case to quickly show this... but 
I'm having a heck of a time doing it. With relatively small numbers of 
documents containing very few fields, I haven't been able to recreate 
the out-of-order problem. However, with my real process, with a ton 
more data, I can recreate it every single time I index (it even gets the 
same documents out of order, consistently).


I'll continue to try to generate a test case that gets the docs out of 
order... but if someone in the know could answer authoritatively whether 
or not Lucene is supposed to maintain document order when you merge 
multiple indexes together, that would be great.


Thanks,

Dan

--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote:
 I'll continue to try to generate a test case that gets the docs out of
 order... but if someone in the know could answer authoritatively whether

I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks
like it should preserve order.
The directories are added in order, and the segments for each
directory are added in order.  The merging code is shared, so that
shouldn't do anything different than normal segment merges.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote:
 I haven't been able to recreate
 the out-of-order problem.  However, with my real process, with a ton
 more data, I can recreate it every single time I index (it even gets the
 same documents out of order, consistently).

If you have enough file handles, you can test if it's a Lucene problem
or your app by opening a MultiReader over all the indexes and testing
if the documents are in the order you think they are *before* merging.
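A sketch of that check (Lucene 1.9; 'dirs' is the array of part-index Directory objects in creation order, and "order" is the stored counter field — both assumed, and it also assumes no deleted docs):

```java
// Open every part index and view them as one, in the order given.
IndexReader[] parts = new IndexReader[dirs.length];
for (int i = 0; i < dirs.length; i++) {
    parts[i] = IndexReader.open(dirs[i]);
}
IndexReader all = new MultiReader(parts);

// The stored load-order counter should equal the combined doc id.
for (int id = 0; id < all.maxDoc(); id++) {
    int stored = Integer.parseInt(all.document(id).get("order"));
    if (stored != id) {
        System.out.println("out of order at doc " + id + ": " + stored);
    }
}
all.close();
```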

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust

Yonik Seeley wrote:

On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote:

I'll continue to try to generate a test case that gets the docs out of
order... but if someone in the know could answer authoritatively whether


I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks
like it should preserve order.
The directories are added in order, and the segments for each
directory are added in order.  The merging code is shared, so that
shouldn't do anything different than normal segment merges.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server





Thanks for checking, Yonik. I'm fairly certain that this is a Lucene bug 
then - I will try to come up with a reproducible test case.


My load code is pretty simple... whenever I create a new document, I put 
in a field that contains a counter of the load order.


When I look at the individual indexes, things are fine - but after it 
merges them, I get a significant percentage of documents which have been 
reordered.


One other thing I can look into - I've been building these indexes on a 
64-bit Linux machine, using a 64-bit JVM. I need to see if the same 
error happens on 32-bit Windows.


Dan

--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




Re: Lucene Document order not being maintained?

2006-04-05 Thread Chris Hostetter

: Well, I set out to write  JUnit test case to quickly show this... but
: I'm having a heck of a time doing it.  With relatively small numbers of
: documents containing very few fields... I haven't been able to recreate
: the out-of-order problem.  However, with my real process, with a ton
: more data, I can recreate it every single time I index (it even gets the
: same documents out of order, consistently).

it's very possible that the problem is specific to large numbers of
documents/indexes, or that it's specific to FSDirectory - so if you can't
reproduce with a handful of docs on a RAMDirectory, don't shy away from
making a test case that creates 10 1GB indexes in ./test-doc-order-on-merge
or something like that, if that's the only way to reproduce the problem.

just warn us if it's not obvious from the code that it does that :)




-Hoss





Re: Lucene Document order not being maintained?

2006-04-05 Thread Doug Cutting

Dan Armbrust wrote:
My indexing process works as follows (and some of this is hold-over from 
the time before lucene had a compound file format - so bear with me)


I open up a File based index - using a merge factor of 90, and in my 
current test, the compound index format.  When I have added 100,000 
documents, I close this index, and start on a new index.  I continue 
this until I'm done with all of the documents.  Then, as a last step, I 
open up a new empty index, and I call addIndexes(Directory[]) - and I 
pass in the directories in the same order that I created them.


This allows me to use higher merge factors without running into file 
handle issues, and without having to call optimize.


As others have noted, this should work correctly.

I assume that your merge factor when calling addIndexes() is less than 
90.  If it's 90, then what you're doing is the same as Lucene would 
automatically do.  I think you could save yourself a lot of trouble if 
you simply lowered your merge factor substantially and then indexed 
everything in one pass.  To make things go faster, set 
maxBufferedDocs=100 or larger.  This should be as fast as what you're 
doing now and a lot simpler.
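In code, the single-pass variant might look like this (a sketch; 'fsDirectory' and 'analyzer' are assumed, and the exact numbers are illustrative, not recommendations):

```java
// One writer, one pass: let Lucene do the merging itself.
IndexWriter writer = new IndexWriter(fsDirectory, analyzer, true);
writer.setMergeFactor(10);         // modest fan-out keeps file handles bounded
writer.setMaxBufferedDocs(1000);   // buffer more docs in RAM before each flush
// ... addDocument() for every document, in load order ...
writer.close();
```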


Or is that the part where I was supposed to bear with you?

Doug




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Doug Cutting [EMAIL PROTECTED] wrote:

 As others have noted, this should work correctly.

One slight oddity I noticed with addIndexes(Dir[]) is that merging
starts at one past the first new segment added (not the first new
segment).  It doesn't seem like that should hurt much though.


-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Throughput doesn't increase when using more concurrent threads

2006-04-05 Thread Peter Keegan
 Out of interest, does indexing time speed up much on 64-bit hardware?

I was able to speed up indexing on 64-bit platform by taking advantage of
the larger address space to parallelize the indexing process. One thread
creates index segments with a set of RAMDirectories and another thread
merges the segments to disk with 'addIndexes'. This resulted in a speed
improvement of 27%.

Peter


On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
  getting 250 queries/sec and excellent cpu utilization (equal concurrency
 on
  all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
 aware
  of it.
 
 Wow.  That's fast.

 Out of interest, does indexing time speed up much on 64-bit hardware?
 I'm particularly interested in this side of things because for our own
 application, any query response under half a second is good enough, but
 the indexing side could always be faster. :-)

 Daniel

 --
 Daniel Noll

 Nuix Australia Pty Ltd
 Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
 Phone: (02) 9280 0699
 Fax:   (02) 9212 6902

 This message is intended only for the named recipient. If you are not
 the intended recipient you are notified that disclosing, copying,
 distributing or taking any action in reliance on the contents of this
 message or attachment is strictly prohibited.






Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust

Yonik Seeley wrote:

For your test case, try lowering numbers, such as maxBufferedDocs=2,
mergeFactor=2 or 3
to create more segments more quickly and cause more merges with fewer documents.


Good suggestion.  A merge factor of 2 made it happen much more quickly. 
 Bug is filed:


http://issues.apache.org/jira/browse/LUCENE-540

JUnit test case is attached (although it may not be in the proper format 
for Lucene - but I think it's pretty straightforward)
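Yonik's suggestion works because, roughly, a segment is flushed every `maxBufferedDocs` documents and a merge fires whenever `mergeFactor` segments pile up at one level. The toy base-`mergeFactor` counter below makes the effect visible; it is an approximation for illustration, not Lucene's actual merge policy.

```java
// Toy model of buffered-docs / merge-factor behavior (an approximation,
// not Lucene's real merge policy): a segment is flushed every
// maxBufferedDocs documents, and whenever mergeFactor segments
// accumulate at one level they merge into one segment at the next level.
public class MergeCounter {
    public static int countMerges(int docs, int maxBufferedDocs, int mergeFactor) {
        int merges = 0;
        int[] levels = new int[32];               // segment counts per level
        int flushes = docs / maxBufferedDocs;     // one new segment per flush
        for (int f = 0; f < flushes; f++) {
            levels[0]++;
            for (int lvl = 0; levels[lvl] == mergeFactor; lvl++) {
                levels[lvl] = 0;                  // merge this level...
                levels[lvl + 1]++;                // ...into one bigger segment
                merges++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        // Tiny settings trigger merges almost immediately:
        System.out.println(countMerges(100, 2, 2));   // prints 47
        // Larger settings need far more documents per merge:
        System.out.println(countMerges(100, 10, 10)); // prints 1
    }
}
```

With `maxBufferedDocs=2, mergeFactor=2`, a hundred documents already exercise dozens of merges, which is exactly why the small settings reproduced the bug so quickly.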


Dan

--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust

Doug Cutting wrote:


I assume that your merge factor when calling addIndexes() is less than 
90.  If it's 90, then what you're doing is the same as Lucene would 
automatically do.  I think you could save yourself a lot of trouble if 
you simply lowered your merge factor substantially and then indexed 
everything in one pass.  To make things go faster, set 
maxBufferedDocs=100 or larger.  This should be as fast as what you're 
doing now and a lot simpler.


Or is that the part where I was supposed to bear with you?

Doug



Yep.  This code was written when I had to index tons of stuff on Linux, 
and was constantly running into file handle issues (even with low merge 
factors).  I ended up writing a wrapper for Lucene that handled it all 
for me, and I've just been reusing it.  Then today, I ran into this 
issue.  It may be time to rework some of the wrapper to take advantage 
of the Lucene updates :)


Dan


--

Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
On 4/5/06, Dan Armbrust [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:
  For your test case, try lowering numbers, such as maxBufferedDocs=2,
  mergeFactor=2 or 3
  to create more segments more quickly and cause more merges with fewer 
  documents.

 Good suggestion.  A merge factor of 2 made it happen much more quickly.
   Bug is filed:

 http://issues.apache.org/jira/browse/LUCENE-540

Thanks Dan, I'll look into it tonight, as promised.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
Ah Ha! I found the problem.

SegmentInfos.read(Directory directory) reads the segment info in reverse order!
I gotta go home now... I'll look into the right fix later (it depends
on what else uses that method...)

FYI, I managed to reproduce it with only 3 documents in each index.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
Spoke too soon... the loop counter goes down to zero, but it looks
like the segments are added in order.

  for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
SegmentInfo si =
  new SegmentInfo(input.readString(), input.readInt(), directory);
addElement(si);
  }
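Indeed: the decrementing index only counts how many entries remain, while each iteration still reads the next entry from the stream, so order is preserved. A stand-alone rendition of the same pattern, with a list iterator standing in for the `IndexInput`:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CountdownRead {
    // The loop index only tracks how many entries are left; each
    // iteration still reads the *next* entry from the stream.
    public static List<String> read(List<String> stored) {
        Iterator<String> input = stored.iterator();   // stand-in for IndexInput
        List<String> segments = new ArrayList<>();
        for (int i = stored.size(); i > 0; i--) {
            segments.add(input.next());               // sequential read
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(read(List.of("_0", "_1", "_2")));
        // prints [_0, _1, _2] -- original order, despite the countdown
    }
}
```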

On 4/5/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 Ah Ha! I found the problem.

 SegmentInfos.read(Directory directory) reads the segment info in reverse 
 order!
 I gotta go home now... I'll look into the right fix later (it depends
 on what else uses that method...)

 FYI, I managed to reproduce it with only 3 documents in each index.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
I realized what the real problem was during the drive home.

Merged segments are added after all other segments, instead of at the
spot where the original segments resided.
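The effect is easy to reproduce with plain lists of segment names (a stand-alone illustration, not Lucene code): appending the merged segment moves its documents after the later segments, while writing it back at the start of the merged range keeps document order intact.

```java
import java.util.ArrayList;
import java.util.List;

public class MergePlacement {
    // Buggy behavior: remove the merged range, append the result at the end.
    static List<String> mergeAppend(List<String> segs, int min, int end) {
        List<String> out = new ArrayList<>(segs);
        String merged = "(" + String.join("+", out.subList(min, end)) + ")";
        out.subList(min, end).clear();
        out.add(merged);                 // merged docs now follow later segments
        return out;
    }

    // Fixed behavior: put the merged segment where the range started.
    static List<String> mergeInPlace(List<String> segs, int min, int end) {
        List<String> out = new ArrayList<>(segs);
        String merged = "(" + String.join("+", out.subList(min, end)) + ")";
        out.subList(min + 1, end).clear();
        out.set(min, merged);            // document order is preserved
        return out;
    }

    public static void main(String[] args) {
        List<String> segs = List.of("_0", "_1", "_2", "_3");
        System.out.println(mergeAppend(segs, 1, 3));   // [_0, _3, (_1+_2)]
        System.out.println(mergeInPlace(segs, 1, 3));  // [_0, (_1+_2), _3]
    }
}
```

This is exactly the shape of the fix that follows: replace the remove-then-`addElement` sequence with a `set` at the start of the merged range.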

I'll propose a patch soon...

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
OK, the following patch seems to work for me!
You might want to try it out on your larger test Dan.

The first part probably isn't necessary (the base=start instead of
start+1), but the second part is.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



Index: org/apache/lucene/index/IndexWriter.java
===
--- org/apache/lucene/index/IndexWriter.java(revision 391084)
+++ org/apache/lucene/index/IndexWriter.java(working copy)
@@ -569,7 +569,7 @@

 // merge newly added segments in log(n) passes
 while (segmentInfos.size() > start+mergeFactor) {
-  for (int base = start+1; base < segmentInfos.size(); base++) {
+  for (int base = start; base < segmentInfos.size(); base++) {
 int end = Math.min(segmentInfos.size(), base+mergeFactor);
 if (end-base > 1)
   mergeSegments(base, end);
@@ -710,9 +710,9 @@
   infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
 }

-for (int i = end-1; i >= minSegment; i--) // remove old infos & add new
+for (int i = end-1; i > minSegment; i--) // remove old infos & add new
   segmentInfos.remove(i);
-segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
+segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
 directory));

 // close readers before we attempt to delete now-obsolete segments




Re: Lucene Document order not being maintained?

2006-04-05 Thread Yonik Seeley
addIndexes(Dir[]) was the only user of mergeSegments() that passed an
endpoint that wasn't the end of the segment list, and hence the only
caller to mergeSegments() that will see a change of behavior.

Given that, I feel comfortable enough to commit this.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

On 4/5/06, Yonik Seeley [EMAIL PROTECTED] wrote:
 OK, the following patch seems to work for me!
 You might want to try it out on your larger test Dan.

 The first part probably isn't necessary (the base=start instead of
 start+1), but the second part is.

 -Yonik
 http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



 Index: org/apache/lucene/index/IndexWriter.java
 ===
 --- org/apache/lucene/index/IndexWriter.java(revision 391084)
 +++ org/apache/lucene/index/IndexWriter.java(working copy)
 @@ -569,7 +569,7 @@

  // merge newly added segments in log(n) passes
  while (segmentInfos.size() > start+mergeFactor) {
 -  for (int base = start+1; base < segmentInfos.size(); base++) {
 +  for (int base = start; base < segmentInfos.size(); base++) {
  int end = Math.min(segmentInfos.size(), base+mergeFactor);
  if (end-base > 1)
mergeSegments(base, end);
 @@ -710,9 +710,9 @@
   infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
  }

 -for (int i = end-1; i >= minSegment; i--) // remove old infos & add new
 +for (int i = end-1; i > minSegment; i--) // remove old infos & add new
   segmentInfos.remove(i);
 -segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
 +segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
  directory));

  // close readers before we attempt to delete now-obsolete segments





Re: Lucene Document order not being maintained?

2006-04-05 Thread Dan Armbrust
Thanks guys as always... Lucene (and especially the people behind
it) are top notch.

Less than 6 hours from the time I figured out that the bug was in
Lucene (and not my code, which is usually the case) - and it's already
fixed (I'm going to assume - I'll test it tomorrow when I get to work)

Amazing.

Thanks again,

Dan




Re: highlighting - fuzzy search

2006-04-05 Thread Daniel Noll

mark harwood wrote:
Isn't that what Query.extractTerms is for?  Isn't it 
implemented by all primitive Queries?


As of last week, yes. I changed the SpanQueries to
implement this method and then refactored the
Highlighter package's QueryTermExtractor to make use
of this (it radically simplified the code in there).
This change to rely on extractTerms also means that
the highlighter now works properly with classes like
FilteredQuery.


Very nice.  Yet another point I can add onto the huge list of reasons 
our app should update Lucene. :-)


Although I'd rather not rewrite the query first, it feels like it would 
use more memory than an extractTerms(IndexReader) method would.  Maybe 
I'm wrong on this, though.


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph:  +61 2 9280 0699
Web: http://www.nuix.com.au/
Fax: +61 2 9212 6902

