Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik
Mark Miller wrote: The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position sensitive highlighting: https://issues.apache.org/jira/browse/LUCENE-794 It seems that the patch does not work with Lucene 2.2 as I get some compile errors. Is this really

Re: Self Join Query

2008-01-10 Thread sachin
Here are more details about my issue. I have two tables in database. A row in table 1 can have multiple rows associated with it in table 2. It is a one to many mapping. Let's say a row in table 1 is A and it has multiple rows B1, B2 and B3 associated with it in table 2. I need to search on both

Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
Oh yeah...something that you may not have seen is that this has a dependency on MemoryIndex from contrib. You need that jar as well. - Mark Marjan Celikik wrote: Mark Miller wrote: The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position sensitive

Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
It should work no problem with 2.2. What are the compile errors you are getting? If you send me a note directly I will send you a jar. - Mark Marjan Celikik wrote: Mark Miller wrote: The contrib Highlighter doesn't know and highlights them all. Check out my patch here for position

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks to all of you for your answers. I am going to change a few things in my application and run some tests. One thing: I haven't found another good PDF-to-text converter like PDFBox. Do you know of a faster one? Greetings, and thanks for your answers. Ariel On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL

Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
The Highlighter works by comparing the TokenStream of the document with the Tokens in the query. The TokenStream can be rebuilt from the index if you use TermVectors with TokenSources or you can get it by reanalyzing the document. Each Token from the TokenStream is checked against Tokens in
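A rough illustration of the term-by-term matching described above, in plain Java with no Lucene classes. The class and method names are hypothetical, and whitespace splitting stands in for a real Analyzer; this is a sketch of the idea, not the contrib Highlighter's code. It shows why a phrase query ends up with every occurrence of each query term highlighted, wherever it sits in the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: position-blind highlighting by matching tokens against query terms.
final class TermMatchHighlighter {

    // Wraps every document token that equals some query term in <B> tags.
    static String highlight(String text, String query) {
        // Assumption: lowercasing + whitespace splitting replaces a real Analyzer.
        Set<String> queryTerms =
                new HashSet<>(Arrays.asList(query.trim().toLowerCase().split("\\s+")));
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (queryTerms.contains(token.toLowerCase())) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

With the phrase "new york" and the text "new york is not york new", all four occurrences of the two terms are marked, including the reversed pair at the end.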

Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik
Mark Miller wrote: Oh yeah...something that you may not have seen is that this has a dependency on MemoryIndex from contrib. You need that jar as well. - Mark Hm, I need the source code. How do I download the files from https://issues.apache.org/jira/browse/LUCENE-794 (all I see are some

Re: Self Join Query

2008-01-10 Thread Paul Elschot
Sachin, As the merging of the results is the issue, I'll assume that you don't have clear user requirements for that. The simplest way out of that is to allow the users to search the B's first, and once they have determined which B's they'd like to use, use those B's to limit the results in of

Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik
Mark Miller wrote: The Highlighter works by comparing the TokenStream of the document with the Tokens in the query. The TokenStream can be rebuilt from the index if you use TermVectors with TokenSources or you can get it by reanalyzing the document. Each Token from the TokenStream is checked

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
In a distributed environment the application has to make heavy use of the network, and there is no way to access the documents in a remote repository other than through the NFS file system. One thing I must clarify: I index the documents in memory, using RAMDirectory, then

RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
Thanks for the post. So you're using the doc id as the key into the cache to retrieve the external id. Then what mechanism fetches the external id's from the searcher and places them in the cache? -Original Message- From: Antony Bowesman [mailto:[EMAIL PROTECTED] Sent: Wednesday,

Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik
Marjan Celikik wrote: Mark Miller wrote: The Highlighter works by comparing the TokenStream of the document with the Tokens in the query. The TokenStream can be rebuilt from the index if you use TermVectors with TokenSources or you can get it by reanalyzing the document. Each Token from the

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Erick Erickson
This seems really clunky. Especially if your merge step also optimizes. There's not much point in indexing into RAM then merging explicitly. Just use an FSDirectory rather than a RAMDirectory. There is *already* buffering built in to FSDirectory, and your merge factor etc. control how much RAM is

Re: Highlighting + phrase queries

2008-01-10 Thread Marjan Celikik
Mark Miller wrote: That is why the original contrib does not work with PhraseQuery's. It simply matches Tokens from the query with those in the TokenStream. LUCENE-794 takes the TokenStream and shoves it into a MemoryIndex. Then, after converting the query to a SpanQuery approximation,
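To make "position sensitive" concrete, here is a small plain-Java sketch. It does not use MemoryIndex or span queries as LUCENE-794 does; it swaps in a simple scan for consecutive tokens, and all names are hypothetical. Only tokens that take part in a complete, in-order occurrence of the phrase are marked.

```java
// Hypothetical sketch: highlight tokens only where the whole phrase occurs in order.
// This is a simplified stand-in for the span-based approach, not the patch's code.
final class PhrasePositionHighlighter {

    static String highlight(String text, String phrase) {
        // Assumption: whitespace splitting replaces a real Analyzer.
        String[] tokens = text.trim().split("\\s+");
        String[] terms = phrase.trim().toLowerCase().split("\\s+");
        boolean[] hit = new boolean[tokens.length];

        // Try every start position and check that the terms follow consecutively.
        for (int start = 0; start + terms.length <= tokens.length; start++) {
            boolean match = true;
            for (int i = 0; i < terms.length && match; i++) {
                match = tokens[start + i].toLowerCase().equals(terms[i]);
            }
            if (match) {
                for (int i = 0; i < terms.length; i++) {
                    hit[start + i] = true;
                }
            }
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(hit[i] ? "<B>" + tokens[i] + "</B>" : tokens[i]);
        }
        return out.toString();
    }
}
```

For the phrase "new york" and the text "new york is not york new", only the first two tokens are marked; the reversed pair at the end is left alone.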

Re: Highlighting + phrase queries

2008-01-10 Thread Mark Miller
I don't think you would see much of a gain. Shoving the TokenStream into the MemoryIndex is actually pretty fast and I wouldn't be surprised if it was much faster than reading from disk. Most of the computational time is spent in reconstructing the TokenStream, whether you use term-vectors or

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Michael McCandless
If possible you should also test the soon-to-be-released version 2.3, which has a number of speedups to indexing. Also try the steps here: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed You should also try an A/B test: A) writing your index to the NFS directory and then B) to
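One way to run such an A/B comparison is a tiny timing helper like the one below. It is only a sketch: the class name is made up, and the two tasks would be your own indexing job pointed first at the NFS-mounted directory and then at a local one.

```java
// Hypothetical helper for an A/B timing comparison; not part of Lucene.
final class AbTimer {

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    // Formats both timings side by side so the difference is easy to read.
    static String report(String labelA, long millisA, String labelB, long millisB) {
        return labelA + ": " + millisA + " ms, " + labelB + ": " + millisB + " ms";
    }
}
```

A single run is noisy, so it is worth repeating each variant a few times with the same document set before drawing conclusions.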

RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
Ok, I've been thinking about this some more. Is the cache mechanism pulling from the cache if the external id already exists there and then hitting the searcher if it's not already in the cache (maybe using a FieldSelector for just retrieving the external id)? -Original Message- From:
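The mechanism guessed at above (look in the cache first, go to the searcher only on a miss) can be sketched as a read-through cache. Everything here is hypothetical: the class name, and the loader function that stands in for fetching the external id from the index, for example with a FieldSelector. Whether the original poster's cache works this way is not confirmed in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical read-through cache from Lucene doc id to external id.
final class ExternalIdCache {
    private final Map<Integer, String> cache = new HashMap<>();
    private final IntFunction<String> loader; // stand-in for the searcher lookup
    private int loads = 0;

    ExternalIdCache(IntFunction<String> loader) {
        this.loader = loader;
    }

    // Returns the cached external id, loading and storing it on a miss.
    String externalId(int docId) {
        String id = cache.get(docId);
        if (id == null) {
            id = loader.apply(docId);
            loads++;
            cache.put(docId, id);
        }
        return id;
    }

    // Number of times the loader was actually called.
    int loadCount() {
        return loads;
    }
}
```

Because doc ids are only meaningful for the reader they came from, a cache like this would need to be rebuilt whenever the index is reopened.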

Retrieve number of terms

2008-01-10 Thread chris.b
I'm sure this has been asked a few times before, but I searched and searched and found no answer (apart from using Luke). I would like to know if there's a way of retrieving the number of terms in an index. I tried cycling through a TermEnum, but it doesn't do anything :|

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
I am indexing into RAM and then merging explicitly because my application demands it: I designed it as a distributed environment, so many threads or workers on different machines index into RAM and serialize to disk, and another thread on another machine accesses the segment index to merge it

Re: Retrieve number of terms

2008-01-10 Thread Luis Rodrigo
Hi Chris, by number of terms, do you mean the number of distinct terms that make up the index, or the total number of terms, including repetitions? chris.b wrote: I'm sure this has been asked a few times before, but I searched and searched and found no answer (apart from using Luke),
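The two readings of "number of terms" can be shown on a toy corpus. This is plain Java with hypothetical names and whitespace tokenization, not a Lucene call; it only illustrates the difference between counting distinct terms and counting every occurrence.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: distinct terms versus total term occurrences.
final class TermCounts {

    // Number of different terms across all documents.
    static int distinctTerms(String... docs) {
        Set<String> seen = new HashSet<>();
        for (String doc : docs) {
            for (String token : doc.trim().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    seen.add(token);
                }
            }
        }
        return seen.size();
    }

    // Number of term occurrences across all documents, repetitions included.
    static int totalTerms(String... docs) {
        int total = 0;
        for (String doc : docs) {
            for (String token : doc.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    total++;
                }
            }
        }
        return total;
    }
}
```

For the two documents "a b a" and "b c" there are three distinct terms (a, b, c) but five occurrences in total.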

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
Ariel, Comments inline. - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, January 10, 2008 10:05:28 AM Subject: Re: Why is lucene so slow indexing in nfs file system ? In a distributed environment the application should make an exhaustive

Re: how do I get my own TopDocHitCollector?

2008-01-10 Thread Antony Bowesman
Beard, Brian wrote: Ok, I've been thinking about this some more. Is the cache mechanism pulling from the cache if the external id already exists there and then hitting the searcher if it's not already in the cache (maybe using a FieldSelector for just retrieving the external id)? I am warming

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks for your suggestions. I'm sorry, I didn't know: what do you mean by SAN and FC? Another thing: I visited the Lucene home page and version 2.3 has not been released there. Could you tell me where the download link is? Thanks in advance. Ariel On Jan 10, 2008

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Chris Lu
SAN is Storage Area Network. FC is Fibre Channel. I can confirm from one customer's experience that using a SAN scales pretty well and is pretty simple. Well, it costs some money. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site:

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
2.3 is in the process of being released. Give it another week to 10 days and it will be out. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, January 10, 2008 6:26:44 PM