Realtime search best practices

2009-10-12 Thread melix
Hi, I'm going to replace an old reader/writer synchronization mechanism we had implemented with the new near realtime search facilities in Lucene 2.9. However, it's still a bit unclear how to do it efficiently. Is the following implementation a good way to achieve it? The context is

Lucene

2009-10-12 Thread nja
Hi, I am using StandardAnalyzer for indexing as well as searching the indexes. But my search doesn't work correctly with special characters. I am storing some special characters in a field called TransType, i.e. document.add(new Field(TransType, db92fb60-b716-11de-8718-001a4bc7d46e,

Re: How do you properly use NumericField

2009-10-12 Thread Paul Taylor
Uwe Schindler wrote: I forgot: The format of numeric fields is also not plain text; because of this, a simple TermQuery as generated by your query parser will not work either. If you want to hit numeric values without a NumericRangeQuery with lower and upper bound equal, you have to use

faceted search performance

2009-10-12 Thread Christoph Boosz
Hi, I have a question related to faceted search. My index contains more than 1 million documents, and nearly 1 million terms. My aim is to get a DocIdSet for each term occurring in the result of a query. I use the approach described on

Re: faceted search performance

2009-10-12 Thread Paul Elschot
On Monday 12 October 2009 14:53:45 Christoph Boosz wrote: Hi, I have a question related to faceted search. My index contains more than 1 million documents, and nearly 1 million terms. My aim is to get a DocIdSet for each term occurring in the result of a query. I use the approach described

RE: How do you properly use NumericField

2009-10-12 Thread Uwe Schindler
Can you print the upper and lower term or the term you received in newRangeQuery and newTermQuery also to System.out? Maybe it is converted somehow by your Analyzer, that is used for parsing the query. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail:

Re: How do you properly use NumericField

2009-10-12 Thread Paul Taylor
Uwe Schindler wrote: Can you print the upper and lower term or the term you received in newRangeQuery and newTermQuery also to System.out? Maybe it is converted somehow by your Analyzer, that is used for parsing the query. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen

Re: faceted search performance

2009-10-12 Thread John Wang
Given you have 1M docs and about 1M terms, do you see very few docs per term? If your DocSet per term is very sparse, BitSet is probably not a good representation. A simple int array may be better for memory, and faster for iterating. -John On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot
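To illustrate the point about sparse sets, here is a minimal plain-Java sketch (no Lucene dependency) comparing a sorted int array with a bit set sized to the whole index. The class SparseDocIds and its methods are hypothetical names invented for this illustration, and the byte counts are rough estimates that ignore object overhead.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helper: a sparse doc-ID set backed by a sorted int array. */
final class SparseDocIds {
    private final int[] docs; // sorted, distinct doc IDs

    SparseDocIds(int[] unsortedDocs) {
        int[] copy = unsortedDocs.clone();
        Arrays.sort(copy);
        // Drop duplicates in place, keeping the first of each run.
        int n = 0;
        for (int i = 0; i < copy.length; i++) {
            if (i == 0 || copy[i] != copy[i - 1]) {
                copy[n++] = copy[i];
            }
        }
        this.docs = Arrays.copyOf(copy, n);
    }

    int size() {
        return docs.length;
    }

    /** Membership test by binary search over the sorted array. */
    boolean contains(int doc) {
        return Arrays.binarySearch(docs, doc) >= 0;
    }

    /** Rough payload size: 4 bytes per stored doc ID. */
    long approxBytes() {
        return 4L * docs.length;
    }

    /** Rough payload size of a bit set covering maxDoc documents: 1 bit per doc. */
    static long bitSetBytes(int maxDoc) {
        return (maxDoc + 7L) / 8L;
    }

    /** Counts how many of this term's docs are also in the query's hit set. */
    int intersectionCount(BitSet hits) {
        int count = 0;
        for (int doc : docs) {
            if (hits.get(doc)) {
                count++;
            }
        }
        return count;
    }
}
```

Under these assumptions, a term that occurs in a handful of documents costs a few dozen bytes as an int array, while a bit set over 1M documents costs roughly 125 KB per term no matter how few bits are set. Iterating the array also touches only the docs that actually contain the term.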

Re: Getting left and right offsets of term search results

2009-10-12 Thread Till Kolter
Thanks a lot. I think TermPositionsVector will solve my problem, although it seems to perform a little poorly. Concerning the term representation: our data is way more complex than just phrasal annotation; it was just an example, because I am not allowed to talk about our internal organisation. I

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Hi Cedric, I don't know of anyone with a substantial throughput production system who is doing realtime search with the 2.9 improvements yet (and in fact, no serious performance analysis has been done on these even in the lab so to speak: follow https://issues.apache.org/jira/browse/LUCENE-1577

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix jake.man...@gmail.com wrote: Wait, so according to the javadocs, the IndexReader which you got from the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(), which means that if the user has a NRT reader, and the user keeps calling

Re: faceted search performance

2009-10-12 Thread Christoph Boosz
Hi Jake, Thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 12:26 PM, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix jake.man...@gmail.com wrote: Wait, so according to the javadocs, the IndexReader which you got from the IndexWriter forwards calls to reopen() back to

Re: new sorting api and some perf numbers

2009-10-12 Thread Bradford Stephens
Wow! This is awesome. Can't wait to see how it plays with Bobo :) On Sun, Oct 11, 2009 at 10:19 PM, John Wang john.w...@gmail.com wrote: Hi guys: The new FieldComparator api looks really scary :) But after some perf testing with numbers I'd like to share, I guess it is worth it: HW:

querying multi-value fields

2009-10-12 Thread Angel, Eric
I have documents that store multiple values in some fields (using the document.add(new Field()) with the same field name). Here's what a typical document looks like: doc.option=value1 aaa doc.option=value2 bbb doc.option=value3 ccc I want my queries to only match individual values, for

Re: Realtime search best practices

2009-10-12 Thread John Wang
Oh, that is really good to know! Is this deterministic? e.g. as long as writer.addDocument() is called, next getReader reflects the change? Does it work with deletes? e.g. writer.deleteDocuments()? Thanks Mike for clarifying! -John On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless

Re: querying multi-value fields

2009-10-12 Thread Adriano Crestani
Hi Eric, To achieve what you want, do not tokenize the values you query/add to this field. On Mon, Oct 12, 2009 at 4:05 PM, Angel, Eric ean...@business.com wrote: I have documents that store multiple values in some fields (using the document.add(new Field()) with the same field name). Here's

Re: querying multi-value fields

2009-10-12 Thread Jake Mannix
Or else just make sure that you use PhraseQuery to hit this field when you want value1 aaa. If you don't tokenize these pairs, then you will have to do prefix/wildcard matching to hit just value1 by itself (if this is allowed by your business logic). -jake On Mon, Oct 12, 2009 at 1:21 PM,

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Guys, please - you're not new at this... this is what JavaDoc is for: /** * Returns a readonly reader containing all * current updates. Flush is called automatically. This * provides near real-time searching, in that changes * made during an IndexWriter session can be made *

RE: querying multi-value fields

2009-10-12 Thread Angel, Eric
I need to analyze these values since I also want the benefits of porterStemmer. The problem with using PhraseQuery is that I don't always know the slop. I may have values like value4 ddd aaa. It's a tricky problem because I think Lucene sees all these values as one long value for the field option.

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Thanks Yonik, It may be surprising, but in fact I have read that javadoc. It talks about not needing to close the writer, but doesn't specifically talk about what the relationship between commit() calls and getReader() calls is. I suppose I should have interpreted: @returns a new reader

Re: querying multi-value fields

2009-10-12 Thread Erick Erickson
"I think Lucene sees all these values as one long value for the field option." Not quite. Starting with the second add, a call will be made to getPositionIncrementGap in your analyzer. If you return a number larger than one, then the offsets between the last term of the preceding add and the first
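The effect of a position gap between the values of a multi-valued field can be sketched in plain Java. This is a simplified model, not Lucene's actual position accounting: the class PositionGapSketch and its methods are hypothetical, tokens are split on whitespace only, and the proximity check handles just an ordered pair of terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of term positions in a multi-valued field. */
final class PositionGapSketch {

    /**
     * Returns the positions at which term occurs when each value is tokenized
     * on whitespace and gap extra positions are skipped between values.
     */
    static List<Integer> positionsOf(String term, String[] values, int gap) {
        List<Integer> found = new ArrayList<>();
        int pos = -1;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                pos += gap; // leave a hole before every value after the first
            }
            for (String token : values[v].trim().split("\\s+")) {
                pos++;
                if (token.equals(term)) {
                    found.add(pos);
                }
            }
        }
        return found;
    }

    /**
     * True if second occurs after first with at most slop positions between
     * them. Ordered matches only; a real sloppy phrase query is more general.
     */
    static boolean near(String first, String second, String[] values, int gap, int slop) {
        for (int a : positionsOf(first, values, gap)) {
            for (int b : positionsOf(second, values, gap)) {
                if (b > a && b - a - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In this model, with the values "value1 aaa", "value2 bbb" and "value3 ccc", a gap of 10 and a slop of 1, the pair value1/aaa still matches because both terms sit in the same value, while aaa/value2 does not, because the gap pushes the two values further apart than the slop allows. With a gap of 0 the second pair would match across the value boundary.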

Re: Realtime search best practices

2009-10-12 Thread Jason Rutherglen
Hi Cedric, There is a wiki page on NRT at: http://wiki.apache.org/lucene-java/NearRealtimeSearch Feel free to ask questions if there's not enough information. -J On Mon, Oct 12, 2009 at 2:24 AM, melix cedric.champ...@lingway.com wrote: Hi, I'm going to replace an old reader/writer

Re: faceted search performance

2009-10-12 Thread Paul Elschot
Chris, You could also store term vectors for all docs at indexing time, and add the termvectors for the matching docs into a (large) map of terms in RAM. Regards, Paul Elschot On Monday 12 October 2009 21:30:48 Christoph Boosz wrote: Hi Jake, Thanks for your helpful explanation. In fact,
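The idea of adding the term vectors of matching docs into one in-memory map can be sketched in plain Java. This is only an illustration under stated assumptions: the nested map stands in for term vectors that would really be read from the index, and TermCountSketch and countTerms are hypothetical names, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: summing per-document term frequencies over the hits. */
final class TermCountSketch {

    /**
     * termVectors maps a doc ID to that doc's term -> frequency map (a stand-in
     * for stored term vectors). Returns total frequency per term across the
     * matching docs; doc IDs without a vector are skipped.
     */
    static Map<String, Integer> countTerms(
            Map<Integer, Map<String, Integer>> termVectors, int[] matchingDocs) {
        Map<String, Integer> totals = new HashMap<>();
        for (int doc : matchingDocs) {
            Map<String, Integer> vector = termVectors.get(doc);
            if (vector == null) {
                continue;
            }
            for (Map.Entry<String, Integer> entry : vector.entrySet()) {
                // Add this doc's frequency to the running total for the term.
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }
}
```

The resulting map grows with the number of distinct terms in the matching docs, so memory use depends on how many hits are visited and how varied their vocabulary is.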

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
I agree, the javadocs could be improved. How about something like this for the first 2 paragraphs: * Returns a readonly reader, covering all committed as * well as un-committed changes to the index. This * provides near real-time searching, in that changes * made during an

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix jake.man...@gmail.com wrote: It may be surprising, but in fact I have read that javadoc. It was not your email I responded to. It talks about not needing to close the writer, but doesn't specifically talk about what the relationship between

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
That seems a lot more straightforward Mike, thanks. -jake On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless luc...@mikemccandless.com wrote: I agree, the javadocs could be improved. How about something like this for the first 2 paragraphs: * Returns a readonly reader, covering all

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
OK I just committed it -- thanks! Mike On Mon, Oct 12, 2009 at 5:01 PM, Jake Mannix jake.man...@gmail.com wrote: That seems a lot more straightforward Mike, thanks.  -jake On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless luc...@mikemccandless.com wrote: I agree, the javadocs could be

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix jake.man...@gmail.com wrote: It may be surprising, but in fact I have read that javadoc. It was not your email I responded to. Sorry, my bad then - you said guys

Re: Realtime search best practices

2009-10-12 Thread John Wang
I think it was my email Yonik responded to and he is right, I was being lazy and didn't read the javadoc very carefully. My bad. Thanks for the javadoc change. -John On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix

Re: faceted search performance

2009-10-12 Thread Christoph Boosz
Hi Paul, Thanks for your suggestion. I will test it within the next few days. However, due to memory limitations, it will only work if the number of hits is small enough, am I right? Chris 2009/10/12 Paul Elschot paul.elsc...@xs4all.nl Chris, You could also store term vectors for all docs

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
I still see some things we might want to document or explain: We still need to be careful what the call to isCurrent() will mean in the future for IndexReaders - as now there is another kind of current - current even up to uncommitted changes. Imagine the following set of IndexReaders floating

Re: Realtime search best practices

2009-10-12 Thread melix
Ok, thanks for the details. I see I'm not the only one finding the javadoc hard to understand. While this is well documented, it's still not clear enough about the exact semantics of changes: at first I thought it returned an IndexReader on the *uncommitted changes only*, which meant it did not

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Good point on isCurrent - I think it should only be with respect to the latest index commit point? and we should clarify that in the javadoc. [...] // but what does the nrtReader say? // it does not have access to the most recent commit // state, as there's been a commit (with documents) //

RE: querying multi-value fields

2009-10-12 Thread Angel, Eric
Erick, Thank you. This is awesome. I got it to work by just setting slop to 1 and returning 10 in my analyzer.getPositionIncrementGap. Here are my tests in case anyone else is interested: public class TestPositionIncrementGap extends TestCase { Analyzer analyzer = new

Using TermVectorMapper to compute term frequency across documents

2009-10-12 Thread Thomas D'Silva
Hi, I am trying to compute the counts of terms of the documents returned by running a query using a TermVectorMapper. I was wondering if anyone knew if there was a faster way to do this rather than using a HashMap with a TermVectorMapper to store the counts of the terms and calling