Re: Detection of index duplicates in Lucene

2007-07-30 Thread Michael Stoppelman
A couple of thoughts here... You could hash (e.g. MD5) all the documents in your index and eliminate duplicates that way. Just pick one of the docs in the hash bucket as the non-dup document and then delete the other dups. This could be run as a batch job to eliminate the duplicates in an off-line
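The hashing idea above can be sketched with the JDK alone. This is a minimal illustration, not code from the thread: it assumes each document is available as a plain string and is identified by its list position, and the names `HashDedup`, `md5Hex`, and `findDuplicates` are hypothetical. It only catches exact duplicates, and deleting the reported ids from a real Lucene index is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: exact-duplicate detection by content hash.
final class HashDedup {
    /** Returns the MD5 digest of the text as a lowercase hex string. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Keeps the first id seen for each hash; returns the ids of later duplicates. */
    static List<Integer> findDuplicates(List<String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            String hash = md5Hex(docs.get(id));
            // putIfAbsent returns the earlier id when the hash was already recorded.
            if (firstSeen.putIfAbsent(hash, id) != null) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }
}
```

In a batch job, the returned ids would be the candidates to delete, with the first document in each hash bucket kept as the non-dup.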

Re: Strange Error while deleting Documents from index while indexing.

2007-07-30 Thread Chris Hostetter
: Where shall I post this issue? You are currently posting to a list named java-user; this is for user-related questions about the Java Lucene project. If you have questions about Lucene.Net, you should be asking them on the Lucene.Net user list... http://incubator.apache.org/lucene.net/

Re: Bug in Lucene 2.2.0 code? Simple code included (StringIndexOutOfBoundsException).

2007-07-30 Thread Mark Miller
Hey Lukas, I was being simplistic when I said that the text and TokenStream must be exactly the same. It's difficult to think of a reason why you would not want them to be the same though. Each Token records the offsets where it can be found in the original text -- that is how the Highlighter

Re: Detection of index duplicates in Lucene

2007-07-30 Thread Grant Ingersoll
I believe Nutch has a duplicate detection algorithm. I don't know how easy it would be to run independently on a Lucene index. -Grant On Jul 29, 2007, at 2:18 AM, Dmitry wrote: We are trying to find any implementation for Lucene that detects index duplicates. Assuming we have a set of

Re: Detection of index duplicates in Lucene

2007-07-30 Thread karl wettin
On 30 Jul 2007 at 14:43, Grant Ingersoll wrote: I believe Nutch has a duplicate detection algorithm. I don't know how easy it would be to run independently on a Lucene index. There have also been a bunch of near-duplicate ideas that have been presented on the forums before. This is one of

LUCENE-843 Release

2007-07-30 Thread testn
Hi guys, Do you think LUCENE-843 is stable enough? If so, do you think it's worth releasing it, perhaps with Lucene 2.2.1? It would be nice so that people can take advantage of it right away without risking other breaking changes in the HEAD branch or waiting until the 2.3 release. Thanks, --

Indexing/Analyzer question - case-insensitive contains search

2007-07-30 Thread Joe Attardi
Hi everyone, I told you I'd be back with more questions! :-) Here is my situation. In my application, the field to be searched is selected via a drop-down box. I want my searches to basically be contains searches - I take what the user typed in, put a wildcard character at the beginning and end,

RE: Indexing/Analyzer question - case-insensitive contains search

2007-07-30 Thread Ard Schrijvers
Hello, Hi everyone, I told you I'd be back with more questions! :-) Here is my situation. In my application, the field to be searched is selected via a drop-down box. I want my searches to basically be contains searches - I take what the user typed in, put a wildcard character at the

Re: How to show category count with results?

2007-07-30 Thread Erick Erickson
You might want to search the mail archive for facets or faceted search (no quotes), as I *think* this might be relevant. Best Erick On 7/26/07, Ramana Jelda [EMAIL PROTECTED] wrote: Hi , Of course this statement is very expensive. --document.get(CAMPCATID)==null?:document.get(CAMPCATID);

Re: LUCENE-843 Release

2007-07-30 Thread Peter Keegan
I've built a production index with this patch and done some query stress testing with no problems. I'd give it a thumbs up. Peter On 7/30/07, testn [EMAIL PROTECTED] wrote: Hi guys, Do you think LUCENE-843 is stable enough? If so, do you think it's worth to release it with probably LUCENE

Re: Size of field?

2007-07-30 Thread Erick Erickson
See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default max field length, last I knew, was 10,000. But this sounds like it might relate to your issue. Best Erick On 7/27/07, Eduardo Botelho [EMAIL PROTECTED] wrote: Hi guys, I would like to know if there is some size limit for

Re: Indexing/Analyzer question - case-insensitive contains search

2007-07-30 Thread Joe Attardi
It does sound very strange to me, to default to a WildcardQuery! Suppose I am looking for bold, I am getting hits for old. I know - but that's what the requirements dictate. A better example might be a MAC or IP address, where someone might be searching for a string in the middle - like, I

Re: Search terms on a single instance of field

2007-07-30 Thread Rafael Rossini
Hey Jeff, I didn't have any luck. I don't think your approach is going to help me; thanks for the reply. I'll try a solution that avoids this kind of problem. []s Rossini On 7/29/07, Jeff French [EMAIL PROTECTED] wrote: Rossini, have you had any luck with this? I don't know if

Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Joe Attardi
Following up on my recent question. It has been suggested to me that I can run the query text through an Analyzer without using the QueryParser. For example, if I know which field is to be searched, I can create a PrefixQuery or WildcardQuery, but still want to process the search text with the same

RE: How to show category count with results?

2007-07-30 Thread Ard Schrijvers
Or check out Solr and see if you can use that, or see how they do it, Regards Ard You might want to search the mail archive for facets or faceted search (no quotes), as I *think* this might be relevant. Best Erick On 7/26/07, Ramana Jelda [EMAIL PROTECTED] wrote: Hi , Of

RE: Indexing/Analyzer question - case-insensitive contains search

2007-07-30 Thread Ard Schrijvers
It does sound very strange to me, to default to a WildcardQuery! Suppose I am looking for bold, I am getting hits for old. I know - but that's what the requirements dictate. A better example might be a MAC or IP address, where someone might be searching for a string in the middle -

Re: How to show category count with results?

2007-07-30 Thread Dennis Kubes
We found that a fast way to do this is simply to run a query for each category and get the maxDocs. There would be one query per category, getting a single hit. Dennis Kubes Erick Erickson wrote: You might want to search the mail archive for facets or faceted search (no quotes), as I

Question regarding boolean query

2007-07-30 Thread Sonu SR
Hi, I am getting different results for the following queries. 1. ABST:spring-elastic^3 AND SPEC:internal combustion^2 OR ABST:cylinder^3 2. SPEC:internal combustion^2 AND ABST:spring-elastic^3 OR ABST:cylinder^3 I think the above two queries are similar and will give the same results.

Tokenizer

2007-07-30 Thread John Paul Sondag
I have two questions. First, Is there a tokenizer that takes every word and simply makes a token out of it? So it looks for two white spaces and takes the characters between them and makes a token out of them? If this tokenizer exists, is there a difference between doing that and simply storing

RE: Tokenizer

2007-07-30 Thread Ard Schrijvers
Hello, I have two questions. First, Is there a tokenizer that takes every word and simply makes a token out of it? org.apache.lucene.analysis.WhitespaceTokenizer So it looks for two white spaces and takes the characters between them and makes a token out of them? If this tokenizer
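As a rough illustration of what splitting on whitespace means, here is a JDK-only sketch. It is not Lucene's WhitespaceTokenizer: the class name `WhitespaceSplit` and its `tokenize` method are hypothetical, and the regex class `\s` may not match exactly the same characters that Lucene treats as whitespace. The point is only that tokens are the runs of characters between whitespace, with no lowercasing and no punctuation stripping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper approximating whitespace-only tokenization.
final class WhitespaceSplit {
    /** Splits text at runs of whitespace and returns the non-empty pieces in order. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            // Leading whitespace produces one empty first piece; skip it.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

Note that punctuation stays attached to its word, so "Hello," and "Hello" would be different tokens under this scheme.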

Re: How to show category count with results?

2007-07-30 Thread Dima May
Check this out: http://www.gossamer-threads.com/lists/lucene/java-user/35433?search_string=category;#35433 On 7/30/07, Dennis Kubes [EMAIL PROTECTED] wrote: We found that a fast way to do this is simply to run a query for each category and get the maxDocs. There would be one query for

RE: Question regarding boolean query

2007-07-30 Thread Renaud Waldura
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer. http://wiki.apache.org/lucene-java/BooleanQuerySyntax The best practice appears to be to require parens everywhere to force the evaluation order. Not very satisfying, but it does work 100%. -Original Message- From:

Re: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Erick Erickson
Would this work? TokenStream ts = new StandardAnalyzer().tokenStream(fieldName, new StringReader(queryText)); Token tok; while ((tok = ts.next()) != null) { /* do whatever */ } Best Erick On 7/30/07, Joe Attardi [EMAIL PROTECTED] wrote: Following up on my recent question. It has been suggested to me that I can run the query text through an

Re: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Joe Attardi
So then would I just concatenate the tokens together to form the query text? -- Joe Attardi [EMAIL PROTECTED] http://thinksincode.blogspot.com/ On 7/30/07, Erick Erickson [EMAIL PROTECTED] wrote: Would this work? TokenStream ts = StandardAnalyzer.tokenStream(); while ((Token tok =

RE: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Ard Schrijvers
So then would I just concatenate the tokens together to form the query text? You might be better off creating a TermQuery for each token instead of concatenating, and combining them in a BooleanQuery, saying whether all terms must or should occur. Very simple, see [1] Regards Ard [1]

a question for french analyzer

2007-07-30 Thread Chris Lu
Hi, I am not a French speaker, but here are some questions regarding the French analyzer: Is there any analyzer that can do this? Map accented letters to the corresponding unaccented letters (é, è, ê, ë -> e), so that a search for fenêtre (= window) finds all docs with fenêtre or fenetre and search

Re: a question for french analyzer

2007-07-30 Thread Erick Erickson
Gosh, I sure hope not, because that would mean that we rolled our own for no good reason. We wound up just collapsing the input stream by substituting plain old 'e' for all the accented variants before indexing and before searching. Be *really* careful what character set you're using. Actually,
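The substitution described above, applied both before indexing and before searching, can be sketched with the JDK's `java.text.Normalizer`. This is an assumption-laden illustration rather than the code Erick's team wrote: the class `AccentFold` and method `fold` are hypothetical names, and the explicit ligature replacements are an added guess at what a French corpus would need, since ligatures have no canonical decomposition.

```java
import java.text.Normalizer;

// Hypothetical helper: fold accented Latin letters to their unaccented base letters.
final class AccentFold {
    static String fold(String text) {
        // Ligatures (U+0153, U+0152, U+00E6, U+00C6) do not decompose under NFD,
        // so they are expanded by hand first.
        String expanded = text
                .replace("\u0153", "oe").replace("\u0152", "OE")
                .replace("\u00e6", "ae").replace("\u00c6", "AE");
        // NFD splits each accented letter into base letter plus combining marks.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category M), leaving the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

The same function has to run on the query text as on the indexed text, otherwise fenêtre and fenetre will not meet in the index. As the warning about character sets suggests, this only works once the input has been decoded correctly into Java strings.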

RE: a question for french analyzer

2007-07-30 Thread Samir Abdou
Hi, Take a look at the class ISOLatin1AccentFilter! Add this to your analyzer and it should work! Hope this helps, Samir -Original Message- From: Chris Lu [mailto:[EMAIL PROTECTED] Sent: Monday, 30 July 2007 20:06 To: java-user@lucene.apache.org Subject: a question for

Maximum phrase query?

2007-07-30 Thread Max Metral
I have a set of tags associated with content in my corpus. I also have normal text. Our system tries to figure out which words are tags and which are text, and falls back on text when tags fail. I'm wondering, is there anything in Lucene which might help disambiguate multi-word tags from text?

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
Thanks for the reply Erick, I believe it is the gc for four reasons: - I've tried the warmup approach already and it didn't change the situation. - The server completely pauses for several seconds. I run jstack to find out where the pause is, and it also pauses for several seconds before

Re: java gc with a frequently changing index?

2007-07-30 Thread Mark Miller
I believe there is an issue in JIRA that handles reopening an IndexReader without reopening segments that have not changed. On 7/30/07, Tim Sturge [EMAIL PROTECTED] wrote: Thanks for the reply Erick, I believe it is the gc for four reasons: - I've tried the warmup approach already and it

Re: java gc with a frequently changing index?

2007-07-30 Thread Mark Miller
And by the way, I cannot see it ever making sense to keep reopening an index reader every second or so. It has to be MUCH more efficient to even wait every 2 or 4 seconds...even that is going to be pretty nasty, but you have to allow for a bit of batch man. You will waste so much time opening

Re: a question for french analyzer

2007-07-30 Thread Chris Lu
Hi, Erick, I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip, and it works great! And I think it's the right way to go. Problems like "You have to store the data raw for display purposes if you want the accents to show though" will go away since Analyzer already have the

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
Oh, yeah, I know now :-). But I really do have a requirement to show search results from items that came in 5 seconds ago. We have an application where a common usage pattern is: add an item, navigate to another item, search for the first item (to associate it with the second item), and the gap

RE: a question for french analyzer

2007-07-30 Thread Renaud Waldura
Being a French speaker, I will mention the following special cases: - plus ça change - plus ca change - œuf - oeuf - lætitia - laetitia But I just looked, and it looks like ISOLatin1AccentFilter handles these. Better test to be sure... --Renaud -Original Message- From: Chris Lu

Re: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Joe Attardi
What about the case where I want to search a MAC address? For example, 00:14:da:81:21:4f will be split by the StandardTokenizer as the tokens 00, 14, da, 81, 21, and 4f. Suppose I want to search for 00:14:da:81:21:4f. In the search box, I type 00:14:da:81:21:4f. But because these are all separate

Re: java gc with a frequently changing index?

2007-07-30 Thread Kay Roepke
Hi Tim! On Jul 25, 2007, at 8:41 PM, Tim Sturge wrote: I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want to be right up to date (ideally within a second but within 5 seconds

Re: Maximum phrase query?

2007-07-30 Thread Erick Erickson
not that I know of Erick On 7/30/07, Max Metral [EMAIL PROTECTED] wrote: I have a set of tags associated with content in my corpus. I also have normal text. Our system tries to figure out which words are tags and which are text, and falls back on text when tags fail. I'm wondering,

Re: Running query text through an Analyzer without QueryParser?

2007-07-30 Thread Erick Erickson
SpanNearQuery(SpanQuery[] clauses, int slop,

Re: a question for french analyzer

2007-07-30 Thread Erick Erickson
However, is there any special case that you have? Yes, the character set we use is, as I remember, MARC-8. Which I don't think is the ISOLatin, but since I didn't know about that filter when we had our problem, I didn't even look. Oh well, smarter/braver/lazier next time G... Which is why I

High CPU usage during index and search

2007-07-30 Thread Chew Yee Chuang
Greetings All, I have been trying out Lucene recently and am very happy with the search performance. But I just noticed that when Lucene is performing a search or indexing, the CPU usage on my machine rises to 100%; because of this, some of my other backend processes eventually slow down. Just

Problem in Lucene

2007-07-30 Thread Srinivasarao Vundavalli
Hi, I am using a Nutch index to search in Lucene. One of my classes uses the makeStopTable method (which is deprecated) of class StopFilter in org.apache.lucene.analysis. When I run my program with Lucene 2.1.0 ~/j2sdk1.4.2/bin/java -classpath .:lucene-core-2.1.0.jar SearchFiles Exception in