RE: Hits.score mystery

2007-11-01 Thread Tom Conlon
Hi Grant, > but you should have a look at Searcher.explain() I was half-expecting this answer. :( The query is very basic and the scoring seems completely arbitrary. Documents with the same number of occurrences and (seemingly) the same distribution are being given widely different scores. Chris

Re: Hits.score mystery

2007-11-01 Thread Daniel Naber
On Wednesday 31 October 2007 19:14, Tom Conlon wrote: 119.txt 17.865013 97% (13 occurrences) 45.txt 8.600986 47% (18 occurrences) 45.txt might be a document with more terms, so that its score is lower although it contains more matches. Regards Daniel --
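With DefaultSimilarity, this is length normalization at work:

    lengthNorm(field) = 1 / sqrt(numTerms)

so each match in a longer document contributes less to the final score.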

RE: Hits.score mystery

2007-11-01 Thread Tom Conlon
Thanks Daniel, I'm using Searcher.explain() and Luke to try to understand the reasons for the score.
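For anyone following along, a minimal sketch of driving Searcher.explain() from the Lucene 2.x Hits API (the index path and field names here are made up for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ExplainScores {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Query query = new TermQuery(new Term("contents", "test"));
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length() && i < 10; i++) {
                // Explanation breaks the score down into tf, idf,
                // boost and norm factors, per matching term.
                Explanation explanation = searcher.explain(query, hits.id(i));
                System.out.println(hits.doc(i).get("path")
                        + " score=" + hits.score(i));
                System.out.println(explanation.toString());
            }
            searcher.close();
        }
    }

The output shows exactly which factor (often the field norm) makes two documents with the same term count score so differently.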

Question regarding proximity search

2007-11-01 Thread Sonu SR
Hi, I am confused about proximity search. I am getting different results for the queries TTL:"test device"~2 and TTL:"device test"~2. I expected the same result for these two queries. Does the position of the terms matter in a proximity query? Can anybody help me understand how Lucene exactly handles

Re: Question regarding proximity search

2007-11-01 Thread Daniel Naber
On Thursday 01 November 2007 10:45, Sonu SR wrote: I am confused about proximity search. I am getting different results for the queries TTL:"test device"~2 and TTL:"device test"~2 Order is significant; this is described here:
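If order-independent proximity is what's wanted, one workaround (a sketch, not something suggested in the thread) is to skip QueryParser and build a SpanNearQuery with inOrder set to false:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Matches "test" and "device" within two positions of each
    // other in the TTL field, in either order.
    SpanQuery test = new SpanTermQuery(new Term("TTL", "test"));
    SpanQuery device = new SpanTermQuery(new Term("TTL", "device"));
    SpanNearQuery near = new SpanNearQuery(
            new SpanQuery[] { test, device },
            2,      // slop: at most two positions apart
            false); // inOrder = false: either order matches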

RE: Hits.score mystery

2007-11-01 Thread Tom Conlon
The reason seems to be that I needed to implement an analyzer that lowercases terms while *not* ignoring trailing characters such as # and + (i.e. I needed to match C# and C++). public final class LowercaseWhitespaceAnalyzer extends Analyzer { public TokenStream tokenStream(String
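The class is cut off above; a minimal sketch of what it could look like against the Lucene 2.x API, assuming the usual WhitespaceTokenizer-plus-LowerCaseFilter combination:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public final class LowercaseWhitespaceAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // WhitespaceTokenizer splits on whitespace only, so "C#" and
            // "C++" survive as single tokens; LowerCaseFilter then maps
            // them to "c#" and "c++" so matching is case-insensitive.
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }

Note that queries must be analyzed the same way, or "C#" typed into a QueryParser built on StandardAnalyzer will still lose its trailing #.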

Re: Best way to count tokens

2007-11-01 Thread Cool Coder
This is what I am looking for: counting tokens prior to adding them to the index, so that I can display on my site the 10 tokens with the maximum occurrences in my index. In other words, users can add weight to these terms. - BR

Re: Best way to count tokens

2007-11-01 Thread Karl Wettin
1 nov 2007 kl. 18.09, Cool Coder wrote: > prior to adding into index The easiest way out would be to add the document to a temporary index and extract the term frequency vector. I would recommend using MemoryIndex. You could also tokenize the document and pass the data to a TermVectorMapper.
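A sketch of the MemoryIndex route (MemoryIndex lives in contrib; the field name and sample text are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.IndexSearcher;

    // Add the text to a throwaway in-memory index, then read the term
    // frequency vector back out without touching the real index.
    MemoryIndex index = new MemoryIndex();
    index.addField("content", "some document text to count", new StandardAnalyzer());
    IndexSearcher searcher = index.createSearcher();
    IndexReader reader = searcher.getIndexReader();
    TermFreqVector vector = reader.getTermFreqVector(0, "content");
    String[] terms = vector.getTerms();         // unique terms, sorted
    int[] freqs = vector.getTermFrequencies();  // parallel frequencies
    for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + ": " + freqs[i]);
    }

Sorting the terms by frequency and taking the top ten gives the display list asked about above.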

Re: Best way to count tokens

2007-11-01 Thread Cool Coder
Currently I have extended StandardAnalyzer and am counting tokens in the following way, but the index is not getting created, even though I call tokenStream.reset(). I am not sure whether reset() on the token stream works or not. I am debugging now. public TokenStream tokenStream(String fieldName,

Re: problem undestanding the hits.score

2007-11-01 Thread Erick Erickson
What leads you to expect that ordering? Scoring in Lucene is NOT simply counting the number of times a word appears. That said, I really have no clue how the scoring algorithm works since it's always been good enough for me. But if you search the mail archive for scoring, you'll find a wealth of
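For the curious, the formula is documented in the Similarity javadocs; with DefaultSimilarity it works out to roughly:

    score(q,d) = coord(q,d) * queryNorm(q)
                 * sum over terms t in q of
                   ( tf(t in d) * idf(t)^2 * boost(t) * norm(t,d) )

    where tf(t in d) = sqrt(freq),
          idf(t)     = 1 + ln(numDocs / (docFreq + 1)),
          norm(t,d)  includes lengthNorm = 1 / sqrt(numTerms)

so raw term frequency is only one factor among several.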

Re: Hits.score mystery

2007-11-01 Thread Erick Erickson
Well, you might have to pre-process your strings before you give them to an analyzer, or roll your own analyzer. What you're asking for, in effect, is an analyzer that does "exactly what I want it to, nothing more and nothing less". But the problem is that there is nothing general about what you

Re: problem undestanding the hits.score

2007-11-01 Thread Mark Miller
There are many factors that go into scoring. Erick gave a nice link that will help you out. Also, check out Searcher.explain(). That will tell you how your score was computed. To give you a start, shorter fields are normally preferred... finding a keyword in a short title is usually more

Re: Hits.score mystery

2007-11-01 Thread Mark Miller
One of many options is to copy the StandardAnalyzer but change it so that + and # are considered letters. Just add + and # to the LETTER definition in the JavaCC file if you are using a release, or the JFlex file if you are working off trunk (you're probably using a release, but the new JFlex
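If regenerating the tokenizer from the grammar is more trouble than it's worth, a different route (not the grammar edit described above) is a small CharTokenizer subclass that widens what counts as a letter; the class name here is made up:

    import java.io.Reader;
    import org.apache.lucene.analysis.CharTokenizer;

    public class CodeTermTokenizer extends CharTokenizer {
        public CodeTermTokenizer(Reader input) {
            super(input);
        }
        // Treat '+' and '#' as token characters so "C++" and "C#"
        // come through as single tokens.
        protected boolean isTokenChar(char c) {
            return Character.isLetterOrDigit(c) || c == '+' || c == '#';
        }
        // Lowercase during tokenization, replacing a separate filter.
        protected char normalize(char c) {
            return Character.toLowerCase(c);
        }
    }

The trade-off is losing StandardTokenizer's special handling of acronyms, e-mail addresses, hostnames and the like.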

Re: Best way to count tokens

2007-11-01 Thread Mark Miller
reset() is optional; StandardAnalyzer does not implement it. Check out CachingTokenFilter and wrap StandardAnalyzer in it.
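A sketch of that idea against the Lucene 2.x token API (field name and sample text are placeholders): count tokens on a first pass, then reset() rewinds the cache so the same stream can be replayed, e.g. handed to the indexer.

    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class CountThenIndex {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // CachingTokenFilter records tokens as they stream past,
            // so the stream can be rewound after the counting pass.
            CachingTokenFilter cached = new CachingTokenFilter(
                    analyzer.tokenStream("body",
                            new StringReader("test device test")));

            Map<String, Integer> counts = new HashMap<String, Integer>();
            Token token;
            while ((token = cached.next()) != null) {
                String term = token.termText();
                Integer n = counts.get(term);
                counts.put(term, n == null ? 1 : n + 1);
            }
            cached.reset(); // rewind; the cached tokens replay from the start
            System.out.println(counts); // {device=1, test=2}
        }
    }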

Parsing text containing forward slash and wildcard

2007-11-01 Thread Vs_Inf
Hi, Using StandardAnalyzer, when we indexed the text /123xcv, QueryParser.parse() produced 123xcv. During searching with the same analyzer, parsing a search text of /123 produced 123, but parsing /123* produces /123*. How can I get a parser output of 123* when parsing /123*? Thanks for your
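One possible workaround (a sketch, not a confirmed answer from the list): wildcard terms bypass the analyzer entirely, which is why the slash survives only in /123*. QueryParser's getWildcardQuery() hook can strip it by hand:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    QueryParser parser = new QueryParser("contents", new StandardAnalyzer()) {
        protected Query getWildcardQuery(String field, String termStr)
                throws ParseException {
            // Strip the leading slash that the analyzer would have
            // removed, since wildcard terms skip analysis.
            if (termStr.startsWith("/")) {
                termStr = termStr.substring(1);
            }
            return super.getWildcardQuery(field, termStr);
        }
    };
    System.out.println(parser.parse("/123*")); // contents:123*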