question about field equality in query

2007-04-11 Thread Mohammad Norouzi
Hi, is it possible (or is there a trick) to search with a query that requires two fields to be equal? For example: Document: field1 field2 field3 field4 Query: field1:test phrase AND field2:test AND field3:field4 in this query we said that do

Re: How to access Levenstein distance number?

2007-04-11 Thread Grant Ingersoll
Have you looked at the explains to see what is coming out of the FuzzyQuery? Also, are you using Hits to get that score? Scores get normalized to 1 by that process. -Grant On Apr 11, 2007, at 2:06 AM, Michael Barbarelli wrote: Hello. I am using Lucene to submit fuzzy queries against an

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-11 Thread karl wettin
11 apr 2007 kl. 04.21 skrev Grant Ingersoll: Would some sort of caching strategy work? How big is your overall collection? Also, lately there have been a few threads on TV (term vector) performance. I don't recall anyone having actively profiled or examined it for improvements, so

Re: Issue with : Searcher.search() returning Hits of same length for different searches

2007-04-11 Thread Erick Erickson
Well, there's nothing here to help you with, since you haven't provided any information to diagnose. Like: What queries are actually produced in the different cases? Use query.toString(). I'm immediately suspicious of any statement that my custom code shouldn't be the problem. Try the test

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-11 Thread Grant Ingersoll
On Apr 11, 2007, at 9:07 AM, karl wettin wrote: 11 apr 2007 kl. 04.21 skrev Grant Ingersoll: Would some sort of caching strategy work? How big is your overall collection? Also, lately there have been a few threads on TV (term vector) performance. I don't recall anyone having actively

Re: How to access Levenstein distance number?

2007-04-11 Thread Michael Barbarelli
Hi Grant. Yes, I'm getting the score from the Hits collection. And yes, they get normalized to 1; which is what I don't want. Or, I can leave the Hits objects as is, but I know Lucene also must calculate a raw difference as part of the overall score calculation. How can I get at that value?
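
A raw score is still a relevance score rather than an edit distance. If the distance itself is what is wanted, one option is to compute it directly from the query term and the matched term. The following is a minimal sketch in plain Java; it is not Lucene's internal implementation, and the class and method names (`EditDistance`, `levenshtein`) are hypothetical:

```java
class EditDistance {
    /** Classic dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters from a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `EditDistance.levenshtein("kitten", "sitting")` returns 3.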

Re: How to access Levenstein distance number?

2007-04-11 Thread Erick Erickson
Go for a HitCollector. In particular, TopDocs will give you the raw scores. Erick On 4/11/07, Michael Barbarelli [EMAIL PROTECTED] wrote: Hi Grant. Yes, I'm getting the score from the Hits collection. And yes, they get normalized to 1; which is what I don't want. Or, I can leave the Hits

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-04-11 Thread Daniel Einspanjer
Not really. The explain scores aren't normalized and I also couldn't find a way to get the explain data as anything other than a whitespace formatted text blob from Solr. Keep in mind that they need confidence factors from one query to the next. With the explain scores, they can have wildly

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-04-11 Thread Daniel Einspanjer
Oh geeze. Gmail ripped my pretty table to shreds. Let me try again: A -- id title title score director director score year year score overall score B

Re: Standard Parser Behavior

2007-04-11 Thread Walt Stoneburner
Mike Klaas elaborates on syntax: +(-A +B) - must match (-A +B) - must contain B and must not contain A -(-A +B) - must not match (-A +B) - must not (match B and not contain A) Ok, the take-away from this I'm getting is that these clauses read very much like English and behave just the same.

Re: How to access Levenstein distance number?

2007-04-11 Thread Michael Barbarelli
Thank you Erick! Will give it a shot! On 4/11/07, Erick Erickson [EMAIL PROTECTED] wrote: Go for a HitCollector. In particular, TopDocs will give you the raw scores. Erick On 4/11/07, Michael Barbarelli [EMAIL PROTECTED] wrote: Hi Grant. Yes, I'm getting the score from the Hits

Lucene BOF @ apachecon.eu ?

2007-04-11 Thread Sami Siren
Are there any plans to put together a Lucene BOF at Amsterdam? -- Sami Siren

Re: Lucene BOF @ apachecon.eu ?

2007-04-11 Thread Mathias Herberts
+1

Re: Issue with : Searcher.search() returning Hits of same length for different searches

2007-04-11 Thread Lokeya
Thanks for your reply. I should have given more information and will keep this in mind for my future queries. Regarding this one, I have already done most of the things you asked, like: 1. I am confirming what query is getting executed by using query.toString() 2. I read a lot of posts in the forum

Re: strange idf in Lucene 2.1

2007-04-11 Thread Yonik Seeley
On 4/11/07, Koji Sekiguchi [EMAIL PROTECTED] wrote: In the program, I added these three documents to the index, then deleted all of them, and then added them to the index on purpose. If I optimize the index, idf gets into 1.0 with Lucene 2.1 (uncomment in the program). Is it a feature?

strange idf in Lucene 2.1

2007-04-11 Thread Koji Sekiguchi
Hello, I have the following three documents in my index: - Java programming is required to write Lucene application. - Java is a popular computer language. I like Java. - Perl is not a kind of jewelry. It is a programming language. With Lucene 2.0, if I search java and print explanation, the
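
For readers following the idf numbers in this thread, here is a small sketch of the idf formula, assuming the classic default similarity `ln(numDocs / (docFreq + 1)) + 1`. The class name `IdfSketch` is hypothetical, and the idea that deleted but not yet merged documents still inflate the counts is an assumption about the cause, not something the preview text confirms:

```java
class IdfSketch {
    /** idf under the assumed classic formula: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }
}
```

With three live documents of which two contain "java", `idf(2, 3)` is `ln(3/3) + 1`, which is exactly 1.0. If the three deleted copies were still counted, `idf(4, 6)` would be `ln(6/5) + 1`, which is slightly above 1.0.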

Re: Standard Parser Behavior

2007-04-11 Thread Chris Hostetter
: here is that it's not that I'm finding different documents, but rather it's : the same set and they will be ranked differently. : : Can you point me at a resource that explains the ranking and coord factors? : I'm trying to understand scoring better. Going to the BooleanQuery The best

Re: Issue with : Searcher.search() returning Hits of same length for different searches

2007-04-11 Thread Daniel Naber
On Wednesday 11 April 2007 18:51, Lokeya wrote: Thanks for your reply. I should have given more information and will keep in mind this for my future queries. If nothing else helps, please write a small, standalone test-case that shows the problem. This can then easily be debugged by someone

Term frequency

2007-04-11 Thread sai hariharan
Hi, I've just started using Lucene. Can anybody assist me in calculating the term frequencies of the terms (words) that occur in a document (*.txt) when a particular doc is submitted? Say when I submit sample.txt, I should first analyze the document with a standard analyzer, then the term

Unicode Normalization

2007-04-11 Thread David Woodward
Hi. I have encountered a problem searching in my application because of inconsistent unicode normalization forms in the corpus (and the queries). I would like to normalize to form NFKD in an analyzer (I think). I was thinking about creating a filter similar to the lowercasefilter that would do

Re: Term frequency

2007-04-11 Thread Grant Ingersoll
Add Term Vectors to your Field during indexing. See the Field constructors. To get a Term Vector out, see IndexReader.getTermFreqVector method. -Grant On Apr 11, 2007, at 3:23 PM, sai hariharan wrote: Hi, I've just started using Lucene. Can anybody assist me in calculating the term
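
To show what a per-document term frequency table looks like, here is a minimal sketch in plain Java. It does not use Lucene: the tokenizer is a simple lowercase split on non-letter, non-digit characters, which only approximates what a real analyzer would do, and the names `TermFrequency` and `count` are hypothetical:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TermFrequency {
    /** Counts how often each lowercased token occurs in the given text. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>(); // sorted by term for stable output
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

For the text "Java is a popular computer language. I like Java.", the entry for "java" is 2 and the entry for "language" is 1.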

Turning PrefixQuery into a TermQuery

2007-04-11 Thread Steffen Heinrich
Hello Lucene users, I'm rather new to lucene and java but have done work with other search engines some time before. Right now I'm trying my hands (and luck) on a 'search as you type'- sort of high performance search a la GoogleSuggest. There meanwhile are on the net, a number of examples for

Re: Turning PrefixQuery into a TermQuery

2007-04-11 Thread Antony Bowesman
Steffen Heinrich wrote: Normally an IndexWriter uses only one default Analyzer for all its tokenizing businesses. And while it is apparently possible to supply a certain other instance when adding a specific document there seems to be no way to use different analyzers on different fields

Re: Turning PrefixQuery into a TermQuery

2007-04-11 Thread Erick Erickson
Rather than using a search, have you thought about using a TermEnum? It's much, much, much faster than a query. What it allows you to do is enumerate the terms in the index on a per-field basis. Essentially, this is what happens when you do a PrefixQuery as BooleanClauses are added, but you have
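
The idea of walking a sorted term list from a prefix, rather than running a query, can be sketched in plain Java. This is an analogy using `java.util.TreeSet`, not Lucene's TermEnum API; the names `PrefixTerms` and `withPrefix` and the `limit` parameter are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

class PrefixTerms {
    /** Returns up to limit terms that start with prefix, in sorted order. */
    static List<String> withPrefix(SortedSet<String> terms, String prefix, int limit) {
        List<String> out = new ArrayList<>();
        // tailSet seeks to the first term >= prefix; matching terms are contiguous from there.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || out.size() >= limit) {
                break; // past the prefix range, or enough suggestions collected
            }
            out.add(term);
        }
        return out;
    }
}
```

Because the terms are sorted, the scan stops at the first non-matching term, which is what makes this approach attractive for search-as-you-type suggestions.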

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-04-11 Thread Chris Hostetter
: Not really. The explain scores aren't normalized and I also couldn't : find a way to get the explain data as anything other than a whitespace : formatted text blob from Solr. Keep in mind that they need confidence the default way Solr dumps score explanations is just as plain text, but the

Re: Unicode Normalization

2007-04-11 Thread Chris Hostetter
: I have encountered a problem searching in my application because of : inconsistent unicode normalization forms in the corpus (and the : queries). I would like to normalize to form NFKD in an analyzer (I : think). I was thinking about creating a filter similar to the i'm very naive to the

Re: Standard Parser Behavior

2007-04-11 Thread Daniel Noll
Walt Stoneburner wrote: Does +(A1 A2 A3) +(B1 B2 B3) -(C1 C2 C3) find documents that have at least one A -and- at least one B, but never any Cs? ...to which I'm now given to understand the answer is yes. And understand why. Well, that example would follow standard boolean logic if that's the
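
The matching rule discussed here, at least one A, at least one B, and no C, can be written out as a plain Java predicate. This is only a sketch of the matching logic over sets of terms; it says nothing about scoring, and the names `BooleanClauses` and `matches` are hypothetical:

```java
import java.util.Collections;
import java.util.Set;

class BooleanClauses {
    /** Models +(A1 A2 A3) +(B1 B2 B3) -(C1 C2 C3) as set tests on a document's terms. */
    static boolean matches(Set<String> docTerms,
                           Set<String> anyA,
                           Set<String> anyB,
                           Set<String> noneC) {
        return !Collections.disjoint(docTerms, anyA)   // at least one A term present
            && !Collections.disjoint(docTerms, anyB)   // at least one B term present
            && Collections.disjoint(docTerms, noneC);  // no C term present
    }
}
```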

OutOfMemory Error while searching Index - Help Appreciated.

2007-04-11 Thread Lokeya
I have gone through the mailing list in search of posts for this error. Though there are many, I feel my problem is a little different from those and would like to get some advice on this. Details: 1. Using a machine with RAM 2GB 2. Created an Index of size 200 MB. 3. Trying to do a search on this for

Re: OutOfMemory Error while searching Index - Help Appreciated.

2007-04-11 Thread Erick Erickson
That certainly seems odd. How much memory are you allocating your JVM? Erick On 4/11/07, Lokeya [EMAIL PROTECTED] wrote: I have gone through the mailing list in search of posts for this error. Though there are many, I feel my problem is little different from that and like to get some advice

Re: strange idf in Lucene 2.1

2007-04-11 Thread Koji Sekiguchi
Yonik, Thank you for your explanation. In passing, I became aware of this issue through my customer. They are using Solr. To reproduce the issue with Solr, post exampledocs/*.xml twice and issue a query with q=ipod&debugQuery=on. This should be the same for Lucene 2.0 and 2.1. I understand. But I think we

Re: OutOfMemory Error while searching Index - Help Appreciated.

2007-04-11 Thread Lokeya
It is using the default size allocated by the OS, and I have no idea how much that is exactly. But when I use -Xmx1024m and run, this does not occur. Also, I have now changed that loop to keep only the Document hitDoc = hits.doc(i); line, and that's where it starts throwing the error. But I

Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: : I have encountered a problem searching in my application because of : inconsistent unicode normalization forms in the corpus (and the : queries). I would like to normalize to form NFKD in an analyzer (I : think). I was thinking about

Re: Unicode Normalization

2007-04-11 Thread Yonik Seeley
On 4/11/07, Mike Klaas [EMAIL PROTECTED] wrote: Unicode characters do not map precisely to code points: a single character can often be represented via a single codepoint or a combination of two (surrogate pair). I normally hear surrogates in the context of UTF-16 after the code point space

Re: Unicode Normalization

2007-04-11 Thread Daniel Noll
Yonik Seeley wrote: have no idea how java's String class handles this--I doubt it does any intelligent normalization. UTF-16 surrogates are handled as of Java5. And as of Java6 we have the java.text.Normalizer utility. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW
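
As a small illustration of the two points in this thread, the sketch below uses `java.text.Normalizer` (available from Java 6) for NFKD and `String.codePointCount` to separate two things that are easy to mix up: a surrogate pair is two UTF-16 chars encoding one code point, while a decomposed accented letter is two code points. The class and method names (`NfkdDemo`, `nfkd`, `codePoints`) are hypothetical:

```java
import java.text.Normalizer;

class NfkdDemo {
    /** Normalizes text to Unicode form NFKD (compatibility decomposition). */
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Number of Unicode code points, counting each surrogate pair once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For instance, `nfkd("\u00e9")` yields `"e\u0301"` (two code points), and `nfkd("\ufb01")` yields `"fi"`. A supplementary character such as `"\uD835\uDD0A"` has `length()` 2 but `codePoints` 1. A filter that applies `nfkd` to each token at both index and query time is one way the normalization discussed above could be made consistent.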

Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/11/07, Mike Klaas [EMAIL PROTECTED] wrote: Unicode characters do not map precisely to code points: a single character can often be represented via a single codepoint or a combination of two (surrogate pair). I normally hear surrogates