Re: QueryParser behaviour ..

2006-02-17 Thread sergiu gordea

Yonik Seeley wrote:


From the user's point of view, I think it would make sense to
build a phrase query only when quotes are found in the search string.
   



You make an interesting point Sergiu.  Your proposal would increase
the expressive power of the QueryParser by allowing the construction
of either phrase queries or boolean queries when multiple tokens are
produced by analysis.

The main downside is that it's not backward compatible, and without
quotes (and hence phrase queries) many older queries will produce
worse results.  I also think that a majority of the time, when
multiple tokens are produced, you do want a phrase search (or at least
a sloppy one).

Of course, the backward compatible thing can be fixed via a flag on
the query parser that defaults to the old behavior.
 

You are right, it can be a property of QueryParser similar to the AND/OR
behaviour.
This will also solve backward compatibility ... and will implement the
behaviour I expect as well.


Best,

Sergiu


-Yonik






Re: QueryParser behaviour ..

2006-02-15 Thread sergiu gordea

Chris Hostetter wrote:


: Exactly, this is my question: why does the QueryParser create a PhraseQuery
: when it gets several tokens from the analyzer,
: and not a BooleanQuery?

Because if it did that, there would be no way to write phrase queries :)
 


I'm not very sure about this ...


QueryParser only returns a BooleanQuery when *it* can tell you have
several clauses.  For each chunk of text that it thinks of as one
continuous piece of text (either because it doesn't contain whitespace or
 

Wouldn't it be better to let the analyzer decide whether there is a
continuous piece of text,

and to build PhraseQueries only when the quote sign is found?


because it has quotes around it) it gives it to the analyzer; if the
analyzer says there are multiple Terms there, then QueryParser makes a
PhraseQuery out of it.   Or, in a nutshell:
  1) if the Parser detects multiple terms, it makes a boolean query
  2) if the Analyzer detects multiple terms, it makes a phrase query
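As a small illustration of these two rules (the class name NutshellDemo and the field name "field" are made up; the constructors are the Lucene 1.x/2.x-era ones):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class NutshellDemo {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser("field", new StandardAnalyzer());

    // 1) the parser itself sees several whitespace-separated clauses -> BooleanQuery
    Query boolQuery = qp.parse("word1 word2 word3");
    System.out.println(boolQuery);    // field:word1 field:word2 field:word3

    // 2) one unquoted chunk, but the analyzer splits it on the commas -> PhraseQuery
    Query phraseQuery = qp.parse("word1,word2,word3");
    System.out.println(phraseQuery);  // field:"word1 word2 word3"
  }
}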
 

This is related to my comment above. From the user's point of view, I think
it would make sense to build a phrase query only when quotes are found in
the search string.

I think there are pro and con arguments for unifying the behaviour.
I would be happy if the QueryParser didn't create phrase queries unless I
explicitly asked it to.


Does someone have a different opinion?


if you don't like this behavior, it can all be circumvented by overriding
getFieldQuery().  you don't even have to deal with the analyzer if you
don't want to.  just call super.getFieldQuery() and if you get back a
PhraseQuery take it apart and build TermQueries wrapped in a boolean
query.
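A rough sketch of that override (the class name is made up; it assumes a Lucene version where the override point is getFieldQuery(String, String) and BooleanClause.Occur exists; older releases pass the Analyzer as an extra argument and use the boolean add() flags instead):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TermOrQueryParser extends QueryParser {

  public TermOrQueryParser(String field, Analyzer analyzer) {
    super(field, analyzer);
  }

  protected Query getFieldQuery(String field, String queryText) throws ParseException {
    Query q = super.getFieldQuery(field, queryText);
    if (!(q instanceof PhraseQuery)) {
      return q;  // a single term (or null) -- keep it as-is
    }
    // take the PhraseQuery apart and OR the individual terms together
    Term[] terms = ((PhraseQuery) q).getTerms();
    BooleanQuery bq = new BooleanQuery();
    for (int i = 0; i < terms.length; i++) {
      bq.add(new TermQuery(terms[i]), BooleanClause.Occur.SHOULD);
    }
    return bq;
  }
}

Note that, depending on the Lucene version, explicitly quoted phrases may pass through the same method, so a real implementation would probably also want the backward-compatibility switch mentioned above, defaulting to the current behaviour.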
 

Well, there is always a workaround. It is obvious that
searching for "word1,word2,word3" was a
silly mistake, but I needed an hour to find out why a PhraseQuery was
created when no quotes existed in the query string.


So ... my opinion is that what I suggest would improve the usability of
Lucene; I hope that the Lucene developers share
my opinion.


Best,

Sergiu





-Hoss





Re: QueryParser behaviour ..

2006-02-14 Thread sergiu gordea

Chris Hostetter wrote:


:  I built a wrong query string "word1,word2,word3" instead of "word1
: word2 word3";
: therefore I got a wrong query: field:"word1 word2 word3" instead of
: field:word1 field:word2 field:word3.
:
:  Is this an expected behaviour?
:  I used StandardAnalyzer; probably that is why the commas were replaced
: with spaces.

the commas weren't replaced ... your analyzer split on them and threw
them away.

the key to understanding why that resulted in a phrase query instead of
three term queries is that QueryParser doesn't treat the comma as a special
character, so it saw the string "word1,word2,word3" and gave it to your
analyzer.  Since your analyzer gave back several tokens, QueryParser built
a phrase query out of it.
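For illustration, this is roughly how one can see what the analyzer hands back for that string (assuming StandardAnalyzer and the old TokenStream.next()/Token.termText() API; the field name is irrelevant here):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDemo {
  public static void main(String[] args) throws Exception {
    TokenStream stream = new StandardAnalyzer()
        .tokenStream("field", new StringReader("word1,word2,word3"));
    // three separate tokens come back, so QueryParser builds a PhraseQuery
    for (Token t = stream.next(); t != null; t = stream.next()) {
      System.out.println(t.termText());
    }
  }
}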
 

Exactly, this is my question: why does the QueryParser create a PhraseQuery
when it gets several tokens from the analyzer, and not a BooleanQuery?


likewise, in the case of "word1 word2 word3" the quotes *are* a special
character to QueryParser, which tells it that it should *not* split on the
spaces between the quotes and hand the individual words to the analyzer;
instead it hands the whole thing to the analyzer as one big string again.

 

It was not that situation; the string was without quotes (String
searchString = "word1,word2,word3";).

I was just using Java quotes to delimit the string.


:  Is this a bug? Does it make sense to indicate this situation through a
: Parse Exception?

a parse error should really only come up when the query parser sees a
character that it does consider special, but sees it in a place that
doesn't make sense (or doesn't see one in a place where it needs one).  in this
case QP is perfectly happy to let you query for a word that contains a
comma -- it's your analyzer that's putting its foot down and saying that
can't be in a word.
 

OK ... so it is not a case for a ParseException. Should situations like this
(the change from a TermQuery to a PhraseQuery) be indicated in log files?
I mean, this would help developers debug their code more easily.


Best,

Sergiu



-Hoss









QueryParser behaviour ..

2006-02-10 Thread sergiu gordea

 Hi all,

I built a wrong query string "word1,word2,word3" instead of "word1
word2 word3";
therefore I got a wrong query: field:"word1 word2 word3" instead of
field:word1 field:word2 field:word3.


Is this an expected behaviour?
I used StandardAnalyzer; probably that is why the commas were replaced
with spaces.

Indeed there was no space between the words, just commas.

Is this a bug? Does it make sense to indicate this situation through a 
Parse Exception?


Best,

 Sergiu




Re: Indexing multiple languages

2005-06-07 Thread sergiu gordea

Tansley, Robert wrote:


Hi all,

The DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it.  Now the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences in terms of performance for
1/, but I can see that false hits could turn up where one word appears
in different languages (stemming could increase the chances of this).
Also some languages' analyzers are quite dramatically different (e.g.
the Chinese one, which just treats every character as a separate
token/word).
 


On the other hand, if people are searching for proper nouns in metadata
(e.g. DSpace) it may be advantageous to search all languages at once.


I'm also not sure of the storage and performance consequences of 2/.

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.  
 

But this would be the most robust solution. You have to differentiate
between languages anyway, and as you pointed out, you can differentiate by
adding a Keyword field for the language, or you can create different
indexes.

If you need to use complex search strings over multiple fields and
indexes, then I recommend using the QueryParser to build the query.
When you instantiate a QueryParser you will need to provide an analyzer,
which will be different for different languages.
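As a rough sketch of how options 2/ and 3/ combine with per-language analyzers (the field names, language codes and analyzer choices are only examples; GermanAnalyzer comes from the separate analyzers contrib/sandbox jar):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class LanguageAwareSearch {

  // pick an analyzer per language code (the mapping is just an example)
  static Analyzer analyzerFor(String language) {
    if ("de".equals(language)) {
      return new GermanAnalyzer();
    }
    return new StandardAnalyzer();
  }

  // option 2/: tag every document with its language as a Keyword field,
  // so searches can be constrained with a clause on "language"
  static Document makeDocument(String language, String title) {
    Document doc = new Document();
    doc.add(Field.Keyword("language", language));  // stored, not analyzed
    doc.add(Field.Text("title", title));           // analyzed full text
    return doc;
  }

  // the QueryParser gets the analyzer that matches the user's language
  static Query parseUserQuery(String language, String userInput) throws Exception {
    return new QueryParser("title", analyzerFor(language)).parse(userInput);
  }
}

For option 3/, the same analyzerFor() choice would simply be applied to a separate index per language instead of a language field.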

I think the differences in performance won't be noticeable between the
2nd and 3rd solutions, but from a maintenance point of view, I would
choose the third solution.

Of course there are other factors that must be taken into account when
designing such an application:
the number of documents to be indexed, the number of document fields, the
index change frequency, the server load (number of concurrent sessions), etc.


Hope these hints help you a little,

Best,

Sergiu




Does anyone have any thoughts or recommendations on this?

Many thanks,

Robert Tansley / Digital Media Systems Programme / HP Labs
 http://www.hpl.hp.com/personal/Robert_Tansley/








Re: Finding minimum and maximum value of a field?

2005-06-07 Thread sergiu gordea



Kevin Burton wrote:

I have an index with a date field.  I want to quickly find the minimum 
and maximum values in the index.


Is there a quick way to do this?  I looked at using TermInfos and
finding the first one, but how do I find the last?


I also tried the new sort API and the performance was horrible :-/

Any ideas?


You may keep track of the MIN and MAX values in an external file.
For example, you can write the MIN_DATE and MAX_DATE to a text file,
and keep them up to date when indexing and deleting documents.
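A minimal sketch of that bookkeeping, assuming dates are stored as sortable strings (e.g. yyyyMMdd) and using a plain properties file (the class, file and key names are made up):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class DateRangeTracker {

  private final File file;
  private final Properties props = new Properties();

  public DateRangeTracker(File file) throws Exception {
    this.file = file;
    if (file.exists()) {
      FileInputStream in = new FileInputStream(file);
      try { props.load(in); } finally { in.close(); }
    }
  }

  // call this for every document added to the index
  public void recordDate(String date) throws Exception {
    String min = props.getProperty("MIN_DATE");
    String max = props.getProperty("MAX_DATE");
    if (min == null || date.compareTo(min) < 0) props.setProperty("MIN_DATE", date);
    if (max == null || date.compareTo(max) > 0) props.setProperty("MAX_DATE", date);
    FileOutputStream out = new FileOutputStream(file);
    try { props.store(out, "index date range"); } finally { out.close(); }
  }

  public String getMinDate() { return props.getProperty("MIN_DATE"); }
  public String getMaxDate() { return props.getProperty("MAX_DATE"); }
}

If the document holding the current minimum or maximum is deleted, the file has to be refreshed by rescanning the date terms, so this works best for mostly-additive indexes.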

Best,

Sergiu



Kevin







Re: *term (SuffixQueries)

2005-05-25 Thread sergiu gordea


  Hi all,

I am sending this email to make a correction to the solution that enables
SuffixQueries.


The previous definition of WILDTERM was buggy: it split a term into
two terms,
e.g. term:te*st was parsed to term:te* term:st, which of course
was wrong.


HERE is the right way to do it ...

<DEFAULT> TOKEN : {
...
| <WILDTERM:  (( [ "*", "?" ] )* <_TERM_START_CHAR> (<_TERM_CHAR> | ( [ "*", "?" ] ))* ) >

...

Erik (or another Lucene developer), can you please update the comments in
QueryParser.jj to include this correction?
The existing suggestion doesn't throw a ParseException if the user
tries to use "*-" or similar combinations;
it throws some OutOfBoundsException or NPE instead. My definition throws a
ParseException
that can be caught, so the user can be told that the given string is an
invalid search string ...


What needs to be done is to change:

// OG: to support prefix queries:
// http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12137
// Change from:
// | <WILDTERM:  <_TERM_START_CHAR>
//               (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
// To:
//
// | <WILDTERM:  (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
//
//SG: or better, this definition
//| <WILDTERM:  (( [ "*", "?" ] )* <_TERM_START_CHAR> (<_TERM_CHAR> | ( [ "*", "?" ] ))* ) >
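With the corrected definition in place, calling code along these lines could catch the ParseException and report the invalid search string (the class name, field name and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SafeParse {
  public static void main(String[] args) {
    QueryParser parser = new QueryParser("field", new StandardAnalyzer());
    try {
      Query q = parser.parse("*-");  // not a valid search string
      System.out.println(q);
    } catch (ParseException e) {
      // with the corrected WILDTERM definition this is the failure mode,
      // so the application can report a bad query instead of crashing
      System.out.println("Invalid search string: " + e.getMessage());
    }
  }
}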


sergiu gordea wrote:


Tim Lebedkov (UPK) wrote:


Hi,

is there a way to make QueryParser accept *term?
 


Yes, if you apply a patch to the Lucene sources.
Search for the *term search discussion in the Lucene mailing list archive.

Best,

 Sergiu


thank you
--Tim












Re: QueryParser refactoring

2005-03-09 Thread sergiu gordea
Doug Cutting wrote:
sergiu gordea wrote:
So ... here is an example of how I parse a simple query string
provided by a user ...

the user checks a few flags and writes "test ko AND NOT bo"
and the resulting query.toString() is saved in the database:
+(+(subject:test description:test keywordsTerms:test 
koProperties:test attachmentData:test) +(subject:ko description:ko 
keywordsTerms:ko koProperties:ko attachmentData:ko) -(subject:bo* 
description:bo* keywordsTerms:bo* koProperties:bo* 
attachmentData:bo*)) +creator:2 
+classType:package.share.om.knowledgeobject +skillLevel:0 
+(keywords:1000 keywords:1020)

I think you agree that it is better to save this string in the database
instead of creating a
CustomQuery class that implements Serializable and saving that in the
database.

Your application will be more robust if you instead stored the checked
flags and "test ko AND NOT bo" in the database and then re-generated
the Lucene query as needed.

For example, if you wanted to add an "author" field that was searched
by default, then all of the queries in your database would be
invalid.  Also, more to the point, if Query.toString() changes, the
semantics of your queries might change, or if the QueryParser changes
they might even become unparsable.
You are right ...  The problem is that the generated String is used in
extended search functionality, which is quite often improved. Storing
the "test ko AND NOT bo" string is not enough to regenerate the query,
because all the other components of the query depend on user data.
Yes, it is better to store two Strings in the database, "test ko AND NOT
bo" and "+creator:2 +classType:package.share.om.knowledgeobject
+skillLevel:0 +(keywords:1000 keywords:1020)",
and then I'll be able to reconstruct the query at runtime.
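A rough sketch of that reconstruction (the class and method names are made up; it assumes the fields shown in the toString() above, MultiFieldQueryParser's static parse(String, String[], Analyzer), and the BooleanClause.Occur style of BooleanQuery.add(); older releases use boolean required/prohibited flags instead):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SavedSearch {

  private static final String[] TEXT_FIELDS = {
      "subject", "description", "keywordsTerms", "koProperties", "attachmentData" };

  // rebuild the full query from the stored user input plus the user's data,
  // instead of re-parsing a stored Query.toString()
  static Query rebuild(String userInput, String creatorId) throws Exception {
    // the free-text part the user typed, expanded over the searchable fields
    Query userPart = MultiFieldQueryParser.parse(
        userInput, TEXT_FIELDS, new StandardAnalyzer());

    BooleanQuery query = new BooleanQuery();
    query.add(userPart, BooleanClause.Occur.MUST);
    // the constraints that depend on user data are built programmatically;
    // classType, skillLevel and the keywords clauses would be added the same way
    query.add(new TermQuery(new Term("creator", creatorId)), BooleanClause.Occur.MUST);
    return query;
  }
}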

I chose to store the query.toString() because parsing this string
required writing just one line of code and it worked perfectly
(one line of code also means less maintenance effort).

I'm not a very experienced software developer (I'm still young :)) ), but
I've already met some situations where I needed to make some
transformations reversible
(I mean something like Query => String => Query, with the constraint that
the initial query equals the final query).

OK ... I give up ... if this feature is too hard to implement, the
solution will be to work around it in my source code.


The general rule is that the QueryParser should only be used to 
directly parse user input.  Programs should not generate strings to 
pass to QueryParser.  Query.toString() is a program-generated string.  
If you must save a query, save the user's input.
You are right, but I already explained why it was much easier to
store the generated query String.
And ... there are also some other things to consider. What I explained
here is the implementation of the "saved search" concept.
The search must return the documents that were found the first time plus
newly added documents, even if the structure of the index has changed.

This is my problem, not Lucene's; I just wanted to show you how useful it
is for me that the Query => String transformation is reversible.
Of course there are always alternative solutions.

Best,
Sergiu
Doug
