Re: QueryParser behaviour ..
Yonik Seeley wrote:
>> From the user's point of view I think it will make sense to build a phrase
>> query only when the quotes are found in the search string.
>
> You make an interesting point Sergiu. Your proposal would increase the
> expressive power of the QueryParser by allowing the construction of either
> phrase queries or boolean queries when multiple tokens are produced by
> analysis. The main downside is that it's not backward compatible, and
> without quotes (and hence phrase queries) many older queries will produce
> worse results. I also think that a majority of the time, when multiple
> tokens are produced, you do want a phrase search (or at least a sloppy
> one). Of course, the backward-compatible thing can be fixed via a flag on
> the query parser that defaults to the old behavior.
>
> -Yonik

You are right, it can be a property of QueryParser similar to the AND/OR behaviour. This will also solve backward compatibility, and it will implement the behaviour I expect.

Best,
Sergiu

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryParser behaviour ..
Chris Hostetter wrote:
> : Exactly this is my question, why the QueryParser creates a Phrase query
> : when he gets several tokens from analyzer and not a BooleanQuery?
>
> Because if it did that, there would be no way to write phrase queries :)

I'm not very sure about this ...

> QueryParser only returns a BooleanQuery when *it* can tell you have several
> clauses. For each chunk of text that it thinks of as one continuous piece
> of text (either because it doesn't contain whitespace or because it has
> quotes around it) it gives it to the analyzer; if the analyzer says there
> are multiple Terms there then QueryParser makes a PhraseQuery out of it.

Wouldn't it be better to let the analyzer decide if there is a continuous piece of text, and to build PhraseQueries only when the quote sign is found?

> or in a nutshell:
> 1) if the Parser detects multiple terms, it makes a boolean query
> 2) if the Analyzer detects multiple terms, it makes a phrase query

This is related to my comment above. From the user's point of view I think it makes sense to build a phrase query only when the quotes are found in the search string. I think there are pro and con arguments for unifying the behaviour. I would be happy if the QueryParser didn't create phrase queries unless I explicitly asked it to. Does someone have a different opinion?

> if you don't like this behavior, it can all be circumvented by overriding
> getFieldQuery(). you don't even have to deal with the analyzer if you
> don't want to. just call super.getFieldQuery() and if you get back a
> PhraseQuery take it apart and build TermQueries wrapped in a boolean query.

Well, there is always a workaround. It is obvious that searching for "word1,word2,word3" was a silly mistake, but I needed one hour to find out why a PhraseQuery was created when no quotes existed in the query string. So ... my opinion is that what I suggest will improve the usability of Lucene, and I hope that the Lucene developers share my opinion.
Best,
Sergiu

> -Hoss
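The workaround Hoss describes (call super.getFieldQuery(), and if a PhraseQuery comes back, take it apart and rebuild OR'd term queries) can be sketched in plain Java. The strings below stand in for Lucene Query objects and `fieldQuery` is a made-up name, so this is only an illustration of the decision logic, not the actual QueryParser API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PhraseToBooleanDemo {
    // Sketch: when the analyzer produced several tokens for one unquoted
    // chunk of text, emit OR'd term queries instead of a phrase query.
    // In a real QueryParser subclass you would override getFieldQuery()
    // and rebuild a BooleanQuery from the PhraseQuery's terms.
    static String fieldQuery(String field, List<String> tokens, boolean quoted) {
        if (quoted) {
            // user explicitly asked for a phrase with quotes
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0); // plain term query
        }
        // multiple tokens without quotes -> boolean (OR) query
        return tokens.stream()
                     .map(t -> field + ":" + t)
                     .collect(Collectors.joining(" ", "(", ")"));
    }

    public static void main(String[] args) {
        System.out.println(fieldQuery("f", List.of("word1", "word2", "word3"), false));
        // (f:word1 f:word2 f:word3)
        System.out.println(fieldQuery("f", List.of("word1", "word2", "word3"), true));
        // f:"word1 word2 word3"
    }
}
```

With such a rule, "word1,word2,word3" would yield three OR'd term queries, while quoted input would still yield a phrase.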
Re: QueryParser behaviour ..
Chris Hostetter wrote:
> : I built a wrong query string "word1,word2,word3" instead of "word1 word2
> : word3", therefore I got a wrong query: field:"word1 word2 word3" instead
> : of field:word1 field:word2 field:word3.
> :
> : Is this an expected behaviour? I used Standard analyzer, probably
> : therefore, the commas were replaced with spaces.
>
> the commas weren't replaced ... your analyzer split on them and threw them
> away. the key to understanding why that resulted in a phrase query instead
> of three term queries is that QueryParser doesn't treat comma as a special
> character, so it saw the string word1,word2,word3 and gave it to your
> analyzer. Since your analyzer gave back several tokens, QueryParser built
> a phrase query out of it.

Exactly this is my question: why does the QueryParser create a PhraseQuery when it gets several tokens from the analyzer, and not a BooleanQuery?

> likewise, in the case of "word1 word2 word3" the quotes *are* a special
> character to QueryParser which tells it that it should *not* split on the
> spaces between the quotes and hand the individual words to the analyzer;
> instead it hands the whole thing to the analyzer as one big string again.

It was not this situation; the string was without quotes (String searchString = "word1,word2,word3";), I just preserved Java quotes to delimit the string.

> : Is this a bug? Does it make sense to indicate this situation through a
> : Parse Exception?
>
> a parse error should really only come up when the query parser sees a
> character that it does consider special, but sees it in a place that
> doesn't make sense (or doesn't see one in a place it needs one). in this
> case QP is perfectly happy to let you query for a word that contains a
> comma -- it's your analyzer that's putting its foot down and saying that
> can't be in a word.

Ok, so it is not a case for ParseException, but should situations like this (the change from TermQuery to PhraseQuery) be indicated in log files? I mean, this would help developers to debug their code more easily.
Best,
Sergiu

> -Hoss
QueryParser behaviour ..
Hi all,

I built a wrong query string "word1,word2,word3" instead of "word1 word2 word3", therefore I got a wrong query: field:"word1 word2 word3" instead of field:word1 field:word2 field:word3.

Is this an expected behaviour? I used Standard analyzer; probably that is why the commas were replaced with spaces. Indeed there was no space between the words, just commas. Is this a bug? Does it make sense to indicate this situation through a ParseException?

Best,
Sergiu
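A rough, self-contained imitation of what a whitespace-and-punctuation-splitting analyzer does with such a string may make the behaviour clearer. The real StandardAnalyzer's rules are more involved, so treat this only as a sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class CommaTokenDemo {
    // Rough imitation of tokenization for "word1,word2,word3": the comma
    // is not kept or replaced; it simply ends one token and starts the next.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) tokens.add(part.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // QueryParser hands the whole comma-joined string to the analyzer as
        // ONE chunk (comma is not special to the parser); because the
        // analyzer returns three tokens, QueryParser builds a PhraseQuery.
        System.out.println(tokenize("word1,word2,word3")); // [word1, word2, word3]
    }
}
```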
Re: Indexing multiple languages
Tansley, Robert wrote:
> Hi all,
>
> DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin
> Core standard) and extracted full-text content of documents stored in it.
> Now the system is being used globally, it needs to support multi-language
> indexing.
>
> I've looked through the mailing list archives etc. and it seems it's easy
> to plug in analyzers for different languages. What if we're trying to
> index multiple languages in the same site? Is it best to have:
>
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so searches
>    can be constrained to a particular language
> 3/ separate indices for each language?
>
> I don't fully understand the consequences in terms of performance for 1/,
> but I can see that false hits could turn up where one word appears in
> different languages (stemming could increase the chances of this). Also
> some languages' analyzers are quite dramatically different (e.g. the
> Chinese one which just treats every character as a separate token/word).
> On the other hand, if people are searching for proper nouns in metadata
> (e.g. "DSpace") it may be advantageous to search all languages at once.
>
> I'm also not sure of the storage and performance consequences of 2/.
>
> Approach 3/ seems like it might be the most complex from an
> implementation/code point of view.

But this will be the most robust solution. You have to differentiate between languages anyway, and as you pointed out, you can differentiate by adding a Keyword field for the language, or you can create different indexes. If you need to use complex search strings over multiple fields and indexes, then I recommend you use the QueryParser to build the search string. When you instantiate a QueryParser you will need to provide an analyzer, which will be different for different languages. I think that the difference in performance won't be noticeable between the 2nd and 3rd solutions, but from a maintenance point of view, I would choose the third solution.
Of course there are other factors that must be taken into account when designing such an application: the number of documents to be indexed, the number of document fields, the index change frequency, the server load (number of concurrent sessions), etc.

Hope these hints help you a little,

Best,
Sergiu

> Does anyone have any thoughts or recommendations on this?
>
> Many thanks,
>
> Robert Tansley / Digital Media Systems Programme / HP Labs
> http://www.hpl.hp.com/personal/Robert_Tansley/
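Option 2/ above (one index plus a Keyword language field) can be sketched with a tiny helper that ANDs a required language clause onto the user's query. The `lang` field name and the query-string rendering are illustrative only, not tied to any particular Lucene version:

```java
public class LanguageFilterDemo {
    // Sketch of option 2/: the document's language is stored as an
    // untokenized Keyword-style field, and every search is constrained
    // to one language by a required clause.
    static String constrainToLanguage(String userQuery, String langCode) {
        // Both clauses required: the language filter AND the user's query.
        return "+lang:" + langCode + " +(" + userQuery + ")";
    }

    public static void main(String[] args) {
        System.out.println(constrainToLanguage("title:dspace", "en"));
        // +lang:en +(title:dspace)
    }
}
```

The user's query text would still be parsed with the analyzer appropriate for `langCode`, which is the part that differs per language.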
Re: Finding minimum and maximum value of a field?
Kevin Burton wrote:
> I have an index with a date field. I want to quickly find the minimum and
> maximum values in the index. Is there a quick way to do this? I looked at
> using TermInfos and finding the first one, but how do I find the last? I
> also tried the new sort API and the performance was horrible :-/
>
> Any ideas?
>
> Kevin

You may keep a history of the MIN and MAX values in an external file. Let's say you can write the MIN_DATE and MAX_DATE in a text file, and keep them up to date when indexing or deleting documents.

Best,
Sergiu
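Sergiu's suggestion can be sketched as a small class that keeps MIN_DATE/MAX_DATE in a properties file updated at index time. The file format and key names are made up for this example, and note one caveat the thread doesn't spell out: deleting the document that holds the current minimum or maximum would require a rescan of the index to re-establish that bound:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class DateBounds {
    // Keeps MIN_DATE/MAX_DATE in an external properties file, updated
    // whenever a document is indexed. Dates are lexicographically sortable
    // strings such as "20050131".
    private final Path file;
    private final Properties props = new Properties();

    DateBounds(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (var in = Files.newInputStream(file)) { props.load(in); }
        }
    }

    // Call this with each document's date value at index time.
    void record(String date) throws IOException {
        String min = props.getProperty("MIN_DATE");
        String max = props.getProperty("MAX_DATE");
        if (min == null || date.compareTo(min) < 0) props.setProperty("MIN_DATE", date);
        if (max == null || date.compareTo(max) > 0) props.setProperty("MAX_DATE", date);
        try (var out = Files.newOutputStream(file)) { props.store(out, "index date bounds"); }
    }

    String min() { return props.getProperty("MIN_DATE"); }
    String max() { return props.getProperty("MAX_DATE"); }
}
```

Reading the bounds then costs one small file read instead of a term scan or a sort over the whole index.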
Re: *term (SuffixQueries)
Hi all,

I send this email to make a correction to the solution that enables SuffixQueries. The definition of WILDTERM was a buggy one: it split a term in two, e.g. term:te*st was parsed to term:te* term:st, which of course was wrong. Here is the right way to do it:

DEFAULT TOKEN : {
...
| <WILDTERM: ((["*","?"])* <_TERM_START_CHAR> (<_TERM_CHAR> | (["*","?"]))*)>
...

Erik (or another Lucene developer), can you please update the comments in QueryParser.jj to include this correction? The existing suggestion doesn't throw a ParseException if the user tries to use "*-" or similar combinations, and instead throws some OutOfBoundsException or NPE; my definition throws a ParseException that can be caught, so the user can be told that the given string is an invalid search string.

What needs to be done is to change:

// OG: to support prefix queries:
// http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12137
// Change from:
// | <WILDTERM: <_TERM_START_CHAR> (<_TERM_CHAR> | (["*","?"]))* >
// To:
// | <WILDTERM: (<_TERM_CHAR> | (["*","?"]))* >
//
// SG: or better, this definition
// | <WILDTERM: ((["*","?"])* <_TERM_START_CHAR> (<_TERM_CHAR> | (["*","?"]))*)>

sergiu gordea wrote:
> Tim Lebedkov (UPK) wrote:
>> Hi, is there a way to make QueryParser accept *term?
>
> yes, if you apply a patch to the Lucene sources. Search for "*term search"
> in the Lucene archive.
>
> Best,
> Sergiu

> thank you
> --Tim
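The corrected WILDTERM production can be mirrored as a regular expression to check candidate terms outside the grammar. Here `_TERM_START_CHAR` and `_TERM_CHAR` are approximated as letters and digits, whereas the real grammar admits more characters, so this is only a sketch of the rule's shape:

```java
import java.util.regex.Pattern;

public class WildtermCheck {
    // Regex mirror of the corrected production:
    //   (["*","?"])* _TERM_START_CHAR (_TERM_CHAR | ["*","?"])*
    // i.e. leading wildcards are allowed, but at least one real term
    // character must follow them.
    static final Pattern WILDTERM =
        Pattern.compile("[*?]*[\\p{Alnum}](?:[\\p{Alnum}]|[*?])*");

    static boolean isValid(String term) {
        return WILDTERM.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("*term")); // true: suffix query accepted
        System.out.println(isValid("te*st")); // true: one term, not split in two
        System.out.println(isValid("*-"));    // false: rejected instead of crashing
    }
}
```

Matching the original fix's intent, "*-" and bare "*" fail the pattern, which is where the grammar-level definition raises a catchable ParseException.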
Re: QueryParser refactoring
Doug Cutting wrote:
> sergiu gordea wrote:
>> So ... here is an example of how I parse a simple query string provided
>> by a user ... the user checks a few flags and writes "test ko AND NOT bo"
>> and the resulting query.toString() is saved in the database:
>>
>> +(+(subject:test description:test keywordsTerms:test koProperties:test
>> attachmentData:test) +(subject:ko description:ko keywordsTerms:ko
>> koProperties:ko attachmentData:ko) -(subject:bo* description:bo*
>> keywordsTerms:bo* koProperties:bo* attachmentData:bo*)) +creator:2
>> +classType:package.share.om.knowledgeobject +skillLevel:0
>> +(keywords:1000 keywords:1020)
>>
>> I think you agree that is better to be saved in the database instead of
>> creating a CustomQuery class that implements Serializable and saving it
>> in the database.
>
> Your application will be more robust if you instead store the checked
> flags and "test ko AND NOT bo" in the database and then re-generate the
> Lucene query as needed. For example, if you wanted to add an author field
> that was searched by default, then all of the queries in your database
> would be invalid. Also, more to the point, if Query.toString() changes,
> the semantics of your queries might change, or if the QueryParser changes
> they might even become unparsable.

You are right ... The problem is that the generated String is used in extended search functionality, which is quite often improved. Storing the "test ko AND NOT bo" string is not enough to regenerate the query, because all the other components of the query depend on user data. Yes, it is better to store two Strings in the database, "test ko AND NOT bo" and "+creator:2 +classType:package.share.om.knowledgeobject +skillLevel:0 +(keywords:1000 keywords:1020)", and then I'll be able to reconstruct the query at runtime. I chose to store query.toString() because parsing this string required writing just one line of code and it worked perfectly (one line of code also means less maintenance effort).
I'm not a very experienced software developer (I'm still young :)) ), but I've already met some situations where I needed to make some transformations reversible (I mean something like Query -> String -> Query, with the constraint that the initial query equals the final query). Ok ... I give up ... if this feature is too hard to implement, the solution will be to work around it in my source code.

> The general rule is that the QueryParser should only be used to directly
> parse user input. Programs should not generate strings to pass to
> QueryParser. Query.toString() is a program-generated string. If you must
> save a query, save the user's input.

You are right, but I already explained why it was much easier to store the generated query String. And ... there are also some other things to consider. What I explained here is the implementation of the "Saved Search" concept: the search must return the documents that were found the first time, plus newly added documents, even if the structure of the index has changed. This is my problem, not Lucene's; I just wanted to show you how useful it is for me that the Query -> String transformation is reversible. Of course there are alternative solutions all the time.

Best,
Sergiu

> Doug
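The compromise the thread converges on (store the raw user input and the application-generated constraints as two separate strings, and recombine them at search time instead of persisting Query.toString()) might look like the sketch below. The class and field names are made up for illustration:

```java
public class SavedSearch {
    // What gets persisted for a "Saved Search": the user's input verbatim,
    // plus the application-generated constraints, as two separate strings.
    final String userInput;   // e.g. "test ko AND NOT bo" -- saved as typed
    final String constraints; // e.g. "+creator:2 +skillLevel:0"

    SavedSearch(String userInput, String constraints) {
        this.userInput = userInput;
        this.constraints = constraints;
    }

    // At search time the stored pieces are recombined; userInput would be
    // fed to the *current* QueryParser, so field additions and parser
    // changes no longer invalidate what is stored.
    String rebuild() {
        return "+(" + userInput + ") " + constraints;
    }
}
```

Because the parse happens at search time, the reconstructed query reflects the current field set and analyzer, which is exactly Doug's robustness argument.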