Giang,

        Is it possible you switched up the numbers?  Shouldn't it be:

N(hello) = 77
N(world) = 894
N(hello world) = 45
N(hello OR world) = 926

        If so, then I agree that it seems to work.  I'd be very interested in 
seeing this added back into Nutch.  The instructions for creating a patch are 
here: http://wiki.apache.org/nutch/HowToContribute

Jake.

-----Original Message-----
From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 15, 2006 9:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: Boolean OR QueryFilter

  I don't think we need to modify the query filters. Looking into the code of
BasicQueryFilter, I found that it takes the isRequired and isProhibited flags
as arguments, so as long as we can set the flags correctly, BasicQueryFilter
will take care of the rest.

  I've experimented with my approach. Let N("query") be the number of hits
returned by "query". My program yields:
   N(hello) = 77
   N(world) = 894
   N(hello world) = 926
   N(hello OR world) = 45
  So it satisfies the condition N(hello) + N(world) = N(hello world) +
N(hello OR world). Although satisfying this condition does not necessarily
mean the implementation is correct, failing to do so will definitely
indicate that the approach is wrong. To prove that my implementation is
correct, maybe I need to establish some statistical methods.

  Regards,
   Giang
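
A minimal sketch of this sanity check in Java, assuming the four hit counts have already been collected by hand. The class and method names are hypothetical and not part of Nutch. Besides the sum identity N(a) + N(b) = N(a AND b) + N(a OR b), it also checks the bounds N(a AND b) <= min(N(a), N(b)) and N(a OR b) >= max(N(a), N(b)). The bounds catch swapped AND/OR counts, which the sum identity alone cannot detect because it is symmetric in those two values.

```java
// Hypothetical helper, not part of Nutch: checks hit counts for consistency.
public class HitCountCheck {

    /** Inclusion-exclusion: N(a) + N(b) == N(a AND b) + N(a OR b). */
    public static boolean sumIdentityHolds(long nA, long nB, long nAnd, long nOr) {
        return nA + nB == nAnd + nOr;
    }

    /** AND can match no more than either term; OR can match no fewer. */
    public static boolean boundsHold(long nA, long nB, long nAnd, long nOr) {
        return nAnd <= Math.min(nA, nB) && nOr >= Math.max(nA, nB);
    }

    /** Both necessary conditions together (still not a proof of correctness). */
    public static boolean consistent(long nA, long nB, long nAnd, long nOr) {
        return sumIdentityHolds(nA, nB, nAnd, nOr) && boundsHold(nA, nB, nAnd, nOr);
    }

    public static void main(String[] args) {
        // hello = 77, world = 894, "hello world" (AND) = 45, "hello OR world" = 926
        System.out.println(consistent(77, 894, 45, 926));  // prints "true"
        // Same numbers with the AND and OR counts swapped
        System.out.println(consistent(77, 894, 926, 45));  // prints "false"
    }
}
```

As in the message above, passing these checks only shows that the counts are not obviously wrong.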


On 3/16/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> This looks like a good approach.  Note also that you will probably need
> to change BasicQueryFilter and perhaps other filters to work correctly
> with optional terms.
>
> Nguyen Ngoc Giang wrote:
> > Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating
> > patches :D
> > I'll try to put my solution here first to receive comments from our
> > community. Since we must differentiate 3 possibilities: must have, may
> > have and must not have; we need at least 2 boolean variables in
> > org.apache.nutch.searcher.Query. In fact, these 2 boolean variables are
> > isRequired and isProhibited.
> >
> > -In the first step, I define an OR token separately in the jj file. This
> > will be put before <WORD>. So it will look like this:
> > <OR: "OR">
> >
> > -Second, I define a new function called disjunction:
> > void disjunction() :
> > {}
> > {
> >     <OR> nonOpOrTerm()
> > }
> >
> > -Third, in the function parse(), I declare a boolean variable disj:
> > boolean disj;
> >
> > -Fourth, inside parse(), once we have finished looking ahead, we check for
> > the OR token:
> > ( LOOKAHEAD ... )?
> > // check OR
> > (disjunction() { disj = true; })*
> >
> > -Finally, I changed the handling portion in parse():
> > if (stop
> >           && field == Clause.DEFAULT_FIELD
> >           && terms.size()==1
> >           && isStopWord(array[0])) {
> >         // ignore stop words only when single, unadorned terms in
> >         // default field
> >       } else {
> >         if (prohibited)
> >           query.addProhibitedPhrase(array, field);
> >         else if (disj)
> >           query.addOptionalPhrase(array, field);
> >         else
> >           query.addRequiredPhrase(array, field);
> >       }
> >
> >   After this point, I have finished changing the jj file. Please note that
> > I also have to add the method addOptionalPhrase() in
> > org.apache.nutch.searcher.Query. This method basically sets
> > isRequired=false and isProhibited=false. The rest has been taken care of
> > by Nutch already.
> >
> >   Regards,
> >   Giang
> >
> >
> > On 3/15/06, Laurent Michenaud <[EMAIL PROTECTED]> wrote:
> >
> >>I would like to use Boolean Query too :)
> >>
> >>-----Original Message-----
> >>From: Alexander Hixon [mailto:[EMAIL PROTECTED]
> >>Sent: Wednesday, 15 March 2006 08:38
> >>To: nutch-user@lucene.apache.org
> >>Subject: RE: Boolean OR QueryFilter
> >>
> >>Maybe you could post the code on JIRA, if anyone else wishes to use
> >>Boolean operators in their search queries..? We could probably get a
> >>developer or two to put this in the 0.8 release? Since it IS open
> >>source. ;)
> >>
> >>Just a thought,
> >>Alex
> >>
> >>-----Original Message-----
> >>From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED]
> >>Sent: Wednesday, 15 March 2006 3:45 PM
> >>To: nutch-user@lucene.apache.org; [EMAIL PROTECTED]
> >>Subject: Re: Boolean OR QueryFilter
> >>
> >>  Hi David,
> >>
> >>  I also did a similar task. In fact, I hacked into jj code to add the
> >>definition for OR and NOT. If you need any help, don't hesitate to
> >>contact me :).
> >>
> >>  Regards,
> >>   Giang
> >>
> >>PS: I also believe that a hack to jj code is necessary.
> >>
> >>On 3/8/06, David Odmark <[EMAIL PROTECTED]> wrote:
> >>
> >>>Hi all,
> >>>
> >>>We're trying to implement a nutch app (version 0.8) that allows for
> >>>Boolean OR e.g. (this OR that) AND (something OR other). I've found
> >>>some relevant posts in the mailing list archive, but I think I'm
> >>>missing something. For example, here's a snippet from a post from Doug
> >>>Cutting:
> >>
> >>><snip>
> >>>that said, one can implement OR as a filter (replacing or altering
> >>>BasicQueryFilter) that scans for terms whose text is "OR" in the
> >>>default field.
> >>></snip>
> >>>
> >>>The problem I'm finding is that the NutchAnalysis analyzer seems to be
> >>>swallowing all boolean terms by the time the QueryFilter is even
> >>>executed (perhaps because OR is a stop word?). To wit:
> >>>
> >>>String queryText = "this OR that";
> >>>org.apache.nutch.searcher.Query query =
> >>>    org.apache.nutch.searcher.Query.parse(queryText, conf);
> >>>for (int i = 0; i < query.getTerms().length; i++) {
> >>>    System.out.println("Term = " + query.getTerms()[i]);
> >>>}
> >>>
> >>>This results in output that looks like this:
> >>>
> >>>Term = this
> >>>Term = that
> >>>
> >>>So am I correct in believing that in order to implement boolean OR
> >>>using Nutch search and a QueryFilter, one must also (minimally) hack
> >>>the NutchAnalysis.jj file to produce a new analyzer? Also, given that
> >>>a Nutch Query object doesn't seem to have a method to add a
> >>>non-required Term or Phrase, does that need to be modified as well?
> >>>
> >>>Sorry for the long post, and thanks in advance...
> >>>
> >>>-David Odmark
> >>>
> >>>
> >>>
> >>
> >>
> >
>
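
The three clause states discussed in this thread (must have, may have, must not have) can be sketched in plain Java as follows. This is a self-contained illustration under stated assumptions, not the actual org.apache.nutch.searcher.Query code: the classes ClauseFlagsSketch, Clause, SketchQuery and the enum Occur are hypothetical stand-ins, the field argument is omitted for brevity, and only the method names addRequiredPhrase, addProhibitedPhrase and addOptionalPhrase and the flags isRequired and isProhibited come from the messages above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not the real Nutch classes.
public class ClauseFlagsSketch {

    public enum Occur { REQUIRED, OPTIONAL, PROHIBITED }

    /** A clause carrying the two boolean flags named in the thread. */
    public static final class Clause {
        final String[] terms;
        final boolean isRequired;
        final boolean isProhibited;

        Clause(String[] terms, boolean isRequired, boolean isProhibited) {
            this.terms = terms;
            this.isRequired = isRequired;
            this.isProhibited = isProhibited;
        }

        /** Two booleans encode three states; prohibited takes precedence. */
        public Occur occur() {
            if (isProhibited) return Occur.PROHIBITED;
            return isRequired ? Occur.REQUIRED : Occur.OPTIONAL;
        }
    }

    public static final class SketchQuery {
        private final List<Clause> clauses = new ArrayList<>();

        public void addRequiredPhrase(String[] terms)   { clauses.add(new Clause(terms, true, false)); }
        public void addProhibitedPhrase(String[] terms) { clauses.add(new Clause(terms, false, true)); }
        // As described in the thread: isRequired=false and isProhibited=false.
        public void addOptionalPhrase(String[] terms)   { clauses.add(new Clause(terms, false, false)); }

        public List<Clause> clauses() { return clauses; }
    }

    /** Mirrors the parser branch quoted above: prohibited, then OR, else required. */
    public static void addClause(SketchQuery q, String[] terms, boolean prohibited, boolean disj) {
        if (prohibited) {
            q.addProhibitedPhrase(terms);
        } else if (disj) {
            q.addOptionalPhrase(terms);
        } else {
            q.addRequiredPhrase(terms);
        }
    }

    public static void main(String[] args) {
        SketchQuery q = new SketchQuery();
        addClause(q, new String[] {"hello"}, false, false); // plain term
        addClause(q, new String[] {"world"}, false, true);  // term preceded by OR
        addClause(q, new String[] {"spam"}, true, false);   // prohibited term
        for (Clause c : q.clauses()) {
            System.out.println(c.terms[0] + " -> " + c.occur());
        }
    }
}
```

A filter that reads the two flags can then treat a clause with both flags false as optional, which is the behaviour Giang attributes to BasicQueryFilter.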
