You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract "en", "wikipedia" and "org"). You can also return the original HOST token and its components.
I hope this helps. Shai On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi <prashullega...@gmail.com > wrote: > Hi Phil, > > The query you gave did work. Well, that proves StandardAnalyzer has a > different way > of tokenizing URLs. > > Thanks, > Prashant. > > On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan <phil...@gmail.com> wrote: > > > Hi Prashant, > > > > I agree with Shai, that using Luke and printing out what the Document > > looks like before it goes into the index, are going to be your best > > bet for debugging this problem. > > > > The problem you're having is that StandardAnalyzer does not break-up > > the hostname into separate terms, as it has a special case for > > hostnames and acronyms. > > > > This should work... > > +title:"rahul dravid" +url:"en.wikipedia.org" > > > > Thanks, > > Phil > > > > On Sun, Aug 2, 2009 at 10:14 AM, prashant > > ullegaddi<prashullega...@gmail.com> wrote: > > > Yes, I'm sure that title:"Rahul Dravid" is extracted properly, and > there > > is > > > a document relevant to this query as well. > > > The following query and its results proves it: > > > > > > Enter query: > > > Searching for: +title:"rahul dravid" +url:wiki > > > 4 total matching documents > > > trec-id: clueweb09-enwp02-13-14368, URL: > > > http://en.wikipedia.org/wiki/Rahul_Dravid > > > trec-id: clueweb09-enwp01-83-11378, URL: > > > http://en.wikipedia.org/wiki/Rahul_S_Dravid > > > trec-id: clueweb09-en0011-08-22737, URL: > > > http://www.reference.com/browse/wiki/Rahul_Dravid > > > trec-id: clueweb09-enwp01-69-13556, URL: > > > http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid > > > Press (q)uit or enter number to jump to a page. > > > > > > But see following query: > > > > > > Enter query: > > > +title:"rahul dravid" +url:"wikipedia" > > > Searching for: +title:"rahul dravid" +url:wikipedia > > > 0 total matching documents > > > Press (q)uit or enter number to jump to a page. > > > > > > Isn't it weird? > > > > > > -- Prashant. > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > >