Re: Indexing multiple keywords in one field?

Erik Hatcher Sun, 29 May 2005 16:39:05 -0700


On May 29, 2005, at 8:29 AM, Doug Hughes wrote:

Hi,
I'm working on a pretty typical web page search system based on lucene. Pretty much everything works great. However, I'm having one problem. I want to have a feature in this system where I can find all pages which link to another page. So, for instance, I might search for all the pages linked to http://www.foobar.com/index.html. The search term does not need to be fuzzy in any way. http://www.foobar.com would not match http://www.foobar.com/. The thing is that for any given document I could have any number of associated links.
I think that each page's links could be treated as an array of keywords. However, I don't know the best practice for indexing this data or how to find matches for specific links.

One possibility is to extract the links (XPath could do this with a "//a" pattern) during a parsing phase, not during an analyzer. Build a list of links and index each one as a separate Field.Keyword() field for a single document.
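As a rough illustration of that parsing phase, here is a minimal sketch using only the JDK's XML and XPath classes. It assumes the page is well-formed XHTML without a default namespace (real-world HTML usually needs a lenient parser first), and the class and method names are hypothetical. It only collects the href values; handing each one to Lucene as its own keyword field is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls link targets out of a page before indexing.
final class LinkExtractor {

    // Returns the href value of every <a> element, in document order.
    // Assumption: the input is well-formed XML with no default namespace.
    static List<String> extractLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "//a/@href" selects the href attribute of each anchor;
            // anchors without an href are skipped.
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse page", e);
        }
    }
}
```

Each string in the returned list would then be added to the document as a separate, untokenized field value, so that a search has to match the whole URL exactly.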

I tried creating a LinebreakAnalyzer which (I think) tokenized phrases based on CRs and LFs. I converted the array of links to a list of links delimited by LFs. When indexing I used the PerFieldAnalyzerWrapper and set the links field to use the LinebreakAnalyzer. My understanding is that the lucene index should now have each of the links indexed as separate terms or keywords (sorry if my vocabulary is wrong!)

Links are broken per line? For general HTML parsing you certainly cannot assume that, but maybe in your documents you can? I'd be surprised at that though.

Now, all that seems to work fine. However, when I search I build a query using this code:

QueryParser.parse(link, "links", new LinebreakAnalyzer())

The link is the link I'm searching for, "links" is the field I'm searching. I'm using the same analyzer I used to index the links. The problem is I don't get any matches at all when I execute the search.
Does anyone know of any better techniques for this? Or does anyone see anything I'm doing wrong?

The first thing to do is ensure what you think was indexed really was. I highly recommend you get Luke - http://www.getopt.org/luke/ - and explore the index you've built and see what terms were indexed in your links field. Then experiment using the TermQuery API for the _exact_ terms indexed. Only then move up to QueryParser if you need that kind of thing, using Query.toString() to dump the generated query instance and see what it is made of. QueryParser introduces a level of complexity that can be confusing because there are query expression operators, parsing, and analysis all mixed together - and some characters in a URL within a QueryParser expression will need to be escaped to be interpreted properly (like the ":" in "http://").
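To make the escaping point concrete, here is a small hand-rolled sketch in plain Java. It is not Lucene's own API: the class name and the list of special characters are assumptions based on the operators described in the query syntax documentation. It puts a backslash in front of each character that QueryParser would otherwise read as syntax.

```java
// Hypothetical helper, not part of Lucene: backslash-escapes characters
// that the query syntax treats as operators.
final class QueryEscaper {

    // Assumed set of special characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 8);
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix the operator character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this sketch, escape("http://www.foobar.com/index.html") yields http\://www.foobar.com/index.html, so the ":" is no longer read as a field separator. Escaping only gets the expression past the parser; the analyzer still decides which terms are actually searched, which is why checking the index with Luke and a plain TermQuery comes first.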


    Erik


