indexing anchor text

Tim Sturge Wed, 27 Jun 2007 14:34:59 -0700

Hi,

I'm trying to index some fairly standard html documents. For each of the documents, there is a unique <title> (which I believe is generally of high quality), some <body> content, and some anchor text from the linking documents (which is of good but more variable quality).


I'm indexing them into three fields: "title", "anchor" and "body".

"title" and "body" are obvious (you just give the text to the StandardAnalyzer) but I don't really know how to handle the anchor text. Suppose I know that the page with the title "United States" has the anchor text "USA" 500 times, "United States" 200 times, "United States of America" 100 times and "Unite Stats" once.


How do I index this?

1) index a single "anchor" field containing "USA United States United States of America Unite Stats",

2) create the field "USA USA ...500x... USA United States ...200x... United States ... " and index that as "anchor"

3) create 801 "anchor" fields (500 containing "USA", etc.)

4) create 4 "anchor" fields and call setBoost() on each with some constants. (How do I calculate them?)
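
To make 1) and 2) concrete, here is a minimal sketch in plain Java of how the field text could be assembled from the anchor counts. The class and method names (AnchorFieldText, distinctPhrases, repeatedPhrases, example) are made up for illustration, and there are no Lucene calls in it; the resulting string is simply what would be handed to the analyzer as the "anchor" field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds the text for the "anchor" field.
final class AnchorFieldText {

    /** Option 1: every distinct anchor phrase once, separated by spaces. */
    static String distinctPhrases(Map<String, Integer> counts) {
        return String.join(" ", counts.keySet());
    }

    /** Option 2: every phrase repeated as many times as it was seen. */
    static String repeatedPhrases(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    /** The counts from the "United States" example, in insertion order. */
    static Map<String, Integer> example() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("USA", 500);
        counts.put("United States", 200);
        counts.put("United States of America", 100);
        counts.put("Unite Stats", 1);
        return counts;
    }
}
```

Calling AnchorFieldText.repeatedPhrases(AnchorFieldText.example()) gives one long string of 1302 space-separated words (500 + 2*200 + 4*100 + 2), which is part of why I worry about the cost of 2) and 3).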

I suspect these give me different results in some way, but I'm having trouble understanding what the difference between 2) and 3) is and how to make 4) work like 3). I also worry that 2) and 3) are much slower than they need to be.


Any help is appreciated,

Tim



