I'm planning on indexing XML/HTML files. I only want to index the text
contained in the files and not any of the elements or tags. I just finished
reading Chapter 6 of "Ferret" (Balmain/O'Reilly) that presented a solution
for this issue. The essence of the solution was to parse the XML/HTML and
extract the text content using a parser such as Hpricot. My concern is that
this approach will not support highlighting of the results [correct me if
I'm wrong here] since the corresponding indexed field will only contain text
without the elements and tags that are necessary to indicate the position of
the text. Question: wouldn't a better approach be to implement a tokenizer
that ignores XML/HTML tags and preserves the positions of the appropriately
indexed items? If this is indeed a better approach, does such a solution
already exist or, alternatively, how can I contribute one once I implement it?
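
To make the idea concrete, here is a rough sketch of the kind of tokenizer I
have in mind. It is plain Ruby and does not use Ferret's analysis API at all;
the names MarkupToken and tokenize_markup are made up for illustration. It
skips anything between "<" and ">" and records, for each word, its character
offsets in the original markup, so a highlighter could in principle mark up
the original document rather than the stripped text.

```ruby
# Hypothetical token type (not a Ferret class): the token text plus its
# character offsets into the ORIGINAL markup string.
MarkupToken = Struct.new(:text, :start_offset, :end_offset)

# Tokenize the text content of an XML/HTML string. Tags are skipped, but
# offsets are still measured against the unmodified source string.
# Assumption: a tag is everything from "<" to the next ">".
def tokenize_markup(source)
  tokens = []
  pos = 0
  while pos < source.length
    if source[pos] == "<"
      # Skip the tag: jump just past the closing ">" (or to the end of
      # the string if the tag is never closed).
      close = source.index(">", pos)
      pos = close ? close + 1 : source.length
    elsif (m = /\G[[:alnum:]]+/.match(source, pos))
      # \G anchors the match at pos, so this is a run of word characters
      # starting exactly at the current position.
      tokens << MarkupToken.new(m[0], pos, pos + m[0].length)
      pos += m[0].length
    else
      # Whitespace or punctuation: not indexed, just advance.
      pos += 1
    end
  end
  tokens
end

html = "<p>Hello <b>world</b></p>"
tokenize_markup(html).each do |t|
  # The offsets point back into the original markup.
  puts "#{t.text}: #{html[t.start_offset...t.end_offset]}"
end
```

This deliberately ignores comments, CDATA sections, entities such as &amp;,
attribute values that contain ">", and the contents of script/style elements,
all of which a real implementation would have to handle.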

Regards,
John
aka sd.codewarrior