Hi,

A large share of the sites I crawl have no meaningful text inside
<title></title>. In contrast, the <h1></h1> often has much more
detailed information about the content of the page.

How can I get the content of (only) the first <h1></h1> element into
a separate field, called e.g. "alttitle", in order to index it later?
I've given up trying to understand HtmlParser.java.
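
To make the goal concrete, here is a minimal sketch of the extraction I
have in mind, written in plain Java against the standard org.w3c.dom
API. It does not use any Nutch API. The class name FirstH1 and the
method firstH1Text are made up for illustration, and the sample input
is well-formed XHTML only so that the stock XML parser can read it. How
to wire something like this into a parse plugin and store the result
under "alttitle" is exactly the part I am missing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class FirstH1 {

    // Depth-first walk: returns the trimmed text of the first <h1>
    // in document order, or null if the tree contains no <h1>.
    public static String firstH1Text(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "h1".equalsIgnoreCase(node.getNodeName())) {
            // getTextContent() also collects text of nested tags like <b>.
            return node.getTextContent().trim();
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = firstH1Text(children.item(i));
            if (found != null) {
                return found; // stop at the first match
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Sample page (an assumption for the demo): useless title, two h1s.
        String xhtml = "<html><head><title>x</title></head><body>"
                + "<h1>First <b>heading</b></h1><h1>Second</h1>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(firstH1Text(doc)); // prints "First heading"
    }
}
```

The walk stops at the first match, so any later <h1> elements are
ignored, which is the "only the first" behaviour I am after.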

For the indexing part, I know how to create a plugin (e.g. based on
index-more). My big problem is writing a custom parse plugin or
modifying the HtmlParser. Unfortunately, all the previous hints I found
on this mailing list assume that I am a Java expert. I think I need
somewhat more detailed help ;-)

Thanks for any help,
Felix.
