Fuad Efendi wrote:
> Andrzej,
>
> I am trying to restore a human-oriented web-site tree using anchor text! As a
> sample, a page with anchor text "Motherboards" has many linked pages with
> concrete motherboards, etc.; we can group information in many cases.
> Anchor text is the true subject of the page, but only within the same domain. BTW,

Well, as your original observation points out, this is not always the
case - but this is more a topic for a philosophical debate about what is
the truth...

> some pages have <META name="keywords" content="...">, and Nutch doesn't
> handle it.

Nutch does handle META tags up to a point, i.e. they are correctly
processed in parse-html, and then passed to all HtmlParseFilters - and
it's up to you what you want to do with them; you can put them into
parseData.metadata, you can later on index them, etc... but by default
Nutch doesn't process them any further.
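
To illustrate the kind of processing such a filter might do, here is a minimal
sketch that pulls the keywords out of a parsed DOM. It deliberately uses only
the JDK's org.w3c.dom classes and not Nutch's plugin interfaces, so the class
and method names (MetaKeywordsSketch, extractKeywords, parse) are hypothetical.
Wiring it into an HtmlParseFilter and storing the result in the parse metadata
is left out, and the demo input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects <meta name="keywords"> values.
class MetaKeywordsSketch {

    // Walk the DOM depth-first and return every comma-separated keyword found.
    static List<String> extractKeywords(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Tag name and the "keywords" value are matched case-insensitively.
            if ("meta".equalsIgnoreCase(el.getNodeName())
                    && "keywords".equalsIgnoreCase(el.getAttribute("name"))) {
                for (String kw : el.getAttribute("content").split(",")) {
                    String trimmed = kw.trim();
                    if (!trimmed.isEmpty()) {
                        out.add(trimmed);
                    }
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    // Demo-only parser: assumes the input is well-formed XHTML.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head>"
                + "<META name=\"keywords\" content=\"motherboards, ATX , socket 775\"/>"
                + "</head><body/></html>");
        System.out.println(extractKeywords(doc)); // prints [motherboards, ATX, socket 775]
    }
}
```

In a real plugin the returned list would presumably be joined and stored under
a metadata key of your choosing, so that an indexing step can pick it up later.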
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com