Vinsil wrote:
Thanks for this sweet workaround and all my apologies for the delay.

You can use a workaround: prepare necessary metadata in a HtmlParseFilter,
which has access to the full DOM tree, and put it into ParseData.metadata.
Using the surrounding page's metadata to pass data about the outlinks sounds
like a *very* nice workaround to me.
Would adding one metadata per Outlink make sense?

I don't know your requirements - it's up to you to decide what you want to achieve.

These metadata could be removed in ScoringFilter.passScoreAfterParsing (I
guess...). Their number should also be limited using
db.max.outlinks.per.page.
Wouldn't there be ugly consequences of adding "so many" metadata even
temporarily?

Well, if you add kilobytes of metadata per CrawlDatum, then yes, it will considerably slow down the processing, because of the increased amount of data to transfer and process. Other than that - no.

... HtmlParseFilter which has access to the full DOM tree
Is it through the DocumentFragment object that is passed to
HtmlParseFilter.parse?

Yes.
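
For illustration only, here is a rough sketch of the DOM walk such a filter
could do. It is not Nutch code: it uses only JDK classes, a plain Map stands in
for the ParseData metadata, the key prefix "outlink.rel." and the class and
method names are made up for this example, and the cap is passed as a
parameter instead of being read from db.max.outlinks.per.page. Inside a real
filter the root node would be the DocumentFragment mentioned above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OutlinkMetadataSketch {

    // Hypothetical key prefix for per-outlink entries; not a Nutch constant.
    public static final String PREFIX = "outlink.rel.";

    /**
     * Walks the DOM below root and records one entry per <a href=...>,
     * stopping once maxOutlinks entries have been collected.
     * The returned Map is a stand-in for the page's parse metadata.
     */
    public static Map<String, String> collectAnchorMetadata(Node root, int maxOutlinks) {
        Map<String, String> meta = new LinkedHashMap<>();
        walk(root, meta, maxOutlinks);
        return meta;
    }

    private static void walk(Node node, Map<String, String> meta, int max) {
        if (meta.size() >= max) {
            return; // cap reached: keep the amount of metadata bounded
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            Element a = (Element) node;
            String href = a.getAttribute("href"); // "" when the attribute is absent
            if (!href.isEmpty()) {
                // Store whatever is needed about this outlink; here just "rel".
                meta.put(PREFIX + href, a.getAttribute("rel"));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, meta, max);
        }
    }

    /** Helper for the demo: builds a DOM from a small well-formed snippet. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<div><a href='http://a.example/' rel='nofollow'>A</a>"
                + "<p><a href='http://b.example/'>B</a></p>"
                + "<a href='http://c.example/'>C</a></div>");
        // With a cap of 2, the third anchor is not recorded.
        System.out.println(collectAnchorMetadata(doc, 2));
    }
}
```

Keeping the values short (a single attribute per link here) is what keeps the
extra data per CrawlDatum small.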

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

