Hi,
I want to perform a topic-specific crawl with Nutch.
My idea is to have a classifier for each page, e.g. a language identifier
(for example, I only want German pages).
The crawl depth is 5, which should be enough.
The first page to classify is a seed page, which may not be German.
Normally such a page and its outlinks would be skipped. BUT what I want
is to go a few steps further (3, for instance) and try to classify the
outlinks. To do that, I have to store additional information with the
outlinks of the current page, maybe a counter of crawl steps that did
not match the classification. An example:
- The first page is not German: skip the page but process its outlinks,
and mark the outlink(s) with counter +1
- An outlink is not German: skip the page but process its outlinks, and
mark the outlink(s) with counter +1 (still within range of both the
crawl depth and the classification steps)
- An outlink is German: process the page and process its outlinks
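
To make the counter logic in the example above concrete, here is a rough
sketch in plain Python. It does not use any Nutch API; it only simulates
the crawl over an in-memory link graph. The names focused_crawl,
get_outlinks, is_match and max_misses are made up for illustration, and
resetting the counter after a matching page is an assumption on my side.

```python
from collections import deque

MAX_DEPTH = 5   # crawl depth mentioned above
MAX_MISSES = 3  # how many non-matching steps are tolerated ("3, for instance")


def focused_crawl(seeds, get_outlinks, is_match,
                  max_depth=MAX_DEPTH, max_misses=MAX_MISSES):
    """Return the URLs accepted by the classifier, in crawl order.

    get_outlinks(url) -> iterable of URLs (hypothetical callback)
    is_match(url)     -> bool, e.g. "page is German" (hypothetical callback)
    """
    # Each queue entry carries the extra information stored with an outlink:
    # (url, depth, number of non-matching steps on the path so far).
    queue = deque((url, 1, 0) for url in seeds)
    seen = set(seeds)
    accepted = []

    while queue:
        url, depth, misses = queue.popleft()

        if is_match(url):
            accepted.append(url)   # process the page
            next_misses = 0        # assumption: a match resets the counter
        else:
            next_misses = misses + 1   # skip the page, counter +1
            if next_misses > max_misses:
                continue           # out of range: drop the outlinks too

        if depth >= max_depth:
            continue               # crawl depth reached, do not go deeper

        for link in get_outlinks(url):
            # Simplification: a URL is queued only once, with the counter
            # of the first path that reached it.
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1, next_misses))

    return accepted
```

With a non-German seed, pages up to 3 steps away are still classified; if
the page at step 3 is not German either, its outlinks are dropped.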
I hope this makes my idea a little clearer.
Best regards,
Armin