Re: store additional information from page at outlinks - topic specific crawl

Armin Nagel Thu, 10 May 2012 01:36:02 -0700

Hi Markus,

Thanks for your reply, but that is not what I want.

Why store data in Solr that I do not need? I do not want to use Solr. My goal is to crawl terabytes of data, store it in HBase or another store, and do some processing on it, so this unneeded data causes pain. I have to filter the pages.


I'm still searching for a solution.
I figured out that outlinks are written during the parse step, right?
I added a property for my counter to the Outlink class.
Its write method is modified accordingly.

Just one thing is not clear: in the merge step, outlinks must be inserted into the crawldb, a new CrawlDatum must be created (most fields empty), and I want to store the counter in the metadata so I can handle it in the fetcher.


Any suggestions on how to do that?
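
To make the question concrete, here is a minimal sketch of carrying such a counter in per-URL metadata. It is not Nutch code: a plain `Map<String, String>` stands in for the metadata of a newly created CrawlDatum, and the class name `MissCounterMeta`, the key `offtopic.count`, and the rule "keep the smallest counter when a URL is reached from several parents" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain map stands in for the crawl datum's metadata. */
final class MissCounterMeta {
    // Hypothetical metadata key, not a name defined by Nutch.
    static final String KEY = "offtopic.count";

    /** Metadata for a freshly created entry; all other fields stay empty. */
    static Map<String, String> forNewOutlink(int counter) {
        Map<String, String> meta = new HashMap<>();
        meta.put(KEY, Integer.toString(counter));
        return meta;
    }

    /** Reads the counter back, e.g. in the fetcher; missing or bad values count as 0. */
    static int read(Map<String, String> meta) {
        String raw = meta.get(KEY);
        if (raw == null) {
            return 0;
        }
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /**
     * Merges metadata for a URL seen more than once.
     * Assumption: the smallest counter wins, so one on-topic path to the URL is enough.
     */
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> incoming) {
        Map<String, String> merged = new HashMap<>(existing);
        if (incoming.containsKey(KEY)) {
            int candidate = read(incoming);
            if (!existing.containsKey(KEY) || candidate < read(existing)) {
                merged.put(KEY, Integer.toString(candidate));
            }
        }
        return merged;
    }
}
```

How this maps onto the actual merge step and the CrawlDatum metadata in Nutch is exactly the open question above; the sketch only shows the write, read, and merge behaviour being asked for.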

Best regards,

Armin

On 09.05.2012 21:15, Markus Jelsma wrote:

Hi.

On Wed, 09 May 2012 09:56:37 +0200, Armin Nagel
<[email protected]> wrote:

Hi,

I want to perform a topic-specific crawl with Nutch.
My idea is to have a classifier for each page, e.g. a language
identifier (for example, I only want German pages).
The crawl depth is 5, which should be enough.
The first page to classify is a seed page. Maybe this page is not German.
Usually the page and its outlinks would be skipped. BUT what I want
is to go 3 (for instance) steps further and try to classify the
outlinks. To do that, I have to attach additional information to the
outlinks of the current page, maybe a counter for crawl steps not
matching the classification. An example:

- first page is not German, skip page but process outlinks, mark
outlink(s) with counter +1
- outlink is not German, skip page but process outlinks, mark
outlink(s) with counter +1 (still in range of crawl depth and classify
steps)
- outlink is German, process page, process out
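
The steps above can be sketched as a small decision helper. This is a minimal illustration and not part of Nutch: the class `OffTopicBudget` and its method names are hypothetical, the limits (3 extra steps, depth 5) are the values from the message, and resetting the counter to zero on a matching page is an assumption the message does not state.

```java
/** Hypothetical helper: decides how far to follow links through off-topic pages. */
final class OffTopicBudget {
    private final int maxMisses; // consecutive non-matching pages tolerated, e.g. 3
    private final int maxDepth;  // overall crawl depth limit, e.g. 5

    OffTopicBudget(int maxMisses, int maxDepth) {
        this.maxMisses = maxMisses;
        this.maxDepth = maxDepth;
    }

    /** Counter to attach to the outlinks of a page that arrived with inheritedCounter. */
    int counterForOutlinks(int inheritedCounter, boolean pageOnTopic) {
        // Assumption: a matching (e.g. German) page resets the counter to zero.
        return pageOnTopic ? 0 : inheritedCounter + 1;
    }

    /** True if outlinks carrying outlinkCounter, found at the given depth, are still queued. */
    boolean followOutlinks(int outlinkCounter, int depth) {
        return outlinkCounter <= maxMisses && depth < maxDepth;
    }

    /** A non-matching page is skipped (not stored) even when its outlinks are followed. */
    boolean storePage(boolean pageOnTopic) {
        return pageOnTopic;
    }
}
```

With `new OffTopicBudget(3, 5)`, a non-German seed page gives its outlinks a counter of 1, a second non-German page raises it to 2, and a German page is stored and passes a counter of 0 on to its own outlinks.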


You're processing outlinks anyway, so the simplest suggestion is to crawl
everything. Enable Nutch's language-identifier, which is an indexing
filter (maybe it should also be a parse filter for a topical crawler),
index all pages into Solr, and facet on the lang field. You'll have your
counts then.


I hope this makes my idea a little clearer.

Best regards,

Armin
