Hi Markus,
Thanks for your reply, but that is not what I want.
Why store data in Solr that I do not need? I do not want to use Solr.
My goal is to crawl terabytes of data, store it in HBase or another
store, and do some processing on it, so this unneeded data causes pain.
I have to filter the pages.
I'm still searching for a solution.
I figured out that outlinks are written at the parse step, right?
I added a property for my counter to the Outlink class and modified
its write method accordingly.
Just one thing is not clear: in the merge step, outlinks must be
inserted into the crawldb, so a new CrawlDatum must be created (most
fields empty), and I want to store the counter in its metadata so that
I can handle it in the fetcher.
Any suggestions on how to do that?
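To make clear what I mean by the merge step, here is a rough sketch in
plain Java. It does not use any Nutch classes: DbEntry is only a
stand-in for a crawldb entry, and the key "classifier.misses", the
method names and the rule of keeping the smallest counter for a URL
that is reached twice are all my own assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of carrying the counter in crawldb metadata; not Nutch API. */
class CrawlDbMergeSketch {

    // Hypothetical metadata key, chosen only for this illustration.
    static final String MISS_KEY = "classifier.misses";

    /** Stand-in for a crawldb entry: only the metadata map, all other fields left out. */
    static final class DbEntry {
        final Map<String, String> meta = new HashMap<>();
    }

    /** Inserts or updates the entry for an outlink URL and stores the counter in its metadata. */
    static void mergeOutlink(Map<String, DbEntry> crawlDb, String url, int misses) {
        DbEntry entry = crawlDb.get(url);
        if (entry == null) {
            // New URL: create a mostly empty entry that only carries the counter.
            entry = new DbEntry();
            entry.meta.put(MISS_KEY, Integer.toString(misses));
            crawlDb.put(url, entry);
            return;
        }
        // Assumption: if the URL is already known, keep the smallest counter seen so far.
        int known = readMisses(entry);
        entry.meta.put(MISS_KEY, Integer.toString(Math.min(known, misses)));
    }

    /** Reads the counter back, e.g. before fetching; a missing value counts as 0. */
    static int readMisses(DbEntry entry) {
        String value = entry.meta.get(MISS_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }
}
```

The fetcher side would then only need something like readMisses to
decide whether the URL is still worth fetching.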
Best regards,
Armin
On 09.05.2012 21:15, Markus Jelsma wrote:
Hi.
On Wed, 09 May 2012 09:56:37 +0200, Armin Nagel
<[email protected]> wrote:
Hi,
I want to perform a topic-specific crawl with Nutch.
My idea is to have a classifier for pages, e.g. a language identifier
(for example, I only want German pages).
The crawl depth is 5, which should be enough.
The first page to classify is a seed page. Maybe this page is not
German. Usually the page and its outlinks would be skipped. BUT what I
want is to go 3 (for instance) steps further and try to classify the
outlinks. To do that, I have to store additional information with the
outlinks of the current page, maybe a counter for the crawl steps that
did not match the classification. An example:
- 1st page is not German: skip the page but process its outlinks, mark
outlink(s) with counter +1
- outlink is not German: skip the page but process its outlinks, mark
outlink(s) with counter +1 (still within range of both crawl depth and
classification steps)
- outlink is German: process page, process out
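As a rough sketch, the rule in the example could look like this in
plain Java. It uses no Nutch classes, and the names (decide, misses,
maxMisses) are made up for illustration. Two points are assumptions
rather than settled: a matching page resets the counter for its
outlinks, and outlinks are still followed when the counter equals
maxMisses.

```java
/** Plain-Java sketch of the skip-counter rule; hypothetical names, not Nutch API. */
class MissCounterSketch {

    /** Outcome for one classified page. */
    static final class Decision {
        final boolean keepPage;        // process/store the page itself
        final boolean followOutlinks;  // pass the outlinks on to the next crawl step
        final int outlinkMisses;       // counter value to attach to each outlink

        Decision(boolean keepPage, boolean followOutlinks, int outlinkMisses) {
            this.keepPage = keepPage;
            this.followOutlinks = followOutlinks;
            this.outlinkMisses = outlinkMisses;
        }
    }

    /**
     * @param matches   result of the classifier for this page (e.g. "is German")
     * @param misses    counter inherited from the page that linked here (0 for seeds)
     * @param maxMisses how many non-matching steps in a row are tolerated
     */
    static Decision decide(boolean matches, int misses, int maxMisses) {
        if (matches) {
            // Assumption: a matching page resets the counter for its outlinks.
            return new Decision(true, true, 0);
        }
        // Non-matching page: skip it, but follow its outlinks while still in range.
        int next = misses + 1;
        return new Decision(false, next <= maxMisses, next);
    }
}
```

With maxMisses = 3, a non-German seed gives its outlinks counter 1, and
a chain of non-German pages is given up once the counter would reach 4.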
You're processing outlinks anyway, so the simplest suggestion is to
crawl everything. Enable Nutch's language-identifier, which is an
indexing filter (maybe it should also be a parse filter for a topical
crawler), index all pages into Solr and facet on the lang field. You'll
have your counts then.
I hope this makes my idea a little clearer.
Best regards,
Armin