Hi all,

I found a solution for storing metadata on outlinks.
The metadata is attached to the CrawlDatum, so the fetcher can read the information stored there.

The solution is to implement a custom scoring filter and override its distributeScoreToOutlinks method.

In this method it is possible to attach metadata to all outlinks, something like this:

// Get some metadata from the current page.
String lang = parseData.getParseMeta().get("language");
for (Entry<Text, CrawlDatum> target : targets) {
    // Note: setMetaData() replaces any existing metadata on the datum;
    // reuse target.getValue().getMetaData() instead if you need to keep it.
    MapWritable metaData = new MapWritable();
    metaData.put(new Text("parentLanguage"), new Text(lang));
    target.getValue().setMetaData(metaData);
}

Not the best code, but the idea should be clear.
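For context, the bookkeeping that distributeScoreToOutlinks does here can be sketched independently of Nutch. This self-contained snippet uses a plain java.util.Map as a stand-in for Nutch's MapWritable/CrawlDatum, so the class and method names below are illustrative only, not Nutch API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutlinkMetaDemo {

    // Stand-in for Nutch's CrawlDatum: only the metadata map matters here.
    static class FakeDatum {
        final Map<String, String> metaData = new HashMap<>();
    }

    // Copy a value from the parent page's parse metadata onto every outlink
    // datum, mirroring what the distributeScoreToOutlinks snippet above does
    // with MapWritable entries.
    static void distributeToOutlinks(Map<String, String> parentParseMeta,
                                     List<FakeDatum> outlinkDatums) {
        String lang = parentParseMeta.get("language");
        for (FakeDatum target : outlinkDatums) {
            target.metaData.put("parentLanguage", lang);
        }
    }
}
```

In the real filter the same copy happens once per parsed page, so every outlink's CrawlDatum carries the parent's language when it is later read as unfetched.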

Ciao Armin



On 10.05.2012 15:39, Markus Jelsma wrote:
Hi Armin,

Please reply to the list :)

thanks

On Thursday 10 May 2012 14:28:54 you wrote:
Hey Markus,

OK, with your hint it is possible to store metadata from a ParseFilter
in the CrawlDB for the current page. What I need is to store the
metadata on the outlinks, so that when a new crawl step begins and an
outlink is read as unfetched, I get this metadata in its CrawlDatum.

Is this possible to do?

I need parent-page information at its children.

Ciao Armin

On 10.05.2012 10:42, Markus Jelsma wrote:
Hi,

You can store a counter in a CrawlDatum's metadata field. If it is
added via a ParseFilter, you can map the value of that field to the
CrawlDatum's metadata via configuration. In any case, you need a
ParseFilter. Perhaps you can modify the LangIdentifier IndexFilter
into a ParseFilter. That should work.
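If I remember correctly, the mapping mentioned above is done in Nutch 1.x via the db.parsemeta.to.crawldb property (check conf/nutch-default.xml for your version), which copies named parse-metadata keys into the CrawlDatum's metadata. A sketch for nutch-site.xml, assuming a ParseFilter that sets a "language" key:

```xml
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>language</value>
  <description>Comma-separated list of parse metadata keys to copy
  into the CrawlDatum metadata (here the "language" key set by a
  ParseFilter).</description>
</property>
```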

Cheers

On Thu, 10 May 2012 10:35:30 +0200, Armin Nagel

<[email protected]>  wrote:
Hi Markus,

thanks for your reply, but that is not what I want.
Why store data in Solr that I do not need? I do not want to use Solr.
My goal is to crawl terabytes of data, store them in HBase or another
store, and do some processing on them, so this unneeded data causes pain.
I have to filter the pages.

I'm still searching for a solution.
I figured out that outlinks are written at the parse step, right?
I added a property for my counter to the Outlink class and also
modified its write method. Just one thing is not clear: in the merge
step, the outlinks must be inserted into the CrawlDB and a new
CrawlDatum must be created (most fields empty), and I want to store
the counter in its metadata so I can handle it in the fetcher.

Any suggestions to do that?

Best regards,

Armin

On 09.05.2012 21:15, Markus Jelsma wrote:
Hi.

On Wed, 09 May 2012 09:56:37 +0200, Armin Nagel

<[email protected]>  wrote:
Hi,

I want to perform a topic-specific crawl with Nutch.
My idea is to have a classifier per page, e.g. a language
identifier (for example, I only want German pages).
A crawl depth of 5 should be enough.
The first page to classify is a seed page. Maybe this page is not German.
Usually the page and its outlinks would be skipped. BUT what I want
is to go 3 (for instance) steps further and try to classify the
outlinks. To do that, I have to store additional information on the
outlinks of the current page, maybe a counter for crawl steps not
matching the classification. An example:

- page 1 is not German: skip the page but process its outlinks, marking
each outlink with counter +1
- an outlink is not German: skip the page but process its outlinks,
marking each outlink with counter +1 (still within the crawl-depth and
classification-step range)
- an outlink is German: process the page, process its outlinks
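The counter rule sketched in those three bullets can be captured in a few lines of plain Java. This is a self-contained illustration with made-up names (MAX_MISS_STEPS, counterForOutlink are not Nutch API): a matching page resets the counter, a non-matching page increments it, and a link is dropped once the counter leaves the allowed range:

```java
public class TopicalCrawlCounter {

    // Maximum consecutive non-matching (e.g. non-German) steps to tolerate;
    // the value 3 comes from the example above.
    static final int MAX_MISS_STEPS = 3;

    /**
     * Returns the counter to store on an outlink, or -1 if the outlink
     * should be dropped entirely.
     *
     * @param parentMisses  counter inherited from the parent page
     * @param parentMatches whether the parent page matched the classifier
     */
    static int counterForOutlink(int parentMisses, boolean parentMatches) {
        if (parentMatches) {
            return 0;                   // matching page: reset the counter
        }
        int misses = parentMisses + 1;  // non-matching page: count the miss
        return misses > MAX_MISS_STEPS ? -1 : misses;  // drop when out of range
    }
}
```

Storing this integer in each outlink's CrawlDatum metadata is exactly the per-outlink write that the thread is about.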

You're processing outlinks anyway, so the simplest suggestion is to crawl
everything. Enable Nutch's language-identifier, which is an indexing
filter (maybe it should also be a parse filter for a topical crawler),
index all pages into Solr, and facet on the lang field. You'll have your
counts then.

Hope my idea becomes a little more clear.

Best regards,

Armin
