Hi,
In some cases, when you crawl a website, you already know that many of its
page URLs share a similar structure.

For example, on IMDb, artist pages have the following link structure:
http://www.imdb.com/name/nm1/
http://www.imdb.com/name/nm2/
http://www.imdb.com/name/nm6499112/

How about allowing URLs to be added based on generators?
For example, you would define this in the URL file:
http://www.imdb.com/name/nm{{[1-6499112]}}/

Here {{ <simple-regex> }} marks the place where a number/letter generator
goes, so that all the generated URLs are injected into Nutch.
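
To make the idea concrete, here is a rough sketch of the expansion step in
Python. It is not Nutch code and not a proposed implementation: the function
name expand_url_template is made up for illustration, and it assumes the
placeholder only ever holds an integer range such as [1-3] (letter generators
are left out). URLs are produced lazily, so a range of several million entries
is never held in memory at once.

```python
import re
from typing import Iterator

# Assumed placeholder syntax, taken from the example above: {{[START-END]}}
# with non-negative integer bounds. Letter ranges are not handled here.
PLACEHOLDER = re.compile(r"\{\{\s*\[(\d+)-(\d+)\]\s*\}\}")


def expand_url_template(template: str) -> Iterator[str]:
    """Lazily yield every URL described by a template line.

    Hypothetical helper: a line without a placeholder is yielded unchanged,
    and a line with several placeholders yields every combination.
    """
    match = PLACEHOLDER.search(template)
    if match is None:
        # Plain URL, nothing to expand.
        yield template
        return

    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"empty range in template: {template!r}")

    prefix = template[:match.start()]
    rest = template[match.end():]
    for number in range(start, end + 1):
        # Recurse so that any later placeholders in the line are expanded too.
        for suffix in expand_url_template(rest):
            yield f"{prefix}{number}{suffix}"


if __name__ == "__main__":
    for url in expand_url_template("http://www.imdb.com/name/nm{{[1-3]}}/"):
        print(url)
    # http://www.imdb.com/name/nm1/
    # http://www.imdb.com/name/nm2/
    # http://www.imdb.com/name/nm3/
```

In this sketch each line of the URL file would be passed through the function
before injection, and lines without a placeholder would behave exactly as they
do today.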

I could work on that if people are interested.

Regards,
Diaa
