This is not exactly the same thing, but I'm currently looking for a way
to inject new URLs at run time.
My idea is to detect interesting new URLs in a custom parser / HTML
plugin and inject them directly into the seed list (without having to
restart Nutch).



2014-05-16 9:45 GMT+02:00 Diaa Abdallah <[email protected]>:

> Hi,
> In some cases when you crawl a webpage you already know many page urls
> that have a similar structure.
>
> For example, on IMDb, artist pages have the following link
> structure:
> http://www.imdb.com/name/nm1/
> http://www.imdb.com/name/nm2/
> http://www.imdb.com/name/nm6499112/
>
> How about allowing the addition of URLs based on generators?
> For example you would define in the url file:
> http://www.imdb.com/name/nm{{[1-6499112]}}
>
> where {{ <simple-regex> }} is the place to put a number/letter generator
>
> So that all these urls are injected into nutch?
>
> I could work on that if people are interested.
>
> Regards,
> Diaa
>
>
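
To make the quoted proposal concrete, here is a minimal sketch of how a
seed line with a {{[start-end]}} placeholder could be expanded before
injection. It is only an illustration in Python, not Nutch code: the
function name expand_seed and the exact placeholder grammar are
assumptions, and only numeric ranges are handled (the letter generators
mentioned in the proposal are left out).

```python
import re
from typing import Iterator

# Assumed placeholder grammar, modelled on the proposal: {{[start-end]}}
# with non-negative integer bounds. Letter ranges are not covered here.
RANGE = re.compile(r"\{\{\[(\d+)-(\d+)\]\}\}")


def expand_seed(template: str) -> Iterator[str]:
    """Lazily yield every URL described by a seed template.

    expand_seed is a hypothetical helper, not part of Nutch. A template
    without a placeholder is yielded unchanged.
    """
    match = RANGE.search(template)
    if match is None:
        yield template
        return

    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"empty range in seed template: {template!r}")

    # Text before the first placeholder contains no placeholder itself;
    # the text after it may contain more, so it is expanded recursively.
    prefix = template[:match.start()]
    suffix = template[match.end():]
    for n in range(start, end + 1):
        for tail in expand_seed(suffix):
            yield f"{prefix}{n}{tail}"


if __name__ == "__main__":
    for url in expand_seed("http://www.imdb.com/name/nm{{[1-3]}}/"):
        print(url)
```

Because this is a generator, a range as large as [1-6499112] is never
held in memory at once; the URLs could be streamed into a seed file and
injected from there.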


-- 
Frédéric Passaniti
