It's not exactly the same way to implement it, but I'm currently looking for a way to inject new URLs at run time. My idea is to detect interesting new URLs in a custom parser / HTML plugin and inject them directly into the seed list (without having to restart Nutch).
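For the generator syntax proposed in the quoted message below, here is a minimal sketch of how a template could be expanded into plain URLs before injection. It is written in Python and is not part of Nutch. The function name `expand_url_template` is hypothetical, and it assumes the `{{[start-end]}}` form only ever holds an inclusive integer range (letter generators are not handled).

```python
import re
from typing import Iterator

# Assumed syntax, taken from the proposal: {{[start-end]}} with integer bounds.
RANGE_PATTERN = re.compile(r"\{\{\[(\d+)-(\d+)\]\}\}")


def expand_url_template(template: str) -> Iterator[str]:
    """Lazily yield every URL described by a template (hypothetical helper)."""
    match = RANGE_PATTERN.search(template)
    if match is None:
        # No generator left: the template is already a plain URL.
        yield template
        return

    start, end = int(match.group(1)), int(match.group(2))
    prefix = template[:match.start()]
    suffix = template[match.end():]

    # Expand the first generator, then recurse into the rest of the string
    # so that templates with several generators also work.
    for number in range(start, end + 1):
        for tail in expand_url_template(suffix):
            yield f"{prefix}{number}{tail}"


if __name__ == "__main__":
    for url in expand_url_template("http://www.imdb.com/name/nm{{[1-3]}}/"):
        print(url)
```

Because the function is a generator, a range as large as the one in the proposal is never held in memory at once. The URLs could be written to a seed file, one per line, and handed to the normal inject step.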
2014-05-16 9:45 GMT+02:00 Diaa Abdallah <[email protected]>:

> Hi,
> In some cases when you crawl a webpage you already know many page urls
> that have a similar structure.
>
> For example in imdb entertainment artists have the following link
> structure:
> http://www.imdb.com/name/nm1/
> http://www.imdb.com/name/nm2/
> http://www.imdb.com/name/nm6499112/
>
> How about allowing the addition of urls based on generators?
> For example you would define in the url file:
> http://www.imdb.com/name/nm{{[1-6499112]}}
>
> where {{ <simple-regex> }} is the place to put a number/letter generator
>
> So that all these urls are injected into nutch?
>
> I could work on that if people are interested.
>
> Regards,
> Diaa
>

--
Frédéric Passaniti

