On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani <[email protected]> wrote:
> Hi,
> Is there any way to perform a urlfilter from level 1-5 and a different one
> from 5 onwards? I need to extract pdf files which will be only after a
> given level (just to experiment).

You can run 2 crawls over the same crawldb using different urlfilter files.
The first crawl would reject pdf files and run only to a depth just before
you expect to discover pdf files. For the later crawl, modify the regex
rule to accept pdf files (a sketch of both rule files is at the bottom of
this mail).

> After that I believe the pdf files will be stored in a compressed binary
> format in the crawl\segment folder. I would like to extract these pdf
> files and store them all in 1 folder. (I guess since Nutch uses MapReduce
> and segments the data, I will need to use the hadoop api present by
> default in the lib folder. I can not find more tutorials on the same
> except allenday
> <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>).

I had a peek at the link you gave, and it looks like that code snippet
should work. It is an old article (from 2008, going by the file name), so
some of its classes may have been replaced by newer ones; a rough sketch of
the same idea is also at the bottom of this mail. If you face any issues,
please feel free to shoot an email to us!

> PJ
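To make the two-pass urlfilter idea a bit more concrete, here is a rough
sketch of the two conf/regex-urlfilter.txt variants (only the pdf-related
rules; keep the rest of your existing filter file as it is). The exact
patterns are an assumption on my side, so adjust them to your setup:

    # first crawl (depth 1-5): reject pdf links, accept everything else
    -(?i)\.pdf$
    +.

    # later crawl: let pdf links through, accept everything else
    +(?i)\.pdf$
    +.

Remember that the rules are tried top to bottom and the first match wins,
so the pdf rule has to come before the catch-all "+.".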
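And here is an untested sketch of the extraction step, in the spirit of the
allenday article: read the Content records out of one segment with the
Hadoop SequenceFile reader and dump the pdf payloads into a local folder.
Class names are from Nutch 1.x / Hadoop 1.x as shipped in the lib folder;
the segment path and output folder below are placeholders you will have to
change:

    import java.io.File;
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    public class PdfDumper {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // fetched pages live in <segment>/content/part-XXXXX/data
        // (placeholder segment name, use your own)
        Path data = new Path("crawl/segments/20130406000000/content/part-00000/data");
        File outDir = new File("pdf-out");
        outDir.mkdirs();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content =
            (Content) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        int n = 0;
        while (reader.next(url, content)) {
          // keep only records whose detected type is pdf
          if ("application/pdf".equals(content.getContentType())) {
            FileOutputStream out =
                new FileOutputStream(new File(outDir, "doc-" + (n++) + ".pdf"));
            out.write(content.getContent());
            out.close();
          }
        }
        reader.close();
      }
    }

If a segment has more than one part-XXXXX directory you would loop over
them, and you may want to derive the output file name from url.toString()
instead of a running counter.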

