On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani <[email protected]> wrote:
> Hi,
> Is there any way to perform a urlfilter from level 1-5 and a different one
> from 5 onwards? I need to extract pdf files which will be only after a
> given level (just to experiment).

You can run 2 crawls over the same crawldb using different urlfilter files.
The first crawl would reject pdf files and run only to a depth just before
you expect to discover pdf files. For the later crawl, modify the regex
rule to accept pdf files (a sketch of both rule files is at the bottom of
this mail).

> After that I believe the pdf files will be stored in a compressed binary
> format in the crawl\segment folder. I would like to extract these pdf
> files and store them all in 1 folder. (I guess since Nutch uses MapReduce
> and segments the data, I will need to use the hadoop api present by
> default in the lib folder. I can not find more tutorials on the same
> except allenday
> <http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html>).

I had a peek at the link you gave, and it looks like that code snippet
should work. It is an old article (from 2008, going by the file name), so
some of its classes may have been replaced by newer ones; a rough sketch of
the same idea is also at the bottom of this mail. If you face any issues,
please feel free to shoot an email to us!

> PJ
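To make the two-pass urlfilter idea a bit more concrete, here is a rough
sketch of the two conf/regex-urlfilter.txt variants (only the pdf-related
rules; keep the rest of your existing filter file as it is). The exact
patterns are an assumption on my side, so adjust them to your setup:

    # first crawl (depth 1-5): reject pdf links, accept everything else
    -(?i)\.pdf$
    +.

    # later crawl: let pdf links through, accept everything else
    +(?i)\.pdf$
    +.

Remember that the rules are tried top to bottom and the first match wins,
so the pdf rule has to come before the catch-all "+.".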
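And here is an untested sketch of the extraction step, in the spirit of the
allenday article: read the Content records out of one segment with the
Hadoop SequenceFile reader and dump the pdf payloads into a local folder.
Class names are from Nutch 1.x / Hadoop 1.x as shipped in the lib folder;
the segment path and output folder below are placeholders you will have to
change:

    import java.io.File;
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    public class PdfDumper {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // fetched pages live in <segment>/content/part-XXXXX/data
        // (placeholder segment name, use your own)
        Path data = new Path("crawl/segments/20130406000000/content/part-00000/data");
        File outDir = new File("pdf-out");
        outDir.mkdirs();

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content =
            (Content) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        int n = 0;
        while (reader.next(url, content)) {
          // keep only records whose detected type is pdf
          if ("application/pdf".equals(content.getContentType())) {
            FileOutputStream out =
                new FileOutputStream(new File(outDir, "doc-" + (n++) + ".pdf"));
            out.write(content.getContent());
            out.close();
          }
        }
        reader.close();
      }
    }

If a segment has more than one part-XXXXX directory you would loop over
them, and you may want to derive the output file name from url.toString()
instead of a running counter.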

