Re: [Nutch-general] Customize the crawl process

Dennis Kubes Fri, 08 Sep 2006 09:33:01 -0700

You would need to modify Fetcher line 433 to use a text output format, 
like this:


job.setOutputFormat(TextOutputFormat.class);

and you would need to modify Fetcher line 307 to collect only the 
information you are looking for, maybe something like this:

        // Emit one (page key, outlink) pair per link found on the page.
        Outlink[] links = parse.getData().getOutlinks();
        for (int i = 0; i < links.length; i++) {
            output.collect(key, links[i]);
        }
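
The numbered "1 2" / "1 3" layout from the original question needs one more step: mapping each URL to an integer. The following is only a rough sketch of that step in plain Java. It is not part of Nutch; the class LinkEdgeFormatter and its methods are hypothetical names made up for illustration. It also assumes all links pass through a single in-memory map, which would not hold across multiple map tasks without extra work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch class): numbers URLs in first-seen order
// and renders each link as a "fromId toId" line.
class LinkEdgeFormatter {
    private final Map<String, Integer> ids = new LinkedHashMap<String, Integer>();

    // Returns the id already assigned to a URL, or assigns the next one (starting at 1).
    int idFor(String url) {
        Integer id = ids.get(url);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(url, id);
        }
        return id;
    }

    // Produces one "fromId toId" line per outlink of the given page.
    List<String> edges(String fromUrl, List<String> toUrls) {
        int from = idFor(fromUrl);
        List<String> lines = new ArrayList<String>();
        for (String to : toUrls) {
            lines.add(from + " " + idFor(to));
        }
        return lines;
    }

    public static void main(String[] args) {
        LinkEdgeFormatter formatter = new LinkEdgeFormatter();
        // Page A links to pages B and C (placeholder URLs).
        List<String> lines = formatter.edges("http://example.com/A",
                Arrays.asList("http://example.com/B", "http://example.com/C"));
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

In the loop above, the URL strings of the key and of each outlink would be passed through something like idFor before being collected, so that the text output contains the id pairs rather than the raw URLs.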

Dennis


NamNH wrote:
> I want to customize the crawling process by modifying the way pages are
> stored. As far as I know, Nutch stores web pages in a binary file, page
> by page. After a link analysis step, Nutch crawls to the destination
> page and downloads it. When pages are stored, I want to write only the
> links to a different text/binary file, with the structure in the example
> below.
> E.g. assuming that page A has links to pages B and C, and we number them
> 1, 2 and 3, I want to write my file as
> 1 2 (Enter for a new line)
> 1 3
> and so on.
> How can I do this with Nutch? Please give me some hints. Thank you very
> much.
>

