Ricardo J. Méndez wrote:
> Hi,
>
> I posted this to nutch-agent as well, but nutch-user seems to be more
> active.
>
> I've got a few questions about customizing the crawling process. I
> tried checking out the Wiki, but many of the pages linked from
> "Becoming a Nutch Developer" are still unwritten, so any pointers you
> can provide would be very welcome.

Which pages are still unwritten?

> While some of the issues were covered on the recent "focused crawls"
> thread, I still have a few questions.
>
> 1) Which types of links does Nutch follow? Only HREFs? If so, I'd like
> it to follow some <link /> references from the page's Header. I know
> that I can obtain the link reference with a Parse plugin, but how should
> I add the reference to the list of items to be crawled?

Nutch gets outlinks from the pages it parses. This happens either during
the fetch process with parsing enabled, or during a separate parse
process (see org.apache.nutch.parse.ParseSegment). The content is parsed
by the plugins configured in parse-plugins.xml in the conf directory.

During the parse, links are created as Outlink objects. These are added
to a ParseData object, which is itself added to a Parse object. When the
Parse object is written out (ParseOutputFormat), the outlinks are saved
as CrawlDatums in the crawl_parse directory under the segment. Then,
during the UpdateDb job (see CrawlDb), this crawl_parse data is merged
into the master Crawl Database.

That is the long answer. The short answer: when you parse, collect the
Outlinks and add them to the ParseData (and so to the Parse object).
They will be added to the CrawlDb automatically when the UpdateDb job is
run, and fetched when the next Fetch job is run.

> 2) Which type of plugin or response from one - if any - determines what
> items go into the database? For instance, can I write a plugin that
> returns "false" if I don't want the database to store a PDF, or a Word
> document? Or maybe a specific page, based on something found by a Parse
> plugin?

You can write url filters and url normalizers (scope outlink) that will
prevent items from going into the CrawlDb. Or, if you are writing your
own parse plugin, simply don't add the link to the Outlinks.

Dennis Kubes

> Thanks in advance,
>
>
>
> Ricardo J. Méndez
> http://ricardo.strangevistas.net/
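
To illustrate the url filter suggestion above, here is a minimal sketch in plain Java. It does not use the Nutch classes: SuffixUrlFilterSketch, filter() and keepAccepted() are hypothetical names made up for this example, and the rejected suffixes are arbitrary. The one thing it borrows is the convention that, as far as I know, Nutch's URLFilter extension point follows: return the URL to keep it, or null to drop it. A real filter would be packaged as a plugin and wired in through the Nutch configuration, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Sketch only: this class and its methods are hypothetical, not Nutch APIs.
 * It mirrors the "return the URL to accept, null to reject" convention.
 */
class SuffixUrlFilterSketch {

    // Suffixes to keep out of the crawl (illustrative choices).
    private static final String[] REJECTED_SUFFIXES = {".pdf", ".doc"};

    /** Returns the URL unchanged if accepted, or null if it should be dropped. */
    static String filter(String url) {
        // Ignore any query string or fragment when checking the suffix.
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        for (String suffix : REJECTED_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    /** Keeps only the candidate links that pass the filter, in order. */
    static List<String> keepAccepted(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            String accepted = filter(candidate);
            if (accepted != null) {
                kept.add(accepted);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://example.com/index.html",
            "http://example.com/report.pdf",
            "http://example.com/feed.xml",
            "http://example.com/notes.DOC?download=1");
        // Print the links that would survive the filter.
        for (String link : keepAccepted(links)) {
            System.out.println(link);
        }
    }
}
```

The same check could sit inside a custom parse plugin instead: run each candidate link through it before creating the Outlink, and skip the ones that come back null.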
