Hello Ivan,

> Where are the logs? I expected to see them in the console output while
> running "hadoop jar nutch.job". Maybe that code is executing on the
> DataNodes?

Yes, these logs should be on the nodes where the tasks have been run.
Search for "hadoop log location"; the exact location may depend on the
Hadoop distribution used.
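
For example (assuming stock Apache Hadoop 1.x defaults; your distribution
may put them elsewhere), the per-attempt task logs typically live under
${HADOOP_LOG_DIR}/userlogs on each node that executed a task:

  ${HADOOP_LOG_DIR}/userlogs/<job-id>/<attempt-id>/stdout
  ${HADOOP_LOG_DIR}/userlogs/<job-id>/<attempt-id>/stderr
  ${HADOOP_LOG_DIR}/userlogs/<job-id>/<attempt-id>/syslog

Log4j output from your plugin should end up in the "syslog" file of the
map attempts (not in the console of the submitting node), and the same
logs are browsable per task through the JobTracker web UI.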

> Please help me understand logging and debugging while running Nutch as
> a hadoop jar.

Again, try a search, e.g. "debugging hadoop". Tom White's book
"Hadoop: The Definitive Guide" also contains a section "Debugging Jobs".

In general, it's much simpler to debug without a cluster, and even
without Hadoop. To debug a parse filter, start with a single document
parsed by
  % bin/nutch parsechecker <url>
If your plugin works, continue testing with more documents/URLs in local mode.
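
For example (the URL is just a placeholder), to see what your filter does
with one page:

  % bin/nutch parsechecker -dumpText http://www.example.com/

parsechecker runs in a single local JVM, so breakpoints in your filter
work and its log output is directly visible. Likewise, when you run whole
jobs from runtime/local, Hadoop runs in local mode and the log4j output
goes to logs/hadoop.log, which is much easier to inspect than per-task
logs scattered over the cluster.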

Sebastian

On 09/17/2013 08:36 AM, Ivan Kozlov wrote:
> Hello all,
> 
> I want to write my own Nutch plugin that extends HTMLParseFilter, but I
> have run into some issues.
> 
> Prereqs:
> 
>    - I have a Hadoop cluster with 5 nodes. Node #1 is the NameNode; nodes
>    #3-#5 are the DataNodes.
>    - I compile Nutch via ant to get nutchXXX.job (my plugin compiles fine,
>    and all changes to nutch-site.xml and plugins.xml are made).
>    - I run the job on the NameNode (#1): hadoop jar nutch.job -params.
> 
> *First issue:*
> 
> I cannot see the logs. My plugin just logs its arguments:
> 
> public ParseResult filter(Content content, ParseResult parseResult,
>     HTMLMetaTags metaTags, DocumentFragment doc) {
>   LOG.info("CleanParseFilterImpl: ");
>   LOG.info("content : " + content);
>   LOG.info("parseResult : " + parseResult);
>   LOG.info("metaTags : " + metaTags);
>   LOG.info("doc : " + doc);
>   return parseResult;
> }
> 
> I've changed the hadoop executable to disable the root logger:
> 
> #HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.root.logger=${HADOOP_ROOT_LOGGER:-INFO,console}"
> 
> and added these loggers to Hadoop's log4j.properties:
> #special logging requirements for some commandline tools
> log4j.logger.org.apache.nutch.crawl.Crawl=ALL,console
> log4j.logger.org.apache.nutch.crawl.Injector=ALL,console
> log4j.logger.org.apache.nutch.crawl.Generator=ALL,console
> log4j.logger.org.apache.nutch.fetcher.Fetcher=ALL,console
> log4j.logger.org.apache.nutch.parse.ParseSegment=ALL,console
> log4j.logger.org.apache.nutch.crawl.CrawlDbReader=ALL,console
> log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=ALL,console
> log4j.logger.org.apache.nutch.crawl.LinkDbReader=ALL,console
> log4j.logger.org.apache.nutch.segment.SegmentReader=ALL,console
> log4j.logger.org.apache.nutch.segment.SegmentMerger=ALL,console
> log4j.logger.org.apache.nutch.crawl.CrawlDb=ALL,console
> log4j.logger.org.apache.nutch.crawl.LinkDb=ALL,console
> log4j.logger.org.apache.nutch.crawl.LinkDbMerger=ALL,console
> log4j.logger.org.apache.nutch.indexer.IndexingJob=ALL,console
> log4j.logger.org.apache.nutch.indexer.solr.SolrIndexer=ALL,console
> log4j.logger.org.apache.nutch.indexer.solr.SolrWriter=ALL,console
> log4j.logger.org.apache.nutch.indexer.solr.SolrDeleteDuplicates=ALL,console
> log4j.logger.org.apache.nutch.indexer.solr.SolrClean=ALL,console
> log4j.logger.org.apache.nutch.scoring.webgraph.WebGraph=ALL,console
> log4j.logger.org.apache.nutch.scoring.webgraph.LinkRank=ALL,console
> log4j.logger.org.apache.nutch.scoring.webgraph.Loops=ALL,console
> log4j.logger.org.apache.nutch.scoring.webgraph.ScoreUpdater=ALL,console
> log4j.logger.org.apache.nutch.parse.ParserChecker=ALL,console
> log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=ALL,console
> log4j.logger.org.apache.nutch.tools.FreeGenerator=ALL,console
> log4j.logger.org.apache.nutch.util.domain.DomainStatistics=ALL,console
> log4j.logger.org.apache.nutch.tools.CrawlDBScanner=ALL,console
> log4j.logger.org.apache.nutch.parse.clean.CleanParseFilterImpl=ALL,console
> log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
> log4j.logger.org.apache.nutch.parse.ParseUtil=ALL,console
> log4j.logger.org.apache.nutch=ALL,console
> 
> From the debugger I can see that
> 
> LoggerFactory.getLogger("org.apache.nutch.parse.ParseUtil").isTraceEnabled()
> and
> LoggerFactory.getLogger("org.apache.nutch.parse.clean.CleanParseFilterImpl").isTraceEnabled()
> 
> are both "true", but I still don't see any logs from my plugin or from
> ParseUtil...
> 
> Where are the logs? I expected to see them in the console output while
> running "hadoop jar nutch.job". Maybe that code is executing on the
> DataNodes?
> 
> *Second issue:*
> 
> I cannot debug the code in the places I want. For example, I can debug
> Crawl.java at
> 
>   fetcher.fetch(segs[0], threads);  // fetch it
>   if (!Fetcher.isParsing(job)) {
>     parseSegment.parse(segs[0]);    // parse it, if needed
>   }
> 
> And I can see that "!Fetcher.isParsing(job)" is "true", so I step into
> parseSegment.parse().
> 
> But I cannot debug the map() method of ParseSegment (where the
> ParseUtil.parse() logic executes). Why can't I debug that? Maybe that code
> is executing on the DataNodes?
> 
> Please help me understand logging and debugging while running Nutch as
> a hadoop jar.
> 
