sebastian-nagel commented on PR #772:
URL: https://github.com/apache/nutch/pull/772#issuecomment-1722472438

   +1
   
   A test with the [pseudo-distributed Hadoop setup](https://github.com/sebastian-nagel/nutch-test-single-node-cluster/) was successful:
   - Nutch tools work properly, no issues
   - as expected, Hadoop puts slf4j-api-1.7.36.jar and 
slf4j-reload4j-1.7.36.jar in the classpath in front of the Nutch job jars
   - consequently, task logs are formatted using the format defined in 
`$HADOOP_HOME/etc/hadoop/log4j.properties`
   - (the good thing) log messages from Nutch classes appear in the task logs, 
e.g.
     ```
     2023-09-17 07:29:21,726 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 33 fetching https://nutch.apache.org/ (queue crawl delay=5000ms)
     ```
   - the log format defined in `$NUTCH_HOME/conf/log4j2.xml` is only applied to 
the logs of the Yarn job client, e.g.
     ```
     2023-09-17 07:29:32,432 INFO fetcher.Fetcher: Fetcher: finished at 2023-09-17 07:29:32, elapsed: 00:00:25
     ```
   - in addition, I've included two PDFs, an XLSX and an EPUB document to test 
the Tika parser: the docs were successfully parsed using Tika 2.3.0. If 
necessary, I can repeat the test for NUTCH-2959.
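
   To double-check which jar wins on the classpath, a small helper along these lines could be run on the cluster. It is only a sketch: the class name `ClasspathOrigin` and the `locate` helper are made up for illustration, and it assumes the SLF4J 1.7.x binding class `org.slf4j.impl.StaticLoggerBinder` is the one of interest. It also only inspects the class loader it runs in, which may differ from the one used inside a Hadoop task.

   ```java
   import java.security.CodeSource;

   // Hypothetical helper, not part of Nutch or Hadoop: reports which jar
   // (or directory) a class would be loaded from on the current classpath.
   public class ClasspathOrigin {

       /** Returns the code source location of a class, or a short marker string. */
       static String locate(String className) {
           try {
               // Load without initializing, so no static initializers are triggered.
               Class<?> cls = Class.forName(className, false,
                       ClasspathOrigin.class.getClassLoader());
               CodeSource src = cls.getProtectionDomain().getCodeSource();
               if (src == null || src.getLocation() == null) {
                   // JDK classes loaded by the bootstrap loader have no code source.
                   return "bootstrap or unknown";
               }
               return src.getLocation().toString();
           } catch (ClassNotFoundException | LinkageError e) {
               return "not found";
           }
       }

       public static void main(String[] args) {
           // Class names chosen as an assumption about what is worth checking.
           String[] names = {
               "org.slf4j.LoggerFactory",
               "org.slf4j.impl.StaticLoggerBinder",
               "org.apache.logging.log4j.LogManager"
           };
           for (String name : names) {
               System.out.println(name + " -> " + locate(name));
           }
       }
   }
   ```

   If the first two entries resolve to the Hadoop-provided slf4j jars rather than to jars bundled in the Nutch job file, that would match the behavior observed above.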
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
