sebastian-nagel commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1722472438
+1

A test with the [pseudo-distributed Hadoop setup](https://github.com/sebastian-nagel/nutch-test-single-node-cluster/) was successful:
- Nutch tools work properly, no issues
- as expected, Hadoop puts slf4j-api-1.7.36.jar and slf4j-reload4j-1.7.36.jar in the classpath in front of the Nutch job jars
- consequently, task logs are formatted using the format defined in `$HADOOP_HOME/etc/hadoop/log4j.properties`
- (the good thing) log messages from Nutch classes appear in the task logs, e.g.
  ```
  2023-09-17 07:29:21,726 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 33 fetching https://nutch.apache.org/ (queue crawl delay=5000ms)
  ```
- the log format defined in `$NUTCH_HOME/conf/log4j2.xml` is only applied to the logs of the YARN job client, e.g.
  ```
  2023-09-17 07:29:32,432 INFO fetcher.Fetcher: Fetcher: finished at 2023-09-17 07:29:32, elapsed: 00:00:25
  ```
- in addition, I've included two PDFs, an XLSX and an ePub document to test the Tika parser: the docs were successfully parsed using Tika 2.3.0
- if necessary, I can repeat the test for NUTCH-2959

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org