[ 
https://issues.apache.org/jira/browse/NUTCH-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079558#comment-18079558
 ] 

ASF GitHub Bot commented on NUTCH-3175:
---------------------------------------

lewismc opened a new pull request, #913:
URL: https://github.com/apache/nutch/pull/913

   PR for [NUTCH-3175](https://issues.apache.org/jira/browse/NUTCH-3175).
   
   My goal was to run each protocol plugin against a real server rather than 
mocks. This PR adds a new Ant target `test-protocol-integration` wired into 
both the top-level build and GitHub Actions CI (Ubuntu only, triggered when 
protocol plugin files change). This mimics what we did previously with index 
plugins.
   
   ## Integration test framework (src/test/)
   
   * AbstractProtocolPluginIT — base class providing 
getHttpStatusCode(CrawlDatum), assertFetchSuccess(), and assertFetchNotFound() 
helpers shared across all protocol ITs.
   * ProtocolPluginIntegrationTest — JUnit 5 interface declaring the 
setUpProtocol / tearDownProtocol / getProtocol / getTestUrl contract; each 
plugin IT implements it.
   
   ## Per-plugin integration tests
   
   * protocol-ftp — FtpProtocolIT — in-process MockFtpServer 3.1.0, no Docker 
required
   * protocol-http — HttpProtocolIT — nginx:alpine via Testcontainers
   * protocol-httpclient — HttpClientProtocolIT — in-process WireMock 3.0.1, no 
Docker required
   * protocol-htmlunit — HtmlUnitProtocolIT — nginx:alpine via Testcontainers
   * protocol-okhttp — OkHttpProtocolIT — nginx:alpine via Testcontainers
   * protocol-selenium — SeleniumProtocolIT — nginx:alpine via Testcontainers
   
   Testcontainers-based tests are annotated 
`@Testcontainers(disabledWithoutDocker = true)` and skip cleanly when Docker is 
unavailable.
   
   ## Build / CI changes
   
   * `build.xml` — new top-level test-protocol-integration target delegates to 
src/plugin/build.xml.
   * `src/plugin/build.xml` — runs each protocol plugin's 
`test-protocol-integration` target sequentially to avoid container resource 
contention.
   * `src/plugin/build-plugin.xml` — adds `test-protocol-integration` target; 
adds testcontainers*.jar to the global plugin test classpath so plugins can 
compile against Testcontainers without declaring it individually.
   * `.github/workflows/master-build.yml` — adds `protocol_plugins` path filter 
and test protocol integration step, gated to `ubuntu-latest` only.
   
   ## Bug fixes in protocol-ftp (found while writing tests)
   
   This part surprised me as admittedly I hadn't ever used `protocol-ftp` 
before. These are production fixes, not test scaffolding:
   
   1. `FtpResponse`: ignored URL port — `ftp.client.connect(addr)` always 
connected to port 21, ignoring the port in the URL. Fixed to use 
`url.getPort()` with fallback to `FTP.DEFAULT_PORT`.
   2. `FtpResponse`: quoted `SYST` reply — RFC 959 allows servers to quote the 
system type (215 "UNIX"). After .substring(4) the client received "UNIX" (with 
literal quotes), causing parser initialization to fail silently with 
`ftp.parser` is `null`. Fixed with explicit quote stripping.
   3. `FtpResponse`: empty directory listing treated as exception — when a 
server returns a 150+226 response with an empty listing for a non-existent 
file, `list.get(0)` threw `IndexOutOfBoundsException`. Fixed by checking 
`list.isEmpty()` and returning 404 instead.
   4. `Ftp`: status code never set on exception — if `FtpResponse` constructor 
threw before `getProtocolOutput` reached the `datum.getMetaData().put(...)` 
call, `PROTOCOL_STATUS_CODE_KEY` was never set, causing a 
`NullPointerException` in callers. Fixed by setting code 500 in the outer catch 
block.
   5. `protocol-ftp/ivy.xml`: commons-net upgraded 1.2.2 → 3.9.0 — commons-net 
1.2.2's `UnixFTPEntryParser` depended on Apache ORO 
(org.apache.oro.text.regex), which is not on the Nutch classpath. At runtime 
this caused a `NoClassDefFoundError` that was silently swallowed by a 
finally/return block, leaving `ftp.parser = null` and every fetch returning 
HTTP 500. Upgrading to 3.9.0 eliminates the ORO dependency.
   
   ## Other fixes
   
   * `conf/log4j2.xml` — renamed internal <Property> elements from 
`hadoop.log.dir/hadoop.log.file` to `nutch.log.dir/nutch.log.file`. Hadoop's 
test harness sets system properties `hadoop.log.dir` and `hadoop.log.file` to 
self-referential values; when log4j2 resolved `${sys:hadoop.log.dir}` inside a 
property of the same name, it detected an infinite interpolation loop and 
emitted repeated `WARN` Infinite loop in property interpolation messages. 
Renaming the log4j2 properties breaks the cycle while preserving the same 
runtime behaviour.
   * `protocol-httpclient/ivy.xml` — adds WireMock 3.0.1 as a test-scoped 
dependency to support `HttpClientProtocolIT`.
   




> Implement integration testing framework for Nutch Protocol plugins using 
> Testcontainers
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-3175
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3175
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, test
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> This task is very similar to NUTCH-3154.
> The scope is to deliver a comprehensive integration testing framework for 
> Nutch Protocol plugins (implementations of Protocol interface) using 
> Testcontainers. This would enable real-world testing against actual 
> containers rather than mocking... or nothing which is what we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to