lewismc opened a new pull request, #913: URL: https://github.com/apache/nutch/pull/913
PR for [NUTCH-3175](https://issues.apache.org/jira/browse/NUTCH-3175). My goal was to run each protocol plugin against a real server rather than mocks. This PR adds a new Ant target `test-protocol-integration` wired into both the top-level build and GitHub Actions CI (Ubuntu only, triggered when protocol plugin files change). This mimics what we did previously with index plugins. ## Integration test framework (src/test/) * AbstractProtocolPluginIT — base class providing getHttpStatusCode(CrawlDatum), assertFetchSuccess(), and assertFetchNotFound() helpers shared across all protocol ITs. * ProtocolPluginIntegrationTest — JUnit 5 interface declaring the setUpProtocol / tearDownProtocol / getProtocol / getTestUrl contract; each plugin IT implements it. ## Per-plugin integration tests * protocol-ftp — FtpProtocolIT — in-process MockFtpServer 3.1.0, no Docker required * protocol-http — HttpProtocolIT — nginx:alpine via Testcontainers * protocol-httpclient — HttpClientProtocolIT — in-process WireMock 3.0.1, no Docker required * protocol-htmlunit — HtmlUnitProtocolIT — nginx:alpine via Testcontainers * protocol-okhttp — OkHttpProtocolIT — nginx:alpine via Testcontainers * protocol-selenium — SeleniumProtocolIT — nginx:alpine via Testcontainers Testcontainers-based tests are annotated `@Testcontainers(disabledWithoutDocker = true)` and skip cleanly when Docker is unavailable. ## Build / CI changes * `build.xml` — new top-level test-protocol-integration target delegates to src/plugin/build.xml. * `src/plugin/build.xml` — runs each protocol plugin's `test-protocol-integration` target sequentially to avoid container resource contention. * `src/plugin/build-plugin.xml` — adds `test-protocol-integration` target; adds testcontainers*.jar to the global plugin test classpath so plugins can compile against Testcontainers without declaring it individually. * `.github/workflows/master-build.yml` — adds `protocol_plugins` path filter and test protocol integration step, gated to `ubuntu-latest` only. ## Bug fixes in protocol-ftp (found while writing tests) This part surprised me as admittedly I hadn't ever used `protocol-ftp` before. These are production fixes, not test scaffolding: 1. `FtpResponse`: ignored URL port — `ftp.client.connect(addr)` always connected to port 21, ignoring the port in the URL. Fixed to use `url.getPort()` with fallback to `FTP.DEFAULT_PORT`. 2. `FtpResponse`: quoted `SYST` reply — RFC 959 allows servers to quote the system type (215 "UNIX"). After .substring(4) the client received "UNIX" (with literal quotes), causing parser initialization to fail silently with `ftp.parser` is `null`. Fixed with explicit quote stripping. 3. `FtpResponse`: empty directory listing treated as exception — when a server returns a 150+226 response with an empty listing for a non-existent file, `list.get(0)` threw `IndexOutOfBoundsException`. Fixed by checking `list.isEmpty()` and returning 404 instead. 4. `Ftp`: status code never set on exception — if `FtpResponse` constructor threw before `getProtocolOutput` reached the `datum.getMetaData().put(...)` call, `PROTOCOL_STATUS_CODE_KEY` was never set, causing a `NullPointerException` in callers. Fixed by setting code 500 in the outer catch block. 5. `protocol-ftp/ivy.xml`: commons-net upgraded 1.2.2 → 3.9.0 — commons-net 1.2.2's `UnixFTPEntryParser` depended on Apache ORO (org.apache.oro.text.regex), which is not on the Nutch classpath. At runtime this caused a `NoClassDefFoundError` that was silently swallowed by a finally/return block, leaving `ftp.parser = null` and every fetch returning HTTP 500. Upgrading to 3.9.0 eliminates the ORO dependency. ## Other fixes * `conf/log4j2.xml` — renamed internal <Property> elements from `hadoop.log.dir/hadoop.log.file` to `nutch.log.dir/nutch.log.file`. Hadoop's test harness sets system properties `hadoop.log.dir` and `hadoop.log.file` to self-referential values; when log4j2 resolved `${sys:hadoop.log.dir}` inside a property of the same name, it detected an infinite interpolation loop and emitted repeated `WARN` Infinite loop in property interpolation messages. Renaming the log4j2 properties breaks the cycle while preserving the same runtime behaviour. * `protocol-httpclient/ivy.xml` — adds WireMock 3.0.1 as a test-scoped dependency to support `HttpClientProtocolIT`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]

