lewismc opened a new pull request, #913:
URL: https://github.com/apache/nutch/pull/913

   PR for [NUTCH-3175](https://issues.apache.org/jira/browse/NUTCH-3175).
   
   My goal was to run each protocol plugin against a real server rather than 
mocks. This PR adds a new Ant target `test-protocol-integration` wired into 
both the top-level build and GitHub Actions CI (Ubuntu only, triggered when 
protocol plugin files change). This mimics what we did previously with index 
plugins.
   
   ## Integration test framework (src/test/)
   
   * AbstractProtocolPluginIT — base class providing 
getHttpStatusCode(CrawlDatum), assertFetchSuccess(), and assertFetchNotFound() 
helpers shared across all protocol ITs.
   * ProtocolPluginIntegrationTest — JUnit 5 interface declaring the 
setUpProtocol / tearDownProtocol / getProtocol / getTestUrl contract; each 
plugin IT implements it.
   
   ## Per-plugin integration tests
   
   * protocol-ftp — FtpProtocolIT — in-process MockFtpServer 3.1.0, no Docker 
required
   * protocol-http — HttpProtocolIT — nginx:alpine via Testcontainers
   * protocol-httpclient — HttpClientProtocolIT — in-process WireMock 3.0.1, no 
Docker required
   * protocol-htmlunit — HtmlUnitProtocolIT — nginx:alpine via Testcontainers
   * protocol-okhttp — OkHttpProtocolIT — nginx:alpine via Testcontainers
   * protocol-selenium — SeleniumProtocolIT — nginx:alpine via Testcontainers
   
   Testcontainers-based tests are annotated 
`@Testcontainers(disabledWithoutDocker = true)` and skip cleanly when Docker is 
unavailable.
   
   ## Build / CI changes
   
   * `build.xml` — new top-level test-protocol-integration target delegates to 
src/plugin/build.xml.
   * `src/plugin/build.xml` — runs each protocol plugin's 
`test-protocol-integration` target sequentially to avoid container resource 
contention.
   * `src/plugin/build-plugin.xml` — adds `test-protocol-integration` target; 
adds testcontainers*.jar to the global plugin test classpath so plugins can 
compile against Testcontainers without declaring it individually.
   * `.github/workflows/master-build.yml` — adds `protocol_plugins` path filter 
and test protocol integration step, gated to `ubuntu-latest` only.
   
   ## Bug fixes in protocol-ftp (found while writing tests)
   
   This part surprised me, as admittedly I had never used `protocol-ftp` 
before. These are production fixes, not test scaffolding:
   
   1. `FtpResponse`: ignored URL port — `ftp.client.connect(addr)` always 
connected to port 21, ignoring the port in the URL. Fixed to use 
`url.getPort()` with fallback to `FTP.DEFAULT_PORT`.
   2. `FtpResponse`: quoted `SYST` reply — RFC 959 allows servers to quote the 
system type (`215 "UNIX"`). After `.substring(4)` the client received `"UNIX"` 
(with literal quotes), so parser initialization failed silently and left 
`ftp.parser` as `null`. Fixed with explicit quote stripping.
   3. `FtpResponse`: empty directory listing treated as exception — when a 
server returns a 150+226 response with an empty listing for a non-existent 
file, `list.get(0)` threw `IndexOutOfBoundsException`. Fixed by checking 
`list.isEmpty()` and returning 404 instead.
   4. `Ftp`: status code never set on exception — if `FtpResponse` constructor 
threw before `getProtocolOutput` reached the `datum.getMetaData().put(...)` 
call, `PROTOCOL_STATUS_CODE_KEY` was never set, causing a 
`NullPointerException` in callers. Fixed by setting code 500 in the outer catch 
block.
   5. `protocol-ftp/ivy.xml`: commons-net upgraded 1.2.2 → 3.9.0 — commons-net 
1.2.2's `UnixFTPEntryParser` depended on Apache ORO 
(`org.apache.oro.text.regex`), which is not on the Nutch classpath. At runtime 
this caused a `NoClassDefFoundError` that was silently swallowed by a 
finally/return block, leaving `ftp.parser = null` and every fetch returning 
HTTP 500. Upgrading to 3.9.0 eliminates the ORO dependency.
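   The failure modes above can be reproduced with the JDK alone. The following 
is a minimal sketch, not the code in this PR: the class `FtpFixSketch`, its 
method names, and the constant `DEFAULT_FTP_PORT` (standing in for 
`FTP.DEFAULT_PORT`) are hypothetical, and `java.net.URI` stands in for the URL 
object used by `FtpResponse`. It illustrates the port fallback, the quote 
stripping, the empty-listing check, and how a `return` inside `finally` can 
discard an in-flight `NoClassDefFoundError`.

```java
import java.net.URI;
import java.util.List;

public class FtpFixSketch {
    // Assumption: stand-in for commons-net's FTP.DEFAULT_PORT.
    static final int DEFAULT_FTP_PORT = 21;

    // Fix 1: getPort() returns -1 when the URL has no explicit port,
    // so fall back to the default instead of always using 21.
    static int effectivePort(URI uri) {
        int port = uri.getPort();
        return port > 0 ? port : DEFAULT_FTP_PORT;
    }

    // Fix 2: drop the "215 " prefix, then strip one pair of surrounding quotes.
    static String systemTypeFromSyst(String replyLine) {
        String type = replyLine.length() > 4 ? replyLine.substring(4).trim() : "";
        if (type.length() >= 2 && type.startsWith("\"") && type.endsWith("\"")) {
            type = type.substring(1, type.length() - 1);
        }
        return type;
    }

    // Fix 3: an empty listing maps to 404 rather than calling list.get(0).
    static int statusForListing(List<String> list) {
        return list.isEmpty() ? 404 : 200;
    }

    // Item 5 mechanism: a return inside finally discards the pending Error,
    // so the caller only ever sees the null parser.
    @SuppressWarnings("finally")
    static String initParserSwallowingErrors() {
        String parser = null;
        try {
            throw new NoClassDefFoundError("org/apache/oro/text/regex/Pattern");
        } finally {
            return parser;
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePort(URI.create("ftp://localhost:2121/a.txt")));
        System.out.println(effectivePort(URI.create("ftp://localhost/a.txt")));
        System.out.println(systemTypeFromSyst("215 \"UNIX\""));
        System.out.println(statusForListing(List.of()));
        System.out.println(initParserSwallowingErrors());
    }
}
```

   The last method is the reason the ORO problem was hard to spot: nothing 
propagates to the caller, so the only visible symptom is the `null` parser.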
   
   ## Other fixes
   
   * `conf/log4j2.xml` — renamed internal <Property> elements from 
`hadoop.log.dir` / `hadoop.log.file` to `nutch.log.dir` / `nutch.log.file`. 
Hadoop's test harness sets the system properties `hadoop.log.dir` and 
`hadoop.log.file` to self-referential values; when log4j2 resolved 
`${sys:hadoop.log.dir}` inside a property of the same name, it detected an 
infinite interpolation loop and emitted repeated `WARN` messages reading 
"Infinite loop in property interpolation". 
Renaming the log4j2 properties breaks the cycle while preserving the same 
runtime behaviour.
   * `protocol-httpclient/ivy.xml` — adds WireMock 3.0.1 as a test-scoped 
dependency to support `HttpClientProtocolIT`.
   


