Some more context on the limitations of the current approach. Nutch server 
is limited to working with Nutch in local mode only. Although a web application 
does exist [0], it was extracted from the codebase some 5 or so years ago and 
hasn't seen any development since then. As part of this discussion I would also 
like to propose retiring and archiving the Nutch webapp repository.
Thanks
lewismc

[0] https://github.com/apache/nutch-webapp

On 2026/02/13 01:13:19 lewis john mcgibbney wrote:
> Hi dev@,
> 
> For a while now I've been thinking that the Nutch REST API (NutchServer,
> JAX-RS/Apache CXF) [0] has become somewhat of a burden. It hasn't seen much
> activity for quite a while and the underlying dependencies are dated. I'd
> personally like to get the community's input on whether the REST API is
> still a feature we wish to continue maintaining (albeit passively). Let me
> provide context below.
> 
> Current state of the REST API
> =======================
> The REST API consists of ~35 Java source files under o.a.n.service.*,
> exposing endpoints for admin operations, job management, configuration,
> seed management, and database reading.
> From what I can tell the following issues exist:
> 
>    1. No authentication or authorization whatsoever. Every endpoint is
>    completely open, including:
>       1. GET /admin/stop -- allows unauthenticated remote server shutdown
>       2. POST /job/create -- allows unrestricted job creation
>       3. PUT /config/{configId}/{propertyId} -- allows unrestricted
>       configuration modification
>       4. No input validation (no Bean Validation annotations) and no CORS
>       policy
>    2. No health or metrics endpoints: no /health or /healthz for
>    liveness/readiness probes (relevant for Docker/K8s deployments), and no
>    /metrics endpoint for Prometheus or similar. The Docker setup
>    (docker/Dockerfile) also has no health checks defined.
>    3. Near-zero test coverage. TestNutchServer.java contains a single test
>    that attempts to start the server and hit the /admin endpoint, but both
>    assertions are commented out (//Assert.assertTrue(...)). There are no
>    endpoint-level tests for any of the other resources (job, config, seed, db,
>    reader).
>    4. Code quality issues:
>       1. Class name typo: ReaderResouce (missing 'r')
>       2. SequenceReader.java has 6 auto-generated TODO stubs (unimplemented
>       methods)
>       3. No OpenAPI/Swagger documentation (something I've wanted to do for
>       a long time)
> 
> The question for the community
> ========================
> Addressing these issues properly would be a substantial amount of work. At
> the same time, it's unclear how widely the REST API is actually being used
> (in production) by Nutch users.
> 
> Therefore I'd like to propose a few items for discussion:
> 
> (A) Invest in hardening the REST API. Add auth (at minimum API key or basic
> auth), input validation, health/metrics endpoints, OpenAPI docs, and proper
> test coverage. This is the most work but preserves the API for users who
> depend on it. Quite honestly, even if we selected this option, I would
> propose we start over with a fresh OpenAPI specification and build it out
> from there.
> (B) Deprecate the REST API. Mark it as deprecated in the current release
> with a notice that it will be removed in a future major version, giving
> users time to migrate to CLI-based workflows or their own orchestration
> layer.
> (C) Remove the REST API. Remove the o.a.n.service package entirely from the
> codebase. This eliminates the security surface and ongoing maintenance
> burden.
> 
> If anyone on the list is actively using the REST API (or knows of
> deployments that do), it would be very helpful to hear about your use case
> and whether the current API meets your needs.
> 
> Thanks,
> lewismc
> 
> P.S. My current leaning is towards option B but I am really keen to read
> other opinions.
> 
> [0]
> https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/service
> -- 
> http://people.apache.org/keys/committer/lewismc
> 
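
To make the "at minimum API key" part of option (A) concrete, here is a rough sketch of the kind of check that would need to sit in front of the open endpoints listed in the quoted message (/admin, /job, /config). It is plain Java with no Nutch or CXF dependencies, and the class name ApiKeyGate, its methods, the protected-prefix list and the /health path are all hypothetical names chosen for illustration; nothing like this exists in o.a.n.service today. In a real JAX-RS setup the same logic would more likely live in a ContainerRequestFilter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a request may proceed.
class ApiKeyGate {
    // Assumed list of path prefixes that must not be reachable without a key.
    private static final Set<String> PROTECTED_PREFIXES =
            Set.of("/admin", "/job", "/config");

    private final byte[] expectedKey;

    ApiKeyGate(String expectedKey) {
        this.expectedKey = expectedKey.getBytes(StandardCharsets.UTF_8);
    }

    // True if the path is one of the protected prefixes or sits below one.
    boolean requiresKey(String path) {
        return PROTECTED_PREFIXES.stream()
                .anyMatch(p -> path.equals(p) || path.startsWith(p + "/"));
    }

    // Unprotected paths pass through; protected paths need a matching key.
    boolean isAllowed(String path, String presentedKey) {
        if (!requiresKey(path)) {
            return true;
        }
        if (presentedKey == null) {
            return false;
        }
        // MessageDigest.isEqual compares in constant time, which avoids
        // leaking how many leading bytes of the key matched.
        return MessageDigest.isEqual(
                expectedKey, presentedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```

With a gate like this, a request to /admin/stop without a key would be rejected, while an (assumed) /health probe could stay open for liveness checks.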
