Hi dev@,

For a while now I've been thinking that the Nutch REST API (NutchServer,
JAX-RS/Apache CXF) [0] has become somewhat of a burden. It hasn't seen much
activity for quite a while and the underlying dependencies are dated. I'd
personally like to get the community's input on whether the REST API is
still a feature we wish to continue maintaining (albeit passively). Let me
provide context below.

Current state of the REST API
=======================
The REST API consists of ~35 Java source files under o.a.n.service.*,
exposing endpoints for admin operations, job management, configuration,
seed management, and database reading.
From what I can tell, the following issues exist:

   1. No authentication or authorization whatsoever. Every endpoint is
   completely open, including:
      1. GET /admin/stop -- allows unauthenticated remote server shutdown
      2. POST /job/create -- allows unrestricted job creation
      3. PUT /config/{configId}/{propertyId} -- allows unrestricted
      configuration modification
      4. No input validation (no Bean Validation annotations) and no CORS
      policy
   2. No health or metrics endpoints: no /health or /healthz for
   liveness/readiness probes (relevant for Docker/K8s deployments), and no
   /metrics endpoint for Prometheus or similar. The Docker setup
   (docker/Dockerfile) also has no health checks defined.
   3. Near-zero test coverage. TestNutchServer.java contains a single test
   that attempts to start the server and hit the /admin endpoint, but both
   assertions are commented out (//Assert.assertTrue(...)). There are no
   endpoint-level tests for any of the other resources (job, config, seed, db,
   reader).
   4. Code quality issues:
      1. Class name typo: ReaderResouce (missing 'r')
      2. SequenceReader.java has 6 auto-generated TODO stubs (unimplemented
      methods)
      3. No OpenAPI/Swagger documentation (something I've wanted to do for
      a long time)

The question for the community
========================
Addressing these issues properly would be a substantial amount of work. At
the same time, it's unclear how widely the REST API is actually being used
(in production) by Nutch users.

Therefore I'd like to propose a few items for discussion:

(A) Invest in hardening the REST API. Add auth (at minimum API key or basic
auth), input validation, health/metrics endpoints, OpenAPI docs, and proper
test coverage. This is the most work but preserves the API for users who
depend on it. Quite honestly, even if we selected this option, I would
propose we start over with a fresh OpenAPI specification and build it out
from there.
(B) Deprecate the REST API. Mark it as deprecated in the current release
with a notice that it will be removed in a future major version, giving
users time to migrate to CLI-based workflows or their own orchestration
layer.
(C) Remove the REST API. Remove the o.a.n.service package entirely from the
codebase. This eliminates the security surface and ongoing maintenance
burden.
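
To give a feel for the smallest piece of option (A), here is a rough sketch
of an API key check. It is only an illustration under assumptions: the
class name ApiKeyCheck and the method isAuthorized are hypothetical and do
not exist in Nutch today. The idea is that something like a JAX-RS
ContainerRequestFilter could read a key from a request header and call
this before any resource method runs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical helper, not part of Nutch: compares a presented API key
// against the configured one.
public final class ApiKeyCheck {

    private ApiKeyCheck() {
    }

    /**
     * Returns true only if both keys are present, the expected key is
     * non-empty, and the two keys are byte-for-byte equal.
     */
    public static boolean isAuthorized(String presentedKey, String expectedKey) {
        // Fail closed: a missing or empty configured key never authorizes.
        if (presentedKey == null || expectedKey == null || expectedKey.isEmpty()) {
            return false;
        }
        byte[] presented = presentedKey.getBytes(StandardCharsets.UTF_8);
        byte[] expected = expectedKey.getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual is used instead of String.equals to avoid
        // leaking how many leading characters matched through timing.
        return MessageDigest.isEqual(presented, expected);
    }
}
```

Where the expected key would come from (nutch-site.xml, an environment
variable, something else) is an open design question and deliberately left
out of the sketch.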

If anyone on the list is actively using the REST API (or knows of
deployments that do), it would be very helpful to hear about your use case
and whether the current API meets your needs.

Thanks,
lewismc

P.S. My current leaning is towards option B but I am really keen to read
other opinions.

[0]
https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/service
-- 
http://people.apache.org/keys/committer/lewismc
