I didn't finish my thoughts for B(ii)

B(ii)
- provide an OpenAPI specification (see my comments at
  https://github.com/apache/nutch/pull/896) and address the errors and
  warnings
- immediately deprecate the existing service and note it is for
  non-production use.
- a further issue concerns whether we ship the openapi.yaml (and generated
  server implementation) with the Nutch codebase or separate it out to a new
  repository. If we were to keep it with the existing repository, then I
  would opt for some build flag which would allow the creation of artifacts
  with or without Nutch service included. A compile-time flag of sorts.

Zooming out, if the OpenAPI exists then theoretically ANYONE can come along
and generate their own server and client implementation in any language they
want. The implementation just needs to know about NUTCH_HOME, etc. and be
able to interface with Nutch classes.

One further thought I had is that the Nutch service should theoretically be
able to interface with a Hadoop cluster in order to fetch (and cache) state
rather than persist state in the running service process. I believe this is
another shortcoming of the current implementation.

Finally (for now) no authentication layer exists with the target Hadoop
cluster...

On 2026/02/14 01:29:31 Lewis John McGibbney wrote:
> Hi Isabelle,
>
> On 2026/02/13 16:53:29 Isabelle Giguere wrote:
>
> > I'm aware of one integration that uses the REST API. I'm not working on
> > that application anymore, so I have no idea if the web crawler remains an
> > important feature today, or if it would remain part of the product if the
> > REST API disappears.
>
> OK
>
> > I would add one more concern to the list:
> > NutchServer method start() hard-codes the protocol "http", and there is no
> > way to configure NutchServer to start on https, even if the protocol was
> > not hard-coded, and the private constructor makes it impossible to extend
> > NutchServer to fix the issue.
>
> Excellent observation. Yet one more nail in the coffin for the current
> implementation as-is.
>
> > IMHO, deprecating or removing the REST API and providing only a CLI,
> > however, could make Nutch less likely to be used in professional
> > integrations. And therefore, maybe less likely to attract contributions.
>
> I agree to some degree. In reality the Nutch service is not production grade
> right now. I haven't heard about any production usage in a while (apart from
> your above reference) and I think the current implementation maybe does more
> damage than good when shipped with the current Nutch releases.
> It also adds (outdated) and unnecessary dependency bloat to the Nutch .job
> artifacts we submit to the Hadoop cluster.
>
> > That being said, I would not fight against option B.
>
> You got me thinking that there are maybe two sub-options for B.
>
> B(i) as per my original narrative
>
> B(ii) provide an OpenAPI specification (see my comments at
> https://github.com/apache/nutch/pull/896) and deprecate the existing service
> for removal in the next version of Nutch. If the OpenAPI exists then
> theoretically ANYONE can come along and generate their own server and client
> implementation in any language they want. The implementation just needs to
> know about NUTCH_HOME, etc. and be able to interface with Nutch classes.
>
> If we can get enough peer review for the above PR then I would be in favor
> of option B(ii).
>
> Sorry for making this more complicated.
>

