I've never used it, so I think B(ii) sounds best, unless we
hear more from current users. I'd prefer the compile-time flag option
and, unless we learn at some point that no one uses it, some hand-holding
tutorial content for it.
I've worked locally all along, so I defer to those with experience on the
Hadoop concerns.
Thanks, stay safe, stay healthy,
Joe
On 2/13/26 21:08, BlackIce wrote:
I think that there is no real reason to keep that around, so I say
option B.
On Sat, Feb 14, 2026 at 2:53 AM Lewis John McGibbney
<[email protected]> wrote:
I didn't finish my thoughts for B(ii)
B(ii)
- provide an OpenAPI specification (see my comments at
https://github.com/apache/nutch/pull/896) and address the errors
and warnings
- immediately deprecate the existing service and note it is for
non-production use.
- a further issue concerns whether we ship the openapi.yaml (and
generated server implementation) with the Nutch codebase or
separate it out to a new repository. If we were to keep it with
the existing repository, then I would opt for some build flag
which would allow the creation of artifacts with or without Nutch
service included. A compile-time flag of sorts.
Zooming out, if the OpenAPI exists then theoretically ANYONE can
come along and generate their own server and client implementation
in any language they want. The implementation just needs to know
about NUTCH_HOME, etc. and be able to interface with Nutch classes.
One further thought I had is that the Nutch service should
theoretically be able to interface with a Hadoop cluster in order
to fetch (and cache) state rather than persist state in the
running service process. I believe this is another shortcoming of
the current implementation. Finally (for now) no authentication
layer exists with the target Hadoop cluster...
On 2026/02/14 01:29:31 Lewis John McGibbney wrote:
> Hi Isabelle,
>
> On 2026/02/13 16:53:29 Isabelle Giguere wrote:
>
> > I'm aware of one integration that uses the REST API. I'm not working on
> > that application anymore, so I have no idea if the web crawler remains an
> > important feature today, or if it would remain part of the product if the
> > REST API disappears.
>
> OK
>
> > I would add one more concern to the list:
> > NutchServer method start() hard-codes the protocol "http", and there is no
> > way to configure NutchServer to start on https, even if the protocol was
> > not hard-coded, and the private constructor makes it impossible to extend
> > NutchServer to fix the issue.
>
> Excellent observation. Yet one more nail in the coffin for the
> current implementation as-is.
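
To make the hard-coded protocol concern concrete, here is a minimal sketch of how a service address could take its scheme from configuration instead. It is not taken from the Nutch codebase: the class name ServiceAddress, its constructor, and the example host and port are all hypothetical and only illustrate the idea. Serving https would additionally need TLS configuration in the server itself, which this sketch does not cover.

```java
import java.net.URI;
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch): builds the base address of a
 * REST service from a configurable scheme rather than a hard-coded "http".
 */
final class ServiceAddress {
    private final String scheme;
    private final String host;
    private final int port;

    // Non-private constructor, so the class can be created or wrapped by callers.
    ServiceAddress(String scheme, String host, int port) {
        String normalized = scheme.toLowerCase(Locale.ROOT);
        // Only the two schemes discussed in the thread are accepted.
        if (!normalized.equals("http") && !normalized.equals("https")) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        this.scheme = normalized;
        this.host = host;
        this.port = port;
    }

    /** Returns the base URI, e.g. scheme://host:port. */
    URI toUri() {
        return URI.create(scheme + "://" + host + ":" + port);
    }
}
```

With this shape, new ServiceAddress("https", "localhost", 8081).toUri() produces an https base address, and an unsupported scheme fails fast with an IllegalArgumentException instead of silently falling back to http.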
>
> > IMHO, deprecating or removing the REST API and providing only a CLI,
> > however, could make Nutch less likely to be used in professional
> > integrations. And therefore, maybe less likely to attract contributions.
>
> I agree to some degree. In reality the Nutch service is not
> production grade right now. I haven't heard about any production
> usage in a while (apart from your above reference) and I think the
> current implementation may do more harm than good when
> shipped with the current Nutch releases. It also adds outdated
> and unnecessary dependency bloat to the Nutch .job artifacts we
> submit to the Hadoop cluster.
>
> > That being said, I would not fight against option B.
>
> You got me thinking that there are maybe two sub-options for B.
>
> B(i) as per my original narrative
>
> B(ii) provide an OpenAPI specification (see my comments at
> https://github.com/apache/nutch/pull/896) and deprecate the
> existing service for removal in the next version of Nutch. If the
> OpenAPI exists then theoretically ANYONE can come along and
> generate their own server and client implementation in any
> language they want. The implementation just needs to know about
> NUTCH_HOME, etc. and be able to interface with Nutch classes.
>
> If we can get enough peer review for the above PR then I would
> be in favor of option B(ii).
>
> Sorry for making this more complicated.
>