I've never used it, so I think B(ii) sounds best, unless we hear more from current users. I'd prefer the compile-time flag option and, unless we learn at some point that no one uses it, some hand-holding tutorial content for it.

I've worked locally all along, so I defer to those with experience on the Hadoop concerns.

 Thanks, stay safe, stay healthy,

 Joe

On 2/13/26 21:08, BlackIce wrote:
I think that there is no real reason to keep that around, so I say option B.

On Sat, Feb 14, 2026 at 2:53 AM Lewis John McGibbney <[email protected]> wrote:

    I didn't finish my thoughts for B(ii)

    B(ii)
    - provide an OpenAPI specification (see my comments at
    https://github.com/apache/nutch/pull/896) and address the errors
    and warnings
    - immediately deprecate the existing service and note it is for
    non-production use.
    - a further issue concerns whether we ship the openapi.yaml (and
    generated server implementation) with the Nutch codebase or
    separate it out to a new repository. If we were to keep it with
    the existing repository, then I would opt for some build flag
    which would allow the creation of artifacts with or without Nutch
    service included. A compile-time flag of sorts.

    Zooming out, if the OpenAPI exists then theoretically ANYONE can
    come along and generate their own server and client implementation
    in any language they want. The implementation just needs to know
    about NUTCH_HOME, etc. and be able to interface with Nutch classes.

    One further thought I had is that the Nutch service should
    theoretically be able to interface with a Hadoop cluster in order
    to fetch (and cache) state rather than persist state in the
    running service process. I believe this is another shortcoming of
    the current implementation. Finally (for now) no authentication
    layer exists with the target Hadoop cluster...

    On 2026/02/14 01:29:31 Lewis John McGibbney wrote:
    > Hi Isabelle,
    >
    > On 2026/02/13 16:53:29 Isabelle Giguere wrote:
    >
    > > I'm aware of one integration that uses the REST API.  I'm not working on
    > > that application anymore, so I have no idea if the web crawler remains an
    > > important feature today, or if it would remain part of the product if the
    > > REST API disappears.
    >
    > OK
    >
    > > I would add one more concern to the list:
    > > NutchServer method start() hard-codes the protocol "http", and there is no
    > > way to configure NutchServer to start on https, even if the protocol was
    > > not hard-coded, and the private constructor makes it impossible to extend
    > > NutchServer to fix the issue.
    >
    > Excellent observation. Yet one more nail in the coffin for the
    > current implementation as-is.
    >
    > > IMHO, deprecating or removing the REST API and providing only a CLI,
    > > however, could make Nutch less likely to be used in professional
    > > integrations.  And therefore, maybe less likely to attract contributions.
    >
    > I agree to some degree. In reality the Nutch service is not
    > production grade right now. I haven't heard about any production
    > usage in a while (apart from your above reference) and I think the
    > current implementation maybe does more damage than good when
    > shipped with the current Nutch releases. It also adds (outdated)
    > and unnecessary dependency bloat to the Nutch .job artifacts we
    > submit to the Hadoop cluster.
    >
    > > That being said, I would not fight against option B.
    >
    > You got me thinking that there are maybe two sub-options for B.
    >
    > B(i) as per my original narrative
    >
    > B(ii) provide an OpenAPI specification (see my comments at
    > https://github.com/apache/nutch/pull/896) and deprecate the
    > existing service for removal in the next version of Nutch. If the
    > OpenAPI exists then theoretically ANYONE can come along and
    > generate their own server and client implementation in any
    > language they want. The implementation just needs to know about
    > NUTCH_HOME, etc. and be able to interface with Nutch classes.
    >
    > If we can get enough peer review for the above PR then I would
    > be in favor of option B(ii).
    >
    > Sorry for making this more complicated.
    >
