Re: Integrating Nutch into the Bigtop Ecosystem

lewis john mcgibbney Wed, 04 Mar 2026 14:06:23 -0800

Hi Seb,

On Wed, Mar 4, 2026, 11:08 AM <[email protected]> wrote:


>
> thanks! Since Common Crawl runs Nutch on a Bigtop cluster
> any deeper integration into the Bigtop ecosystem is very welcome.
> The smoke tests are maybe the most useful part. Every time
> Bigtop is updated, it takes a while to verify that Nutch and all
> its plugins are running smoothly.
>

Yes, smoke testing was honestly my main goal when I decided to explore, learn
and work on this initiative.


> But I have no good idea about packaging. All the Bigtop packages
> are infrastructure providing core components or services. The
> way Nutch is used and deployed on a Hadoop cluster is all
> in the "user space": jar and configuration files are specific
> for this particular Nutch job setup and their classpath does not
> interfere with that of other jobs.
>

Agreed. The issue (as you mention below) is that Nutch doesn't run as a
persistent service the way Solr, Ranger, etc. do. Various packages (Debian,
Ubuntu, etc.) can be generated, but some further thought needs to be put into
package characteristics.


> "Nutch server" would go easier as a Bigtop package. But Nutch
> server was never designed to run on a cluster, just in local mode.
>

Correct. The Nutch OpenAPI specification at
https://github.com/apache/nutch/pull/896 could be extended or reimagined to
offer a Hadoop cluster interface. Packaging would then align with all
existing Bigtop services.

Thanks for chiming in.
lewismc
