Hi Seb,

On Wed, Mar 4, 2026, 11:08 AM <[email protected]> wrote:
> thanks! Since Common Crawl runs Nutch on a Bigtop cluster
> any deeper integration into the Bigtop ecosystem is very welcome.
> The smoke tests are maybe the most useful part. Every time
> Bigtop is updated, it takes a while to verify that Nutch and all
> its plugins are running smoothly.

Yes, smoke testing was honestly my main goal when I decided to explore, learn and work on this initiative.

> But I have no good idea about packaging. All the Bigtop packages
> are infrastructure providing core components or services. The
> way how Nutch is used and deployed on a Hadoop cluster is all
> in the "user space": jar and configuration files are specific
> for this particular Nutch job setup and their classpath does not
> interfere with that of other jobs.

Agreed. The issue (as you mention below) is that Nutch doesn't run as a persistent service the way Solr, Ranger, etc. do. Various packages (Debian, Ubuntu, etc.) can be generated, but some further thought needs to be put into package characteristics.

> "Nutch server" would go easier as a Bigtop package. But Nutch
> server was never designed to run on a cluster, just in local mode.

Correct. The Nutch OpenAPI specification at https://github.com/apache/nutch/pull/896 could be extended or reimagined to offer a Hadoop cluster interface. Packaging would then align with all existing Bigtop services.

Thanks for chiming in.

lewismc

