lewismc commented on PR #1380:
URL: https://github.com/apache/bigtop/pull/1380#issuecomment-3978619822
# Testing the Nutch integration
This guidance is intended for peer reviewers interested in the Apache Nutch
integration in Bigtop. Nutch is built from source with Ant (`ant runtime`),
packaged using **runtime/deploy** for Hadoop cluster execution, and all smoke
tests run against a Hadoop cluster using HDFS.
## Prerequisites
- **JDK 8 or 11** – Required for Gradle and for the Nutch/Hadoop stack. Set
`JAVA_HOME` accordingly.
- **Hadoop cluster** – Smoke tests require a running cluster (HDFS and
YARN). They use `HADOOP_CONF_DIR` and will not run without it.
- **x86_64 Linux** – Building Nutch packages via `nutch-pkg-ind` uses the
Bigtop Docker slave image, which is published only for x86_64. On Apple Silicon
(arm64) the build script passes `--platform linux/amd64`, but running amd64
containers under emulation can fail with "exec format error", so package builds
are most reliable on native x86_64 Linux.
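Before building, a quick pre-flight check along these lines can catch the two most common failure modes (this script is an illustrative sketch, not part of Bigtop; adjust it for your environment):

```bash
#!/bin/sh
# Pre-flight sketch (illustrative, not part of Bigtop): warn about the
# prerequisites that most often break the package build.
arch="$(uname -m)"
if [ "$arch" != "x86_64" ]; then
  echo "warning: $arch host; nutch-pkg-ind will run amd64 containers under emulation" >&2
fi

if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
  "$JAVA_HOME/bin/java" -version
else
  echo "warning: JAVA_HOME is unset or does not point at a JDK" >&2
fi
```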
## 1. Build the Nutch package
From the Bigtop repo root:
```bash
export JAVA_HOME=/path/to/jdk8-or-11
./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"
```
To build Nutch and its dependencies (e.g. Hadoop) in Docker:
```bash
./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" \
  -Dbuildwithdeps=true
```
Output appears under `build/nutch/` and `output/nutch/`. The installed Nutch
uses **runtime/deploy** (uber jar and scripts that run via `hadoop jar` on the
cluster).
## 2. Run the Nutch smoke tests
Smoke tests **require** a Hadoop cluster: they use HDFS for seed URLs,
crawldb, and segments, and they expect `HADOOP_CONF_DIR` to be set.
### On a host where Nutch and Hadoop are already installed
Set the environment and run the Nutch smoke tests:
```bash
export JAVA_HOME=/path/to/jdk
export HADOOP_CONF_DIR=/etc/hadoop/conf # or your cluster's conf dir
./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests
```
Or from the smoke-tests directory:
```bash
cd bigtop-tests/smoke-tests
../../gradlew nutch:test -Psmoke.tests
```
Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then
generate on HDFS. Cleanup removes `/user/root/nutch-smoke` from HDFS.
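The same flow can be replayed by hand against a live cluster, which helps isolate a failing step (the seed URL is illustrative; `/user/root/nutch-smoke` matches the tests' working directory):

```bash
# Manual replay of the smoke-test steps (requires installed Nutch + running HDFS)
BASE=/user/root/nutch-smoke
hdfs dfs -mkdir -p "$BASE/urls"
echo "https://example.com/" | hdfs dfs -put -f - "$BASE/urls/seed.txt"
nutch inject "$BASE/crawldb" "$BASE/urls"        # seed the crawldb
nutch readdb "$BASE/crawldb" -stats              # stats should show at least 1 URL
nutch generate "$BASE/crawldb" "$BASE/segments"  # write a fetch segment
hdfs dfs -ls "$BASE/segments"                    # expect at least one segment dir
hdfs dfs -rm -r -skipTrash "$BASE"               # same cleanup the tests perform
```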
### Via Docker provisioner (full stack + smoke)
1. Build packages (with deps if needed) and enable the local repo in
`provisioner/docker/config.yaml`:
- `enable_local_repo: true`
- `nutch` is already in `components` and `smoke_test_components`.
2. From `provisioner/docker/`:
```bash
./docker-hadoop.sh --create 3 --smoke-tests
```
This provisions a cluster (including Nutch), then runs all smoke tests
(including Nutch). Ensure the provisioner has enough resources and that the
Nutch packages are present in the local repo (e.g. under `output/apt` or
equivalent).
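For step 1, the relevant lines in `provisioner/docker/config.yaml` look roughly like this (an illustrative excerpt; the exact component lists in your checkout may differ, and the rest of the file should be kept as shipped):

```yaml
# Illustrative excerpt of provisioner/docker/config.yaml
enable_local_repo: true
components: [hdfs, yarn, mapreduce, nutch]
smoke_test_components: [nutch]
```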
## 3. Deploy Nutch with Puppet
To deploy Nutch on a Bigtop-managed cluster, include `nutch` in the cluster
components (e.g. in Hiera or `site.yaml`):
```yaml
hadoop_cluster_node::cluster_components:
  - hdfs
  - yarn
  - mapreduce
  - nutch
```
Nodes that receive the `nutch-client` role will have the Nutch package
installed and `/etc/default/nutch` configured with `NUTCH_HOME`,
`NUTCH_CONF_DIR`, and `HADOOP_CONF_DIR`. Run crawl commands (e.g. `nutch
inject`, `nutch generate`) from a gateway/client node against HDFS paths.
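For reference, the resulting `/etc/default/nutch` looks roughly like this (the paths shown are illustrative; the actual values come from the Puppet module):

```bash
# /etc/default/nutch (illustrative values)
export NUTCH_HOME=/usr/lib/nutch
export NUTCH_CONF_DIR=/etc/nutch/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```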
## 4. Quick sanity checks (no cluster)
Without a cluster you can still confirm that the test project loads and
compiles:
```bash
./gradlew bigtop-tests:smoke-tests:nutch:tasks
./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy
```
The full test suite will not pass without `HADOOP_CONF_DIR` and a running
cluster.
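A small guard along these lines can decide which mode applies before launching anything (a sketch; the function name is illustrative):

```bash
# Guard sketch (illustrative): only suggest the full suite when a cluster
# configuration directory is actually visible.
can_run_smoke() {
  # true when the given conf dir is set and exists
  [ -n "${1:-}" ] && [ -d "${1:-}" ]
}

if can_run_smoke "${HADOOP_CONF_DIR:-}"; then
  echo "cluster config found; ok to run: ./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests"
else
  echo "no cluster config; stick to the compile-only checks"
fi
```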
## 5. What the smoke tests do
| Test | Description |
|------|-------------|
| `testNutchUsage` | Runs `nutch` with no arguments; expects exit 0 and usage output. |
| `testNutchInjectSubcommand` | Runs `nutch inject` with no arguments; expects a non-zero exit and a usage/error message. |
| `testNutchInjectAndReaddb` | Creates `/user/root/nutch-smoke/urls/seed.txt` on HDFS, runs `nutch inject` and `nutch readdb -stats` on HDFS paths, and asserts on the stats output. |
| `testNutchGenerate` | Runs `nutch generate` with HDFS crawldb and segments paths, then verifies at least one segment under the segments directory. |
All tests use the deploy runtime (cluster mode) and HDFS only; there are no
local-mode or `/tmp`-based crawl directories.