lewismc commented on PR #1380:
URL: https://github.com/apache/bigtop/pull/1380#issuecomment-3978619822

   # Testing the Nutch integration
   
   This guidance is intended for peer reviewers interested in the Apache Nutch 
integration in Bigtop. Nutch is built from source with Ant (`ant runtime`), 
packaged using **runtime/deploy** for Hadoop cluster execution, and all smoke 
tests run against a Hadoop cluster using HDFS.
   
   ## Prerequisites
   
   - **JDK 8 or 11** – Required for Gradle and for the Nutch/Hadoop stack. Set 
`JAVA_HOME` accordingly.
   - **Hadoop cluster** – Smoke tests require a running cluster (HDFS and 
YARN). They use `HADOOP_CONF_DIR` and will not run without it.
   - **x86_64 Linux** – Building Nutch packages via `nutch-pkg-ind` uses the 
Bigtop Docker slave image, which is only published for x86_64. On Apple Silicon 
(arm64), the build script uses `--platform linux/amd64`; running amd64 
containers under emulation can fail with "exec format error", so building 
packages is most reliable on native x86_64 Linux.
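
   The prerequisites above can be sketched as a quick pre-flight check (illustrative only; the expected values come from the list above):

   ```bash
   # Pre-flight check: architecture, JAVA_HOME, and JDK version
   echo "Arch: $(uname -m)"                    # x86_64 for native package builds
   echo "JAVA_HOME: ${JAVA_HOME:-<unset>}"     # should point at a JDK 8 or 11
   if command -v java >/dev/null 2>&1; then
     java -version 2>&1 | head -n 1            # expect 1.8.x or 11.x
   else
     echo "java not found on PATH"
   fi
   ```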
   
   ## 1. Build the Nutch package
   
   From the Bigtop repo root:
   
   ```bash
   export JAVA_HOME=/path/to/jdk8-or-11
   ./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"
   ```
   
   To build Nutch and its dependencies (e.g. Hadoop) in Docker:
   
   ```bash
   ./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true
   ```
   
   Output appears under `build/nutch/` and `output/nutch/`. The installed Nutch 
uses **runtime/deploy** (uber jar and scripts that run via `hadoop jar` on the 
cluster).
   
   ## 2. Run the Nutch smoke tests
   
   Smoke tests **require** a Hadoop cluster: they use HDFS for seed URLs, 
crawldb, and segments, and they expect `HADOOP_CONF_DIR` to be set.
   
   ### On a host where Nutch and Hadoop are already installed
   
   Set the environment and run the Nutch smoke tests:
   
   ```bash
   export JAVA_HOME=/path/to/jdk
   export HADOOP_CONF_DIR=/etc/hadoop/conf   # or your cluster's conf dir
   ./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests
   ```
   
   Or from the smoke-tests directory:
   
   ```bash
   cd bigtop-tests/smoke-tests
   ../../gradlew nutch:test -Psmoke.tests
   ```
   
   Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then 
generate on HDFS. Cleanup removes `/user/root/nutch-smoke` from HDFS.
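
   If a run is interrupted before cleanup, the HDFS working directory can be checked and removed by hand (a sketch; the path is the one used by the tests, and the block is guarded so it is a no-op where HDFS is unavailable):

   ```bash
   # Manual check/cleanup of the smoke-test HDFS directory
   if command -v hdfs >/dev/null 2>&1; then
     hdfs dfs -ls /user/root/nutch-smoke 2>/dev/null \
       && hdfs dfs -rm -r -skipTrash /user/root/nutch-smoke \
       || echo "no leftover smoke-test data"
   else
     echo "hdfs not on PATH; run this on a cluster node"
   fi
   ```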
   
   ### Via Docker provisioner (full stack + smoke)
   
   1. Build packages (with deps if needed) and enable the local repo in 
`provisioner/docker/config.yaml`:
      - `enable_local_repo: true`
      - `nutch` is already in `components` and `smoke_test_components`.
   
   2. From `provisioner/docker/`:
   
      ```bash
      ./docker-hadoop.sh --create 3 --smoke-tests
      ```
   
      This provisions a cluster (including Nutch), then runs all smoke tests 
(including Nutch). Ensure the provisioner has enough resources and that the 
Nutch packages are present in the local repo (e.g. under `output/apt` or 
equivalent).
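
   The relevant `config.yaml` settings, sketched as a fragment (only `enable_local_repo` normally needs changing; the exact contents of the two component lists shown here are illustrative):

   ```yaml
   enable_local_repo: true
   components: [hdfs, yarn, mapreduce, nutch]
   smoke_test_components: [mapreduce, nutch]
   ```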
   
   ## 3. Deploy Nutch with Puppet
   
   To deploy Nutch on a Bigtop-managed cluster, include `nutch` in the cluster 
components (e.g. in Hiera or `site.yaml`):
   
   ```yaml
   hadoop_cluster_node::cluster_components:
     - hdfs
     - yarn
     - mapreduce
     - nutch
   ```
   
   Nodes that receive the `nutch-client` role will have the Nutch package 
installed and `/etc/default/nutch` configured with `NUTCH_HOME`, 
`NUTCH_CONF_DIR`, and `HADOOP_CONF_DIR`. Run crawl commands (e.g. `nutch 
inject`, `nutch generate`) from a gateway/client node against HDFS paths.
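
   The crawl commands above can be sketched end to end (illustrative; the HDFS paths and seed URL are hypothetical, and the block is guarded so it is a no-op off-cluster):

   ```bash
   # Minimal inject -> readdb -> generate sequence from a gateway node
   if command -v hadoop >/dev/null 2>&1 && command -v nutch >/dev/null 2>&1; then
     printf 'https://nutch.apache.org/\n' > seed.txt
     hdfs dfs -mkdir -p /user/$(whoami)/crawl/urls
     hdfs dfs -put -f seed.txt /user/$(whoami)/crawl/urls/
     nutch inject /user/$(whoami)/crawl/crawldb /user/$(whoami)/crawl/urls
     nutch readdb /user/$(whoami)/crawl/crawldb -stats
     nutch generate /user/$(whoami)/crawl/crawldb /user/$(whoami)/crawl/segments
   else
     echo "hadoop/nutch not on PATH; run these steps on a cluster gateway node"
   fi
   ```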
   
   ## 4. Quick sanity checks (no cluster)
   
   Without a cluster you can still confirm that the test project loads and 
compiles:
   
   ```bash
   ./gradlew bigtop-tests:smoke-tests:nutch:tasks
   ./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy
   ```
   
   The full test suite will not pass without `HADOOP_CONF_DIR` and a running 
cluster.
   
   ## 5. What the smoke tests do
   
   | Test | Description |
   |------|-------------|
   | `testNutchUsage` | Runs `nutch` with no arguments; expects exit 0 and usage output. |
   | `testNutchInjectSubcommand` | Runs `nutch inject` with no arguments; expects non-zero exit and a usage/error message. |
   | `testNutchInjectAndReaddb` | Creates `/user/root/nutch-smoke/urls/seed.txt` on HDFS, runs `nutch inject` and `nutch readdb -stats` on HDFS paths, and asserts on the stats output. |
   | `testNutchGenerate` | Runs `nutch generate` with HDFS crawldb and segments paths, then verifies at least one segment exists under the segments directory. |
   
   All tests use the deploy runtime (cluster mode) and HDFS only; there are no 
local-mode or `/tmp`-based crawl directories.
   

