Copilot commented on code in PR #2462:
URL: https://github.com/apache/tika/pull/2462#discussion_r2628822570


##########
tika-grpc/docker-build/README.md:
##########
@@ -0,0 +1,191 @@
+# Tika gRPC Docker Build
+
+This directory contains the Docker build configuration for Apache Tika gRPC server.
+
+## Overview
+
+The Docker image includes:
+- Tika gRPC server JAR
+- All Tika Pipes plugins (fetchers, emitters, iterators)
+- Parser packages (standard, extended, ML)
+- OCR support (Tesseract with multiple languages)
+- GDAL for geospatial formats
+- Common fonts
+
+## Building the Docker Image
+
+### Prerequisites
+
+1. Build Tika from the project root (this builds all modules including plugins):
+```bash
+cd <tika-root>
+mvn clean install -DskipTests
+```
+
+### Build Activation
+
+The Docker build can be activated in two ways:
+
+**Option 1: Using environment variables (recommended)**
+- Set `DOCKER_ID`, `AWS_ACCOUNT_ID`, or `AZURE_REGISTRY_NAME`
+- Maven profiles automatically detect these and enable the build
+- No need for `-Dskip.docker.build=false`
+
+**Option 2: Using Maven property**
+- Add `-Dskip.docker.build=false` to your Maven command
+- Use when you want explicit control or testing
+
+### Building from Tika Root
+
+**Build tika-grpc and dependencies only:**
+```bash
+DOCKER_ID=myusername \
+  mvn clean install -DskipTests -pl :tika-grpc -am
+```
+
+**Build entire project:**
+```bash
+DOCKER_ID=myusername \
+  mvn clean install -DskipTests
+```
+
+### Building from tika-grpc Directory
+
+#### Controlling Docker Build with Environment Variables
+
+All docker-build.sh environment variables are passed through from your shell. When these variables are set, the Maven profiles automatically activate the Docker build.
+
+**Build and push to Docker Hub:**
+```bash
+DOCKER_ID=myusername \
+  mvn package
+```
+
+**Build multi-arch and push to Docker Hub:**
+```bash
+MULTI_ARCH=true DOCKER_ID=myusername \
+  mvn package
+```
+
+**Build and push to AWS ECR:**
+```bash
+AWS_ACCOUNT_ID=123456789012 AWS_REGION=us-east-1 \
+  mvn package
+```
+
+**Build and push to Azure Container Registry:**
+```bash
+AZURE_REGISTRY_NAME=myregistry \
+  mvn package
+```
+
+**Note:** When environment variables are set, you don't need `-Dskip.docker.build=false`. The Maven profiles detect the variables and automatically enable the build.
+
+### Option 2: Run the Docker Build Script Manually
+
+Set the required environment variable and run the script:
+
+```bash
+export TIKA_VERSION=4.0.0-SNAPSHOT
+./tika-grpc/docker-build/docker-build.sh
+```
+
+### Optional Environment Variables
+
+- `TIKA_VERSION`: Maven project version (required)
+- `RELEASE_IMAGE_TAG`: Override the default tag (defaults to TIKA_VERSION without -SNAPSHOT)
+- `DOCKER_ID`: Docker Hub username to push to Docker Hub
+- `AWS_ACCOUNT_ID`: AWS account ID to push to ECR
+- `AWS_REGION`: AWS region for ECR (default: us-west-2)
+- `AZURE_REGISTRY_NAME`: Azure Container Registry name
+- `MULTI_ARCH`: Build for multiple architectures (default: false)
+- `PROJECT_NAME`: Docker image name (default: tika-grpc)
+
+### Examples
+
+**Build with Docker Hub using environment variable:**
+```bash
+DOCKER_ID=myusername \
+  mvn package
+```
+
+**Build multi-arch with Docker Hub:**
+```bash
+MULTI_ARCH=true DOCKER_ID=myusername \
+  mvn package
+```
+
+**Build with AWS ECR:**
+```bash
+AWS_ACCOUNT_ID=123456789012 AWS_REGION=us-east-1 \
+  mvn package
+```
+
+**Build with explicit property (for testing/development):**
+```bash
+mvn package -Dskip.docker.build=false -DDOCKER_ID=myusername

Review Comment:
   The command example is incorrect. '-DDOCKER_ID=myusername' sets a Maven user
   property, not an environment variable, and the profiles are activated by the
   environment variable (env.DOCKER_ID). Either set the variable in the shell,
   as in 'DOCKER_ID=myusername mvn package -Dskip.docker.build=false', or rely
   on the environment variable alone and drop the property override.
   ```suggestion
   DOCKER_ID=myusername mvn package -Dskip.docker.build=false
   ```
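
   To illustrate the difference, here is a minimal sketch in which a plain
   `sh -c` child process stands in for Maven. That stand-in is an assumption
   made for illustration only; it says nothing about Maven internals. The child
   prints `DOCKER_ID` from its environment and ignores its arguments:

   ```bash
   # Hypothetical stand-in for a child process such as Maven: prints DOCKER_ID
   # from its environment and ignores its command-line arguments.
   child='printf "%s\n" "${DOCKER_ID:-<unset>}"'

   # A -D style argument is only an argument; the child's environment is unchanged.
   as_property=$(unset DOCKER_ID; sh -c "$child" child -DDOCKER_ID=myusername)

   # A VAR=value prefix places the variable in the child's environment.
   as_env=$(DOCKER_ID=myusername sh -c "$child" child -Dskip.docker.build=false)

   echo "property form sees: ${as_property}"
   echo "env form sees:      ${as_env}"
   ```

   Only the second form makes the variable visible to the child, which is what
   an `env.DOCKER_ID` profile activation depends on.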



##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+    echo "ERROR: Docker is not installed or not in PATH. Please install Docker first."
+    exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+    echo "Environment variable TIKA_VERSION is required, and should match the maven project version of Tika"
+    exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+    RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+    ## Remove '-SNAPSHOT' from the version string
+    RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+    plugin_name=$(basename "$dir")
+    zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+    if [ -f "$zip_file" ]; then
+        cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+    else
+        echo "WARNING: Plugin file $zip_file does not exist, skipping."
+    fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+    "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+    "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+    package_name=$(basename "$parser_package")
+    jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+    if [ -f "$jar_file" ]; then
+        cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+    else
+        echo "Parser package file $jar_file does not exist, skipping."
+    fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+    if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+        echo "ERROR: Failed to authenticate with AWS ECR"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+    if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+        echo "ERROR: Failed to authenticate with Azure Container Registry"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+    IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+    echo "No image tags specified, skipping Docker build step. To enable build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment variables."
+    exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+  echo "Building multi arch image"
+  docker buildx create --name tikabuilder
+  # Pin binfmt to a specific digest for security
+  # see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
+  docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install amd64
+  docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install arm64
+  docker buildx build \
+      --builder=tikabuilder . \
+      ${tag} \
+      --platform linux/amd64,linux/arm64 \
+      --push
+  docker buildx stop tikabuilder
+  docker buildx rm tikabuilder

Review Comment:
   The buildx builder should check if it already exists before attempting to 
create it. Running this script multiple times will fail because the builder 
'tikabuilder' already exists from a previous run. Consider adding a check to 
remove existing builder or use the existing one.
   ```suggestion
   BUILDX_BUILDER_NAME="tikabuilder"
   BUILDX_CREATED_BY_SCRIPT=false
   if [ "${MULTI_ARCH}" == "true" ]; then
     echo "Building multi arch image"
     if docker buildx inspect "${BUILDX_BUILDER_NAME}" >/dev/null 2>&1; then
       echo "Using existing Docker buildx builder '${BUILDX_BUILDER_NAME}'"
     else
       echo "Creating Docker buildx builder '${BUILDX_BUILDER_NAME}'"
       docker buildx create --name "${BUILDX_BUILDER_NAME}"
       BUILDX_CREATED_BY_SCRIPT=true
     fi
     # Pin binfmt to a specific digest for security
     # see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
     docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install amd64
     docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install arm64
     docker buildx build \
         --builder="${BUILDX_BUILDER_NAME}" . \
         ${tag} \
         --platform linux/amd64,linux/arm64 \
         --push
     if [ "${BUILDX_CREATED_BY_SCRIPT}" = "true" ]; then
       docker buildx stop "${BUILDX_BUILDER_NAME}"
       docker buildx rm "${BUILDX_BUILDER_NAME}"
     fi
   ```



##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+    echo "ERROR: Docker is not installed or not in PATH. Please install Docker first."
+    exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+    echo "Environment variable TIKA_VERSION is required, and should match the maven project version of Tika"
+    exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+    RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+    ## Remove '-SNAPSHOT' from the version string
+    RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+    plugin_name=$(basename "$dir")
+    zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+    if [ -f "$zip_file" ]; then
+        cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+    else
+        echo "WARNING: Plugin file $zip_file does not exist, skipping."
+    fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+    "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+    "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+    package_name=$(basename "$parser_package")
+    jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+    if [ -f "$jar_file" ]; then
+        cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+    else
+        echo "Parser package file $jar_file does not exist, skipping."
+    fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"

Review Comment:
   Missing error handling for the 'cp' commands. If the source JAR file doesn't 
exist, the script will continue and produce a broken Docker image. The 'cp' 
command should be followed by error checking or use 'set -e' at the beginning 
of the script to exit on any error.
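
   A minimal sketch of one way to do this, combining `set -e` with an explicit
   existence check. The helper name `copy_required` and the scratch paths are
   hypothetical and not part of the PR:

   ```bash
   set -e

   # Hypothetical helper (not part of the PR): copy one required artifact and
   # fail with a clear message when the source file is missing.
   copy_required() {
     local src="$1" dest_dir="$2"
     if [ ! -f "${src}" ]; then
       echo "ERROR: required file ${src} does not exist" >&2
       return 1
     fi
     cp -v "${src}" "${dest_dir}"
   }

   # Demonstration in a scratch directory (paths are illustrative only).
   work_dir=$(mktemp -d)
   trap 'rm -rf "${work_dir}"' EXIT
   mkdir -p "${work_dir}/libs"
   touch "${work_dir}/tika-grpc-demo.jar"

   copy_required "${work_dir}/tika-grpc-demo.jar" "${work_dir}/libs"

   # In the real script the call would be left unguarded so that set -e aborts;
   # it is guarded here only so the demonstration can report the failure.
   if copy_required "${work_dir}/missing.jar" "${work_dir}/libs"; then
     missing_status="copied"
   else
     missing_status="rejected"
   fi
   echo "missing artifact: ${missing_status}"
   ```

   The explicit check gives a clearer message than the bare `cp` error, while
   `set -e` still catches any other command that fails unexpectedly.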



##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin

Review Comment:
   The Dockerfile copies files before installing system dependencies. The
   copied JARs and plugins change on most builds, and any change to them
   invalidates the cached layer of the apt-get RUN step that follows, so the
   packages are reinstalled on every build. Consider moving the COPY commands
   after the RUN command to improve Docker build cache efficiency.



##########
tika-grpc/pom.xml:
##########
@@ -372,20 +409,65 @@
           </execution>
         </executions>
       </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-antrun-plugin</artifactId>
+        <version>3.1.0</version>
+        <executions>
+          <execution>
+            <id>set-chmod-on-docker-build-sh</id>
+            <phase>validate</phase>
+            <goals>
+              <goal>run</goal>
+            </goals>
+            <configuration>
+              <target>
+                <chmod file="${project.basedir}/docker-build/docker-build.sh" perm="755" failonerror="false"/>
+              </target>
+              <skip>${skip.docker.build}</skip>
+            </configuration>
+          </execution>

Review Comment:
   The Maven antrun plugin execution runs in the 'validate' phase, which is the
   first phase of the default lifecycle and therefore runs before the 'package'
   phase where the docker-build.sh script runs. If the script is ever missing,
   failonerror="false" means the chmod is skipped without failing the build.
   Note that 'initialize' comes after 'validate', so there is no earlier phase
   to move this to. Consider documenting why the validate phase is chosen, or
   rely on git permissions for the script.



##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+    echo "ERROR: Docker is not installed or not in PATH. Please install Docker first."
+    exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+    echo "Environment variable TIKA_VERSION is required, and should match the maven project version of Tika"
+    exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+    RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+    ## Remove '-SNAPSHOT' from the version string
+    RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+    plugin_name=$(basename "$dir")
+    zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+    if [ -f "$zip_file" ]; then
+        cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+    else
+        echo "WARNING: Plugin file $zip_file does not exist, skipping."
+    fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+    "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+    "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+    package_name=$(basename "$parser_package")
+    jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+    if [ -f "$jar_file" ]; then
+        cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+    else
+        echo "Parser package file $jar_file does not exist, skipping."
+    fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+    if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+        echo "ERROR: Failed to authenticate with AWS ECR"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+    if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+        echo "ERROR: Failed to authenticate with Azure Container Registry"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+    IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+    echo "No image tags specified, skipping Docker build step. To enable build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment variables."
+    exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+  echo "Building multi arch image"
+  docker buildx create --name tikabuilder
+  # Pin binfmt to a specific digest for security
+  # see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147

Review Comment:
   The hardcoded SHA256 digest is pinned to a specific version of the binfmt 
image. This digest may become outdated over time. While pinning provides 
security benefits, there should be a comment indicating when this digest was 
last verified or consider using a tagged version that's more maintainable.
   ```suggestion
     # see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
     # Digest last verified on 2025-01-10; update periodically if tonistiigi/binfmt:latest is refreshed.
   ```



##########
tika-grpc/docker-build/start-tika-grpc.sh:
##########
@@ -0,0 +1,29 @@
+#!/bin/bash
+echo "Tika Version:"
+echo "${TIKA_VERSION}"
+echo "Tika Plugins:"
+ls "/tika/plugins"
+echo "Tika gRPC Max Inbound Message Size:"
+echo "${TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE}"
+echo "Tika gRPC Max Outbound Message Size:"
+echo "${TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE}"
+echo "Tika gRPC Num Threads:"
+echo "${TIKA_GRPC_NUM_THREADS}"
+exec java \
+  -Dgrpc.server.port=9090 \
+  "-Dgrpc.server.max-inbound-message-size=${TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE}" \
+  "-Dgrpc.server.max-outbound-message-size=${TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE}" \
+  "-Dgrpc.server.numThreads=${TIKA_GRPC_NUM_THREADS}" \
+  --add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED \
+  --add-opens=java.base/jdk.internal.misc=ALL-UNNAMED \
+  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
+  --add-opens=java.management/com.sun.jmx.mbeanserver=ALL-UNNAMED \
+  --add-opens=jdk.internal.jvmstat/sun.jvmstat.monitor=ALL-UNNAMED \
+  --add-opens=java.base/sun.reflect.generics.reflectiveObjects=ALL-UNNAMED \
+  --add-opens=java.base/java.io=ALL-UNNAMED \
+  --add-opens=java.base/java.nio=ALL-UNNAMED \
+  --add-opens=java.base/java.util=ALL-UNNAMED \
+  --add-opens=java.base/java.lang=ALL-UNNAMED \
+  -Djava.net.preferIPv4Stack=true \
+  "-Dplugins.pluginDirs=/tika/plugins" \
+  -jar "/tika/libs/tika-grpc-${TIKA_VERSION}.jar"

Review Comment:
   The script doesn't validate that TIKA_VERSION is set before using it in the 
jar path. If TIKA_VERSION is empty (which it will be since the Dockerfile ARG 
VERSION is never passed), the java command will fail with a confusing error. 
Add validation at the start of the script to check that TIKA_VERSION is set and 
not empty.
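
   A minimal sketch of such a guard. The function name `validate_tika_version`
   is hypothetical and not part of the PR; the subshell calls exist only to
   demonstrate both outcomes:

   ```bash
   # Hypothetical guard for the top of start-tika-grpc.sh (not part of the PR).
   validate_tika_version() {
     if [ -z "${TIKA_VERSION:-}" ]; then
       echo "ERROR: TIKA_VERSION is not set; cannot locate /tika/libs/tika-grpc-<version>.jar" >&2
       return 1
     fi
   }

   # Demonstration in subshells so the caller's environment is left untouched.
   if (unset TIKA_VERSION; validate_tika_version) 2>/dev/null; then
     unset_result="accepted"
   else
     unset_result="rejected"
   fi

   if (TIKA_VERSION=4.0.0-SNAPSHOT; validate_tika_version); then
     set_result="accepted"
   else
     set_result="rejected"
   fi

   echo "unset: ${unset_result}, set: ${set_result}"
   ```

   In the start script itself the call would presumably be
   `validate_tika_version || exit 1`, placed before `exec java`.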



##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+    echo "ERROR: Docker is not installed or not in PATH. Please install Docker first."
+    exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+    echo "Environment variable TIKA_VERSION is required, and should match the maven project version of Tika"
+    exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+    RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+    ## Remove '-SNAPSHOT' from the version string
+    RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+    plugin_name=$(basename "$dir")
+    zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+    if [ -f "$zip_file" ]; then
+        cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+    else
+        echo "WARNING: Plugin file $zip_file does not exist, skipping."
+    fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+    "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+    "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+    package_name=$(basename "$parser_package")
+    jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+    if [ -f "$jar_file" ]; then
+        cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+    else
+        echo "Parser package file $jar_file does not exist, skipping."
+    fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"
+
+cp -v "tika-grpc/docker-build/Dockerfile" "${OUT_DIR}/Dockerfile"
+
+cd "${OUT_DIR}" || exit
+
+echo "Running docker build from directory: $(pwd)"
+
+IMAGE_TAGS=()
+if [[ -n "${AWS_ACCOUNT_ID}" ]]; then
+    if ! aws ecr get-login-password --region "${AWS_REGION}" | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"; then
+        echo "ERROR: Failed to authenticate with AWS ECR"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${AZURE_REGISTRY_NAME}" ]]; then
+    if ! az acr login --name "${AZURE_REGISTRY_NAME}"; then
+        echo "ERROR: Failed to authenticate with Azure Container Registry"
+        exit 1
+    fi
+    IMAGE_TAGS+=("-t ${AZURE_REGISTRY_NAME}.azurecr.io/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [[ -n "${DOCKER_ID}" ]]; then
+    IMAGE_TAGS+=("-t ${DOCKER_ID}/${PROJECT_NAME}:${RELEASE_IMAGE_TAG}")
+fi
+
+if [ ${#IMAGE_TAGS[@]} -eq 0 ]; then
+    echo "No image tags specified, skipping Docker build step. To enable build, set AWS_ACCOUNT_ID, AZURE_REGISTRY_NAME, and/or DOCKER_ID environment variables."
+    exit 0
+fi
+
+tag="${IMAGE_TAGS[*]}"
+if [ "${MULTI_ARCH}" == "true" ]; then
+  echo "Building multi arch image"
+  docker buildx create --name tikabuilder
+  # Pin binfmt to a specific digest for security
+  # see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
+  docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install amd64
+  docker run --rm --privileged tonistiigi/binfmt:latest@sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2 --install arm64
+  docker buildx build \
+      --builder=tikabuilder . \
+      ${tag} \
+      --platform linux/amd64,linux/arm64 \
+      --push
+  docker buildx stop tikabuilder
+  docker buildx rm tikabuilder
+else
+  echo "Building single arch image"
+  # build single arch
+  docker build . ${tag}

Review Comment:
   The docker build commands don't check their exit status. If the multi-arch
   build succeeds but the push fails, or if the single-arch build itself fails,
   the script does not detect the failure or report it. Add error checking
   after the docker build commands to ensure build success.
   ```suggestion
     if ! docker buildx build \
         --builder=tikabuilder . \
         ${tag} \
         --platform linux/amd64,linux/arm64 \
         --push; then
       echo "ERROR: Docker multi-arch build or push failed"
       # Attempt to clean up the buildx builder even on failure
       docker buildx stop tikabuilder >/dev/null 2>&1 || true
       docker buildx rm tikabuilder >/dev/null 2>&1 || true
       exit 1
     fi
     docker buildx stop tikabuilder
     docker buildx rm tikabuilder
   else
     echo "Building single arch image"
     # build single arch
     if ! docker build . ${tag}; then
       echo "ERROR: Docker single-arch build failed"
       exit 1
     fi
   ```



##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin
+ARG JRE='openjdk-17-jre-headless'
+ARG VERSION
+ARG TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_NUM_THREADS=4
+RUN set -eux \
+    && apt-get update \
+    && apt-get install --yes --no-install-recommends gnupg2 software-properties-common \
+    && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends $JRE \
+        gdal-bin \
+        tesseract-ocr \
+        tesseract-ocr-eng \
+        tesseract-ocr-ita \
+        tesseract-ocr-fra \
+        tesseract-ocr-spa \
+        tesseract-ocr-deu \
+    && echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true | debconf-set-selections \
+    && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends \
+        xfonts-utils \
+        fonts-freefont-ttf \
+        fonts-liberation \
+        ttf-mscorefonts-installer \
+        wget \
+        cabextract \
+    && apt-get clean -y \
+    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+EXPOSE 9090
+ENV TIKA_VERSION=$VERSION
+ENV TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=$TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE
+ENV TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=$TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE
+ENV TIKA_GRPC_NUM_THREADS=$TIKA_GRPC_NUM_THREADS
+RUN chmod +x "/tika/bin/start-tika-grpc.sh"

Review Comment:
   The Dockerfile doesn't specify a user, so the container will run as root by 
default. This is a security risk. Consider adding a non-root user to run the 
Java application, similar to best practices for production Docker images.
   ```suggestion
   RUN chmod +x "/tika/bin/start-tika-grpc.sh" \
       && groupadd --system tika \
       && useradd --system --no-create-home --gid tika --shell /usr/sbin/nologin tika \
       && chown -R tika:tika /tika
   USER tika
   ```



##########
tika-grpc/docker-build/Dockerfile:
##########
@@ -0,0 +1,39 @@
+FROM ubuntu:22.04
+COPY libs/ /tika/libs/
+COPY plugins/ /tika/plugins/
+COPY config/ /tika/config/
+COPY bin/ /tika/bin
+ARG JRE='openjdk-17-jre-headless'
+ARG VERSION
+ARG TIKA_GRPC_MAX_INBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_MAX_OUTBOUND_MESSAGE_SIZE=104857600
+ARG TIKA_GRPC_NUM_THREADS=4
+RUN set -eux \
+    && apt-get update \
+    && apt-get install --yes --no-install-recommends gnupg2 software-properties-common \
+    && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends $JRE \
+        gdal-bin \
+        tesseract-ocr \
+        tesseract-ocr-eng \
+        tesseract-ocr-ita \
+        tesseract-ocr-fra \
+        tesseract-ocr-spa \
+        tesseract-ocr-deu \
+    && echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true | debconf-set-selections \
+    && DEBIAN_FRONTEND=noninteractive apt-get install --yes --no-install-recommends \
+        xfonts-utils \
+        fonts-freefont-ttf \
+        fonts-liberation \
+        ttf-mscorefonts-installer \
+        wget \
+        cabextract \
+    && apt-get clean -y \
+    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+EXPOSE 9090
+ENV TIKA_VERSION=$VERSION

Review Comment:
   The Dockerfile uses an ARG VERSION that is defined but never set during the 
build process. The docker-build.sh script doesn't pass this build argument, 
which means TIKA_VERSION environment variable will be empty in the resulting 
image. Add --build-arg VERSION="${TIKA_VERSION}" to the docker build and docker 
buildx build commands.
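
   A minimal sketch of how the two commands could carry the build argument. The
   tag value and the array names (`tag_args`, `build_args`) are illustrative
   assumptions; the PR's script joins its tags into a single `${tag}` string
   instead. The commands are printed rather than executed so the sketch runs
   without a Docker daemon:

   ```bash
   # Example values; the tag and version are illustrative, not from a real build.
   TIKA_VERSION="${TIKA_VERSION:-4.0.0-SNAPSHOT}"
   tag_args=(-t "myusername/tika-grpc:4.0.0")

   # Forward the Maven project version to the Dockerfile's ARG VERSION.
   build_args=(--build-arg "VERSION=${TIKA_VERSION}")

   single_arch_cmd=(docker build . "${build_args[@]}" "${tag_args[@]}")
   multi_arch_cmd=(docker buildx build --builder=tikabuilder .
       "${build_args[@]}" "${tag_args[@]}"
       --platform linux/amd64,linux/arm64 --push)

   # Print instead of executing so the sketch runs without a Docker daemon.
   printf '%s ' "${single_arch_cmd[@]}"; echo
   printf '%s ' "${multi_arch_cmd[@]}"; echo
   ```

   With the argument passed, `ENV TIKA_VERSION=$VERSION` in the Dockerfile
   receives a value, and start-tika-grpc.sh can resolve the jar path.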



##########
tika-grpc/docker-build/docker-build.sh:
##########
@@ -0,0 +1,127 @@
+#!/bin/bash
+# This script is intended to be run from Maven exec plugin during the package phase of maven build
+
+# Check if Docker is installed
+if ! command -v docker &> /dev/null; then
+    echo "ERROR: Docker is not installed or not in PATH. Please install Docker first."
+    exit 1
+fi
+
+if [ -z "${TIKA_VERSION}" ]; then
+    echo "Environment variable TIKA_VERSION is required, and should match the maven project version of Tika"
+    exit 1
+fi
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+cd "${SCRIPT_DIR}/../../" || exit
+
+OUT_DIR=target/tika-docker
+
+MULTI_ARCH=${MULTI_ARCH:-false}
+AWS_REGION=${AWS_REGION:-us-west-2}
+AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-}
+AZURE_REGISTRY_NAME=${AZURE_REGISTRY_NAME:-}
+DOCKER_ID=${DOCKER_ID:-}
+PROJECT_NAME=${PROJECT_NAME:-tika-grpc}
+
+# If RELEASE_IMAGE_TAG not specified, use TIKA_VERSION
+if [[ -z "${RELEASE_IMAGE_TAG}" ]]; then
+    RELEASE_IMAGE_TAG="${TIKA_VERSION}"
+    ## Remove '-SNAPSHOT' from the version string
+    RELEASE_IMAGE_TAG="${RELEASE_IMAGE_TAG//-SNAPSHOT/}"
+fi
+
+mkdir -p "${OUT_DIR}/libs"
+mkdir -p "${OUT_DIR}/plugins"
+mkdir -p "${OUT_DIR}/config"
+mkdir -p "${OUT_DIR}/bin"
+cp -v -r "tika-grpc/target/tika-grpc-${TIKA_VERSION}.jar" "${OUT_DIR}/libs"
+
+# Copy all tika-pipes plugin zip files
+for dir in tika-pipes/tika-pipes-plugins/*/; do
+    plugin_name=$(basename "$dir")
+    zip_file="${dir}target/${plugin_name}-${TIKA_VERSION}.zip"
+    if [ -f "$zip_file" ]; then
+        cp -v -r "$zip_file" "${OUT_DIR}/plugins"
+    else
+        echo "WARNING: Plugin file $zip_file does not exist, skipping."
+    fi
+done
+
+# Copy parser package jars as plugins
+parser_packages=(
+    "tika-parsers/tika-parsers-standard/tika-parsers-standard-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-scientific-package"
+    "tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package"
+    "tika-parsers/tika-parsers-ml/tika-parser-nlp-package"
+)
+
+for parser_package in "${parser_packages[@]}"; do
+    package_name=$(basename "$parser_package")
+    jar_file="${parser_package}/target/${package_name}-${TIKA_VERSION}.jar"
+    if [ -f "$jar_file" ]; then
+        cp -v -r "$jar_file" "${OUT_DIR}/plugins"
+    else
+        echo "Parser package file $jar_file does not exist, skipping."
+    fi
+done
+
+cp -v -r "tika-grpc/docker-build/start-tika-grpc.sh" "${OUT_DIR}/bin"

Review Comment:
   The script uses 'cp -v -r' on files (not directories) which doesn't require 
the '-r' flag. The '-r' flag is only necessary for recursive directory copying. 
This is used on lines 39, 46, 64, and 70. Remove '-r' for single file 
operations to avoid confusion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
