nsivabalan commented on code in PR #18939:
URL: https://github.com/apache/hudi/pull/18939#discussion_r3487471079


##########
hudi-notebooks/notebooks/spark4/utils.py:
##########
@@ -0,0 +1,262 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Spark 4 notebook helpers. Copied into the notebook home as utils.py when
+# building the apachehudi/spark4-hudi (Spark 4) image.
+
+import os
+import urllib.request
+
+from pyspark.sql import SparkSession
+from IPython.display import HTML, display as display_html
+
+_spark = None
+
+# Default number of rows to show in display()
+DEFAULT_DISPLAY_ROWS = 100
+
+# Reused HTML/CSS for DataFrame display (avoids string rebuild on every call)
+_DISPLAY_TABLE_CSS = """
+<style>
+    .dataframe {
+        border-radius: 0.5rem;
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
+        overflow-x: auto;
+        border: 1px solid #e2e8f0;
+    }
+    .dataframe th {
+        background-color: #f1f5f9;
+        color: #1f2937;
+        font-weight: 600;
+        padding: 0.75rem 1.5rem;
+        text-align: left;
+        border-bottom: 2px solid #e2e8f0;
+    }
+    .dataframe td {
+        padding: 0.75rem 1.5rem;
+        border-bottom: 1px solid #e2e8f0;
+    }
+    .dataframe tr:nth-child(even) {
+        background-color: #f8fafc;
+    }
+    .dataframe tr:hover {
+        background-color: #e2e8f0;
+        transition: background-color 0.2s ease-in-out;
+    }
+</style>
+"""
+
+
+def get_spark_session(
+    app_name="Hudi-Notebooks",
+    log_level="WARN",
+    hudi_version=None,
+):
+    """
+    Initialize a SparkSession (singleton).
+
+    Connects to the in-container Spark standalone master started by entrypoint.sh.
+    The master URL, driver memory (4g) and executor memory (4g) are configured in
+    $SPARK_HOME/conf/spark-defaults.conf.
+
+    Parameters:
+    - app_name (str): Optional name for the Spark application.
+    - log_level (str): Log level for Spark (DEBUG, INFO, WARN, ERROR). Defaults to WARN.
+    - hudi_version (str): Hudi bundle version. Defaults to the HUDI_VERSION baked into
+      the image (1.1.1 on the Spark 4 image).
+
+    Returns:
+    - SparkSession object
+    """
+    global _spark
+
+    if _spark is not None:
+        return _spark
+
+    if hudi_version is None:
+        hudi_version = os.getenv("HUDI_VERSION", "1.1.1")
+
+    hudi_home = os.getenv("HUDI_HOME", "/opt/hudi")
+    spark_version = os.getenv("SPARK_VERSION", "4.0.2")
+    spark_minor_version = ".".join(spark_version.split(".")[:2])
+    scala_version = os.getenv("SCALA_VERSION", "2.13")
+    bundle_name = f"hudi-spark{spark_minor_version}-bundle_{scala_version}"
+    bundle_jar = f"{bundle_name}-{hudi_version}.jar"
+
+    # Resolve the Hudi bundle to a local path. The image pre-downloads it under
+    # $HUDI_HOME; if it is missing, fetch it from Maven Central now.
+    hudi_local_jar = os.path.join(hudi_home, hudi_version, bundle_jar)
+    if not os.path.exists(hudi_local_jar):
+        os.makedirs(os.path.dirname(hudi_local_jar), exist_ok=True)
+        hudi_jar_url = (
+            f"https://repo1.maven.org/maven2/org/apache/hudi/"
+            f"{bundle_name}/{hudi_version}/{bundle_jar}"
+        )
+        print(f"Hudi bundle not found at {hudi_local_jar}; downloading {hudi_jar_url} ...")
+        urllib.request.urlretrieve(hudi_jar_url, hudi_local_jar)
+        
+    lance_version = "0.5.0"
+    lance_home = "/opt/lance"
+    lance_jar = f"lance-spark-bundle-{spark_minor_version}_{scala_version}-{lance_version}.jar"
+    lance_local_jar = os.path.join(lance_home, lance_version, lance_jar)
+    if not os.path.exists(lance_local_jar):
+        os.makedirs(os.path.dirname(lance_local_jar), exist_ok=True)
+        lance_jar_url = (
+            f"https://repo1.maven.org/maven2/org/lance/lance-spark-bundle-{spark_minor_version}_{scala_version}/{lance_version}/{lance_jar}"
+        )
+        print(f"Lance bundle not found at {lance_local_jar}; downloading {lance_jar_url} ...")
+        urllib.request.urlretrieve(lance_jar_url, lance_local_jar)

Review Comment:
   The Lance bundle is fetched from Maven Central at Spark 4 session start 
whenever the jar is not already on disk (no retry/backoff), and it is added to 
every session's `extraClassPath`. Concerns:
   
   1. First-time session startup pays a network round-trip — air-gapped demo 
environments can't run the Spark 4 image even though the Hudi bundle is 
pre-baked.
   2. Only `09_hudi_1_2_0_features.ipynb` (the Lance-interop section near line 
3479) actually needs Lance; every other Spark 4 session pays the cost.
   3. Lance isn't mentioned in the PR description's "Spark 4 stack" section — 
it sneaked into `utils.py` as a notebook dependency.
   
   Could we either:
   - Pre-bake the Lance jar in `Dockerfile.spark4` (mirror the Hudi bundle 
handling at line 50), or
   - Gate the Lance download behind a flag the caller opts into (e.g. 
`get_spark_session(include_lance=True)`), so only the one notebook that needs 
it loads it?
   
   The second option also makes the dependency explicit in the notebook that 
uses it, which is good demo hygiene.
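
A minimal sketch of the opt-in variant, assuming two hypothetical helpers (`resolve_jar` and `build_extra_classpath`, neither of which exists in the PR) that `get_spark_session()` could call. The `fetch` parameter is only there so the download can be stubbed out; the real code calls `urllib.request.urlretrieve` directly.

```python
import os
import urllib.request


def resolve_jar(local_path, url, fetch=urllib.request.urlretrieve):
    """Return local_path, downloading it from url only if it is not on disk."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        print(f"{local_path} not found; downloading {url} ...")
        fetch(url, local_path)
    return local_path


def build_extra_classpath(hudi_local_jar, lance_local_jar=None, include_lance=False):
    """Build the value for spark.driver/executor.extraClassPath.

    The Lance jar is appended only when the caller opts in.
    """
    jars = [hudi_local_jar]
    if include_lance:
        if lance_local_jar is None:
            raise ValueError("include_lance=True requires lance_local_jar")
        jars.append(lance_local_jar)
    return ":".join(jars)
```

Inside `get_spark_session(include_lance=False)`, the Lance jar would then be resolved only under `if include_lance:`, so sessions that never touch Lance make no network call for it.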



##########
hudi-notebooks/notebooks/spark4/utils.py:
##########
@@ -0,0 +1,262 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Spark 4 notebook helpers. Copied into the notebook home as utils.py when
+# building the apachehudi/spark4-hudi (Spark 4) image.
+
+import os
+import urllib.request
+
+from pyspark.sql import SparkSession
+from IPython.display import HTML, display as display_html
+
+_spark = None
+
+# Default number of rows to show in display()
+DEFAULT_DISPLAY_ROWS = 100
+
+# Reused HTML/CSS for DataFrame display (avoids string rebuild on every call)
+_DISPLAY_TABLE_CSS = """
+<style>
+    .dataframe {
+        border-radius: 0.5rem;
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
+        overflow-x: auto;
+        border: 1px solid #e2e8f0;
+    }
+    .dataframe th {
+        background-color: #f1f5f9;
+        color: #1f2937;
+        font-weight: 600;
+        padding: 0.75rem 1.5rem;
+        text-align: left;
+        border-bottom: 2px solid #e2e8f0;
+    }
+    .dataframe td {
+        padding: 0.75rem 1.5rem;
+        border-bottom: 1px solid #e2e8f0;
+    }
+    .dataframe tr:nth-child(even) {
+        background-color: #f8fafc;
+    }
+    .dataframe tr:hover {
+        background-color: #e2e8f0;
+        transition: background-color 0.2s ease-in-out;
+    }
+</style>
+"""
+
+
+def get_spark_session(
+    app_name="Hudi-Notebooks",
+    log_level="WARN",
+    hudi_version=None,
+):
+    """
+    Initialize a SparkSession (singleton).
+
+    Connects to the in-container Spark standalone master started by entrypoint.sh.
+    The master URL, driver memory (4g) and executor memory (4g) are configured in
+    $SPARK_HOME/conf/spark-defaults.conf.
+
+    Parameters:
+    - app_name (str): Optional name for the Spark application.
+    - log_level (str): Log level for Spark (DEBUG, INFO, WARN, ERROR). Defaults to WARN.
+    - hudi_version (str): Hudi bundle version. Defaults to the HUDI_VERSION baked into
+      the image (1.1.1 on the Spark 4 image).
+
+    Returns:
+    - SparkSession object
+    """
+    global _spark
+
+    if _spark is not None:
+        return _spark
+
+    if hudi_version is None:
+        hudi_version = os.getenv("HUDI_VERSION", "1.1.1")
+
+    hudi_home = os.getenv("HUDI_HOME", "/opt/hudi")
+    spark_version = os.getenv("SPARK_VERSION", "4.0.2")
+    spark_minor_version = ".".join(spark_version.split(".")[:2])
+    scala_version = os.getenv("SCALA_VERSION", "2.13")
+    bundle_name = f"hudi-spark{spark_minor_version}-bundle_{scala_version}"
+    bundle_jar = f"{bundle_name}-{hudi_version}.jar"
+
+    # Resolve the Hudi bundle to a local path. The image pre-downloads it under
+    # $HUDI_HOME; if it is missing, fetch it from Maven Central now.
+    hudi_local_jar = os.path.join(hudi_home, hudi_version, bundle_jar)
+    if not os.path.exists(hudi_local_jar):
+        os.makedirs(os.path.dirname(hudi_local_jar), exist_ok=True)
+        hudi_jar_url = (
+            f"https://repo1.maven.org/maven2/org/apache/hudi/"
+            f"{bundle_name}/{hudi_version}/{bundle_jar}"
+        )
+        print(f"Hudi bundle not found at {hudi_local_jar}; downloading {hudi_jar_url} ...")
+        urllib.request.urlretrieve(hudi_jar_url, hudi_local_jar)
+        

Review Comment:
   nit: line 110 has trailing whitespace, and the Lance block (111-121) uses 
inconsistent indentation relative to the surrounding function body. Worth a 
quick clean-up while addressing the broader Lance concern above.



##########
hudi-notebooks/notebooks/spark4/utils.py:
##########
@@ -0,0 +1,262 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Spark 4 notebook helpers. Copied into the notebook home as utils.py when
+# building the apachehudi/spark4-hudi (Spark 4) image.
+
+import os
+import urllib.request
+
+from pyspark.sql import SparkSession
+from IPython.display import HTML, display as display_html
+
+_spark = None
+
+# Default number of rows to show in display()
+DEFAULT_DISPLAY_ROWS = 100
+
+# Reused HTML/CSS for DataFrame display (avoids string rebuild on every call)
+_DISPLAY_TABLE_CSS = """
+<style>
+    .dataframe {
+        border-radius: 0.5rem;
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
+        overflow-x: auto;
+        border: 1px solid #e2e8f0;
+    }
+    .dataframe th {
+        background-color: #f1f5f9;
+        color: #1f2937;
+        font-weight: 600;
+        padding: 0.75rem 1.5rem;
+        text-align: left;
+        border-bottom: 2px solid #e2e8f0;
+    }
+    .dataframe td {
+        padding: 0.75rem 1.5rem;
+        border-bottom: 1px solid #e2e8f0;
+    }
+    .dataframe tr:nth-child(even) {
+        background-color: #f8fafc;
+    }
+    .dataframe tr:hover {
+        background-color: #e2e8f0;
+        transition: background-color 0.2s ease-in-out;
+    }
+</style>
+"""
+
+
+def get_spark_session(
+    app_name="Hudi-Notebooks",
+    log_level="WARN",
+    hudi_version=None,
+):
+    """
+    Initialize a SparkSession (singleton).
+
+    Connects to the in-container Spark standalone master started by entrypoint.sh.
+    The master URL, driver memory (4g) and executor memory (4g) are configured in
+    $SPARK_HOME/conf/spark-defaults.conf.
+
+    Parameters:
+    - app_name (str): Optional name for the Spark application.
+    - log_level (str): Log level for Spark (DEBUG, INFO, WARN, ERROR). Defaults to WARN.
+    - hudi_version (str): Hudi bundle version. Defaults to the HUDI_VERSION baked into
+      the image (1.1.1 on the Spark 4 image).
+
+    Returns:
+    - SparkSession object
+    """
+    global _spark
+
+    if _spark is not None:
+        return _spark
+
+    if hudi_version is None:
+        hudi_version = os.getenv("HUDI_VERSION", "1.1.1")
+
+    hudi_home = os.getenv("HUDI_HOME", "/opt/hudi")
+    spark_version = os.getenv("SPARK_VERSION", "4.0.2")
+    spark_minor_version = ".".join(spark_version.split(".")[:2])
+    scala_version = os.getenv("SCALA_VERSION", "2.13")
+    bundle_name = f"hudi-spark{spark_minor_version}-bundle_{scala_version}"
+    bundle_jar = f"{bundle_name}-{hudi_version}.jar"
+
+    # Resolve the Hudi bundle to a local path. The image pre-downloads it under
+    # $HUDI_HOME; if it is missing, fetch it from Maven Central now.
+    hudi_local_jar = os.path.join(hudi_home, hudi_version, bundle_jar)
+    if not os.path.exists(hudi_local_jar):
+        os.makedirs(os.path.dirname(hudi_local_jar), exist_ok=True)
+        hudi_jar_url = (
+            f"https://repo1.maven.org/maven2/org/apache/hudi/"
+            f"{bundle_name}/{hudi_version}/{bundle_jar}"
+        )
+        print(f"Hudi bundle not found at {hudi_local_jar}; downloading {hudi_jar_url} ...")
+        urllib.request.urlretrieve(hudi_jar_url, hudi_local_jar)
+        
+    lance_version = "0.5.0"
+    lance_home = "/opt/lance"
+    lance_jar = f"lance-spark-bundle-{spark_minor_version}_{scala_version}-{lance_version}.jar"
+    lance_local_jar = os.path.join(lance_home, lance_version, lance_jar)
+    if not os.path.exists(lance_local_jar):
+        os.makedirs(os.path.dirname(lance_local_jar), exist_ok=True)
+        lance_jar_url = (
+            f"https://repo1.maven.org/maven2/org/lance/lance-spark-bundle-{spark_minor_version}_{scala_version}/{lance_version}/{lance_jar}"
+        )
+        print(f"Lance bundle not found at {lance_local_jar}; downloading {lance_jar_url} ...")
+        urllib.request.urlretrieve(lance_jar_url, lance_local_jar)
+    
+    extraclasspath = f"{hudi_local_jar}:{lance_local_jar}"
+    _spark = (
+        SparkSession.builder.appName(app_name)
+        .config("spark.driver.extraClassPath", extraclasspath)
+        .config("spark.executor.extraClassPath", extraclasspath)
+        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+        .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
+        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
+        .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
+        .enableHiveSupport()
+        .getOrCreate()
+    )
+
+    _spark.sparkContext.setLogLevel(log_level)
+    print(
+        f"SparkSession started with app name: {app_name}, "
+        f"Spark version: {spark_version}, Hudi version: {hudi_version}"
+    )
+
+    return _spark
+
+
+def stop_spark_session():
+    """Stop the global SparkSession and clear the singleton."""
+    global _spark
+    if _spark is not None:
+        _spark.stop()
+        _spark = None
+        print("SparkSession stopped successfully.")
+
+
+def ls(base_path):
+    """
+    List files or directories at the given MinIO S3 path.
+
+    Args:
+        base_path: Path starting with 's3a://' (e.g. s3a://warehouse/hudi_table/).
+    """
+    #if not base_path.startswith("s3a://"):
+    #    raise ValueError("Path must start with 's3a://'")
+
+    global _spark
+    if _spark is None:
+        raise RuntimeError("SparkSession not initialized. Call get_spark_session() first.")
+
+    try:
+        hadoop_conf = _spark._jsc.hadoopConfiguration()
+        fs = _spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
+        p = _spark._jvm.org.apache.hadoop.fs.Path(base_path)
+        if not fs.exists(p):
+            print(f"Path does not exist: {base_path}")
+            return []
+        status = fs.listStatus(p)
+        files = [str(file.getPath()) for file in status]
+        for f in files:
+            print(f)
+    except Exception as e:
+        print(f"Exception occurred while listing files from path {base_path}", e)
+
+
+def drop_table(table_name: str = None, table_path: str = None):
+    try:
+        _spark.sql(f"DROP TABLE IF EXISTS {table_name}")
+        print(f"✓ Table '{table_name}' dropped successfully.")

Review Comment:
   Unlike `ls()` at lines 164-167, `drop_table` (and `create_table` at 195, 
`desc_table` at 208, `get_count` at 215, `insert_data` at 222) references the 
module-global `_spark` without checking whether `get_spark_session()` has been 
called. If a user invokes any of these helpers before initializing the session, 
they get a confusing `AttributeError: 'NoneType' object has no attribute 'sql'` 
from the first `_spark.sql(...)` call instead of the helpful 
`RuntimeError("SparkSession not initialized. Call get_spark_session() first.")` 
that `ls()` raises.
   
   Could you factor the None-check into a small helper (or just inline it at 
the top of each of these functions) for parity?
   
   ```python
   def _require_spark():
       if _spark is None:
           raise RuntimeError("SparkSession not initialized. Call get_spark_session() first.")
       return _spark
   ```



##########
hudi-notebooks/Dockerfile.spark4:
##########
@@ -0,0 +1,76 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+ARG SPARK_VERSION=${SPARK_VERSION:-4.0.2}
+ARG JAVA_VERSION=${JAVA_VERSION:-17}
+ARG SCALA_VERSION=${SCALA_VERSION:-2.13}
+ARG PYTHON_VERSION=${PYTHON_VERSION:-3}
+FROM apache/spark:$SPARK_VERSION-scala$SCALA_VERSION-java$JAVA_VERSION-python$PYTHON_VERSION-ubuntu
+
+USER root
+
+ARG SPARK_VERSION=${SPARK_VERSION:-4.0.2} \
+    HADOOP_VERSION=${HADOOP_VERSION:-3.4.1} \
+    AWS_SDK_V2_VERSION=${AWS_SDK_V2_VERSION:-2.24.6} \
+    MVN_REPO_URL=https://repo1.maven.org/maven2 \
+    HUDI_VERSION=${HUDI_VERSION:-1.1.1} \
+    SCALA_VERSION=${SCALA_VERSION:-2.13} \
+    HUDI_HOME=${HUDI_HOME:-/opt/hudi} \
+    NOTEBOOK_HOME=${NOTEBOOK_HOME:-/opt/notebooks}
+
+ENV SPARK_VERSION=$SPARK_VERSION \
+    SCALA_VERSION=$SCALA_VERSION \
+    HUDI_VERSION=$HUDI_VERSION \
+    PATH=$SPARK_HOME/bin:$PATH \
+    NOTEBOOK_HOME=$NOTEBOOK_HOME \
+    HUDI_HOME=$HUDI_HOME
+
+ARG SPARK_MINOR_VERSION=${SPARK_VERSION%.*}
+ARG HUDI_SPARK_BUNDLE=hudi-spark${SPARK_MINOR_VERSION}-bundle_$SCALA_VERSION
+ARG HUDI_SPARK_BUNDLE_JAR=$HUDI_SPARK_BUNDLE-$HUDI_VERSION.jar
+
+RUN mkdir -p ${HUDI_HOME}/${HUDI_VERSION} $NOTEBOOK_HOME && \
+    wget -O $SPARK_HOME/jars/hadoop-aws.jar \
+        $MVN_REPO_URL/org/apache/hadoop/hadoop-aws/$HADOOP_VERSION/hadoop-aws-$HADOOP_VERSION.jar && \
+    wget -O $SPARK_HOME/jars/aws-sdk-bundle.jar \
+        $MVN_REPO_URL/software/amazon/awssdk/bundle/$AWS_SDK_V2_VERSION/bundle-$AWS_SDK_V2_VERSION.jar && \
+    wget -O ${HUDI_HOME}/${HUDI_VERSION}/${HUDI_SPARK_BUNDLE_JAR} \
+        $MVN_REPO_URL/org/apache/hudi/${HUDI_SPARK_BUNDLE}/${HUDI_VERSION}/${HUDI_SPARK_BUNDLE_JAR}
+
+COPY requirements.txt /tmp/requirements.txt
+COPY requirements-spark4.txt /tmp/requirements-spark4.txt
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends python3-pip cargo && \
+    pip3 install --upgrade pip && \
+    pip3 install --no-cache-dir -r /tmp/requirements.txt && \
+    pip3 install --no-cache-dir -r /tmp/requirements-spark4.txt && \
+    ln -sf /usr/bin/python3 /usr/bin/python && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* /tmp/requirements.txt /tmp/requirements-spark4.txt
+
+COPY notebooks/common/*.ipynb $NOTEBOOK_HOME/
+COPY notebooks/common/images/ $NOTEBOOK_HOME/images/
+COPY notebooks/spark4/utils.py $NOTEBOOK_HOME/utils.py
+COPY notebooks/spark4/*.ipynb $NOTEBOOK_HOME/
+COPY conf/spark4/ $SPARK_HOME/conf/
+COPY conf/hadoop/core-site.xml $SPARK_HOME/conf/core-site.xml
+COPY entrypoint.sh $SPARK_HOME/entrypoint.sh
+RUN chmod +x /opt/spark/entrypoint.sh

Review Comment:
   nit: this uses a hard-coded `/opt/spark/entrypoint.sh` path while the 
surrounding code at line 71 uses `$SPARK_HOME/entrypoint.sh`. Works because 
`$SPARK_HOME=/opt/spark` in the base image, but inconsistent — `RUN chmod +x 
$SPARK_HOME/entrypoint.sh` matches the previous line.



##########
hudi-notebooks/build.sh:
##########
@@ -17,14 +17,23 @@
 
 set -euo pipefail
 
-export HUDI_VERSION=${HUDI_VERSION:-1.0.2}
-export HUDI_VERSION_TAG=${HUDI_VERSION}
+export SPARK_HUDI_VERSION=${SPARK_HUDI_VERSION:-1.0.2}
+export SPARK_HUDI_VERSION_TAG=${SPARK_HUDI_VERSION}
 export SPARK_VERSION=${SPARK_VERSION:-3.4.4}
+
+# Spark 4 stack (Java 17 + Scala 2.13 + Hadoop 3.4.x + AWS SDK v2)
+export SPARK4_HUDI_VERSION=${SPARK4_HUDI_VERSION:-1.1.1}

Review Comment:
   Hudi version is hard-coded in four places: `build.sh:20` 
(`SPARK_HUDI_VERSION=1.0.2`) and `build.sh:25` (`SPARK4_HUDI_VERSION=1.1.1`), 
`Dockerfile.spark4:29`, and `notebooks/spark4/utils.py:90` 
(`os.getenv("HUDI_VERSION", "1.1.1")`) — plus the Spark 3 fallback at 
`spark3/utils.py:79` (`1.0.2`). When 1.2.0 (or 1.1.2) ships, these will drift 
in at least four places.
   
   Worth a single source of truth — e.g. a `.env` file at the top of 
`hudi-notebooks/` sourced by `build.sh` and passed as `--build-arg` to both 
Dockerfiles, so the `utils.py` fallback simply reads `HUDI_VERSION` from env 
(which it already does). Not a blocker; would help with the next Hudi release.
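
On the `utils.py` side, one possible shape (an assumption, not something the PR contains) is a hypothetical `resolve_hudi_version` helper that carries no version literal at all, so the value set at build time is the only source:

```python
import os


def resolve_hudi_version(explicit=None, env=os.environ):
    """Pick the Hudi version without a hard-coded fallback.

    Order: explicit argument, then the HUDI_VERSION environment variable
    (set via ENV in the image). Raises if neither is available.
    """
    if explicit:
        return explicit
    version = env.get("HUDI_VERSION")
    if not version:
        raise RuntimeError(
            "HUDI_VERSION is not set; pass hudi_version explicitly or rebuild the image."
        )
    return version
```

The trade-off is that running `utils.py` outside the image now requires the caller to set `HUDI_VERSION` or pass `hudi_version`, instead of silently using a stale default.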



##########
hudi-notebooks/Dockerfile.spark4:
##########
@@ -0,0 +1,76 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+ARG SPARK_VERSION=${SPARK_VERSION:-4.0.2}
+ARG JAVA_VERSION=${JAVA_VERSION:-17}
+ARG SCALA_VERSION=${SCALA_VERSION:-2.13}
+ARG PYTHON_VERSION=${PYTHON_VERSION:-3}
+FROM apache/spark:$SPARK_VERSION-scala$SCALA_VERSION-java$JAVA_VERSION-python$PYTHON_VERSION-ubuntu
+
+USER root
+
+ARG SPARK_VERSION=${SPARK_VERSION:-4.0.2} \
+    HADOOP_VERSION=${HADOOP_VERSION:-3.4.1} \
+    AWS_SDK_V2_VERSION=${AWS_SDK_V2_VERSION:-2.24.6} \
+    MVN_REPO_URL=https://repo1.maven.org/maven2 \
+    HUDI_VERSION=${HUDI_VERSION:-1.1.1} \
+    SCALA_VERSION=${SCALA_VERSION:-2.13} \
+    HUDI_HOME=${HUDI_HOME:-/opt/hudi} \
+    NOTEBOOK_HOME=${NOTEBOOK_HOME:-/opt/notebooks}
+
+ENV SPARK_VERSION=$SPARK_VERSION \
+    SCALA_VERSION=$SCALA_VERSION \
+    HUDI_VERSION=$HUDI_VERSION \
+    PATH=$SPARK_HOME/bin:$PATH \
+    NOTEBOOK_HOME=$NOTEBOOK_HOME \
+    HUDI_HOME=$HUDI_HOME
+
+ARG SPARK_MINOR_VERSION=${SPARK_VERSION%.*}
+ARG HUDI_SPARK_BUNDLE=hudi-spark${SPARK_MINOR_VERSION}-bundle_$SCALA_VERSION

Review Comment:
   nit: this derives `hudi-spark${SPARK_MINOR_VERSION}-bundle_${SCALA_VERSION}` 
from `SPARK_MINOR_VERSION=${SPARK_VERSION%.*}`. If `SPARK_VERSION` is bumped to 
`4.1.x` later, the bundle becomes `hudi-spark4.1-bundle_2.13` — which fails the 
build unless Hudi has already shipped a 4.1-targeted bundle. Worth a short 
comment noting the constraint, or pin `HUDI_SPARK_BUNDLE` explicitly via a 
build-arg so the version dependency is obvious.
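
To make the coupling concrete, here is a small Python sketch that mirrors the shell expansion `${SPARK_VERSION%.*}` (strip the shortest trailing `.something`). The function names are illustrative only. Note that `utils.py` derives the minor version differently, with `split(".")[:2]`, so the two agree for three-part versions such as `4.0.2` but not for a two-part value such as `4.0`.

```python
def spark_minor_version(spark_version):
    # Same effect as ${SPARK_VERSION%.*}: drop the last dot-separated component.
    return spark_version.rsplit(".", 1)[0]


def hudi_bundle_name(spark_version, scala_version="2.13"):
    minor = spark_minor_version(spark_version)
    return f"hudi-spark{minor}-bundle_{scala_version}"


print(hudi_bundle_name("4.0.2"))  # hudi-spark4.0-bundle_2.13
print(hudi_bundle_name("4.1.0"))  # hudi-spark4.1-bundle_2.13
```

The second name only resolves on Maven Central if a bundle with that exact artifact ID has been published for the chosen Hudi version.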



##########
hudi-notebooks/requirements-spark4.txt:
##########
@@ -0,0 +1,22 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# Spark 4 image only extras (installed in addition to requirements.txt).
+# hudi-rs: native Python binding for Apache Hudi (PyPI package name "hudi").
+hudi>=0.4.0

Review Comment:
   nit: `hudi>=0.4.0` has no upper bound, so it accepts every hudi-rs release 
from 0.4.0 onward, including future breaking ones. Consider a soft upper bound 
(`hudi>=0.4.0,<1.0`) so a major hudi-rs bump doesn't silently break the Spark 4 
notebook's `08_hudi_rs_example.ipynb`.
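
A rough illustration of what the two specifiers admit, using a simplified comparison that assumes plain three-part numeric versions (it is not a full PEP 440 implementation and ignores pre-releases). The `satisfies` helper is hypothetical.

```python
def _parse(version):
    # "0.4.0" -> (0, 4, 0); assumes purely numeric dot-separated parts.
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower="0.4.0", upper="1.0"):
    """True if lower <= version < upper; upper=None means no upper bound."""
    v = _parse(version)
    if v < _parse(lower):
        return False
    return upper is None or v < _parse(upper)


for candidate in ("0.3.9", "0.4.0", "0.9.3", "1.0.0"):
    print(candidate, satisfies(candidate), satisfies(candidate, upper=None))
```

With the upper bound, a `1.0.0` release is rejected; without it, pip would be free to install it.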



##########
hudi-notebooks/conf/spark4/hudi-defaults.conf:
##########
@@ -0,0 +1,22 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+hoodie.datasource.write.table.type          COPY_ON_WRITE

Review Comment:
   nit: the existing `conf/spark/hudi-defaults.conf` and the rest of Hudi 
convention separate keys and values with `=` or a single space; this file uses 
a column of aligned spaces, which is fine but inconsistent. Cosmetic — feel 
free to leave.


