Spark + Hive 4 Integration Guide (Practical Approach)
1. Overview

This write-up was motivated by my personal use of Hive as a stable and
reliable data warehouse for storing my bank records, which span over a
decade. Migrating away from Hive would have been both time-consuming and
impractical, so I needed to find a way to upgrade my lab
environment—including Hadoop, Hive, and Spark—while preserving the existing
architecture.

The objective was to identify a cost-effective and practical solution that
maintains the current setup, which relies on Airflow as the workflow
orchestration tool to populate Hive tables on demand.

This document explains how Apache Spark interacts with Apache Hive 4 in a
modern architecture. Because the Hive client bundled with Spark 3.x is
incompatible with the Hive 4 metastore API, a decoupled approach is
required. The environment used consists of *Spark 3.5.5, Hadoop 3.4.1, and
Hive 4.0.1*.

2. Architecture Diagram

            +------------------+
            |      Spark       |
            |  (Processing)    |
            +--------+---------+
                     |
                     | Write (Parquet)
                     v
            +------------------+
            |      HDFS        |
            | (Data Storage)   |
            +--------+---------+
                     |
                     | Read
                     v
            +------------------+
            |      Hive        |
            | (SQL / Metadata) |
            +------------------+

3. Key Principle

Spark does NOT talk to the Hive 4 metastore directly.
Instead:
- Spark writes data to HDFS as Parquet
- Hive reads that data from HDFS through external tables

4. Real Example (Based on My Pipeline)

Spark transformation and write:

import org.apache.spark.sql.functions.to_date
import spark.implicits._

val df = spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .csv("hdfs://rhes75:9000/data/stg/...")

// Drop the repeated header row, then type and rename the columns
val finalDf = df
  .filter($"_c0" =!= "Date")
  .select(
    to_date($"_c0", "dd/MM/yyyy").as("transactiondate"),
    $"_c1".as("description")
  )

finalDf.write.mode("overwrite")
  .parquet("hdfs://.../parquet/path")

Hive table creation via JDBC:

CREATE EXTERNAL TABLE elayer.bank_123456_stg (
  transactiondate DATE,
  description STRING
)
STORED AS PARQUET
LOCATION 'hdfs://.../parquet/path';

5. Hive JDBC Example

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM table")
if (rs.next()) println(rs.getLong(1))
stmt.close(); conn.close()

6. Troubleshooting (Based on My Issues)

Problem: Invalid method name 'get_table'

Cause: Spark 3.x trying to talk to the Hive 4 metastore directly (not supported)

Fix: Use the decoupled approach: Spark writes to HDFS, Hive reads via external tables

Problem: No data in staging table

Cause: Table already existed with old LOCATION

Fix:

DROP TABLE elayer.bank_123456_stg before recreating
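
The drop-and-recreate fix can be scripted over the same JDBC connection used
elsewhere in this guide. A minimal sketch; the connection URL and table name
come from the examples above, while the Parquet LOCATION shown is a
hypothetical placeholder (the real path was elided earlier):

```scala
import java.sql.DriverManager

object RecreateStagingTable {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default")
    val stmt = conn.createStatement()
    try {
      // Drop the old definition so the new LOCATION takes effect
      stmt.execute("DROP TABLE IF EXISTS elayer.bank_123456_stg")
      stmt.execute(
        """CREATE EXTERNAL TABLE elayer.bank_123456_stg (
          |  transactiondate DATE,
          |  description STRING
          |)
          |STORED AS PARQUET
          |LOCATION 'hdfs://rhes75:9000/data/parquet/bank_123456_stg'""".stripMargin)  // hypothetical path
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```

Because the table is EXTERNAL, the DROP only removes the metastore entry;
the Parquet files on HDFS are left intact.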

Problem: HDFS write errors (map.xml)

Cause: DataNode capacity / replication issue

Fix: Check DataNode free space and the replication factor (e.g. hdfs dfsadmin -report)
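
Capacity problems like this can also be checked programmatically before a
large write. A small sketch using the Hadoop FileSystem API; the NameNode
URL follows the earlier examples, and it assumes the Hadoop client
libraries are on the classpath:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object HdfsCapacityCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://rhes75:9000"), new Configuration())
    // FsStatus reports aggregate bytes across the DataNodes
    val status = fs.getStatus
    println(s"capacity:  ${status.getCapacity}")
    println(s"used:      ${status.getUsed}")
    println(s"remaining: ${status.getRemaining}")
  }
}
```

The same numbers are visible from the command line with hdfs dfsadmin -report.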

7. Best Practices

- Always use EXTERNAL tables for Spark-written data

- Always control LOCATION explicitly

- Drop & recreate staging tables when path changes

- Keep Spark and Hive responsibilities separate
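
The second practice, controlling LOCATION explicitly, is easy to verify
over JDBC. A hedged sketch that prints the registered location of the
staging table from the example above (DESCRIBE FORMATTED emits a
"Location:" row):

```scala
import java.sql.DriverManager

object ShowTableLocation {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default")
    val stmt = conn.createStatement()
    try {
      // Scan the DESCRIBE FORMATTED output for the Location row
      val rs = stmt.executeQuery("DESCRIBE FORMATTED elayer.bank_123456_stg")
      while (rs.next()) {
        val col = Option(rs.getString(1)).getOrElse("")
        if (col.trim.startsWith("Location")) println(rs.getString(2).trim)
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
```

If the printed path does not match what Spark wrote, the table is reading
stale data and should be dropped and recreated.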

8. Summary

Spark + Hive 4 integration works reliably when using HDFS as the bridge.
Avoid direct metastore usage from Spark.
HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
