Spark + Hive 4 Integration Guide (Practical Approach)
1. Overview
This write-up was motivated by my personal use of Hive as a stable and
reliable data warehouse for storing my bank records, which span over a
decade. Migrating away from Hive would have been both time-consuming and
impractical, so I needed to find a way to upgrade my lab
environment—including Hadoop, Hive, and Spark—while preserving the existing
architecture.
The objective was to identify a cost-effective and practical solution that
maintains the current setup, which relies on Airflow as the workflow
orchestration tool to populate Hive tables on demand.
This document explains how Apache Spark interacts with Apache Hive 4 in a
modern architecture. Because Spark 3.x is not compatible with the Hive 4
metastore API, a decoupled approach is required. The environment used
consists of *Spark 3.5.5, Hadoop 3.4.1 and Hive 4.0.1*.
2. Architecture Diagram
+------------------+
|      Spark       |
|   (Processing)   |
+--------+---------+
         |
         | Write (Parquet)
         v
+------------------+
|       HDFS       |
|  (Data Storage)  |
+--------+---------+
         |
         | Read
         v
+------------------+
|       Hive       |
| (SQL / Metadata) |
+------------------+
3. Key Principle
Spark does NOT talk to the Hive metastore directly.
Instead:
- Spark writes data to HDFS
- Hive reads data from HDFS using external tables
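In miniature, the pattern looks like this. This is a sketch, not my actual pipeline: it assumes a SparkSession `spark` and a DataFrame `df` already exist, and the HDFS path and table name below are placeholders.

```scala
// 1) Spark side: write the data as Parquet to an agreed HDFS path.
//    No Hive metastore client is involved at any point.
df.write
  .mode("overwrite")
  .parquet("hdfs://namenode:9000/warehouse/parquet/events") // placeholder path

// 2) Hive side: point an EXTERNAL table at that same path
//    (run via beeline or JDBC, never through Spark's catalog):
//      CREATE EXTERNAL TABLE mydb.events ( ... columns ... )
//      STORED AS PARQUET
//      LOCATION 'hdfs://namenode:9000/warehouse/parquet/events';
```

Because the table is EXTERNAL and the LOCATION is fixed, subsequent Spark overwrites of that path are visible to Hive immediately, with no metastore round trip from Spark.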
4. Real Example (Based on My Pipeline)
Spark transformation and write:
// Needed for the $"col" syntax and to_date()
import org.apache.spark.sql.functions.to_date
import spark.implicits._

// Read the raw CSV staging files from HDFS
val df = spark.read
  .option("inferSchema", "true")
  .option("header", "false")
  .csv("hdfs://rhes75:9000/data/stg/...")

// Drop the embedded header row and type the columns
val finalDf = df
  .filter($"_c0" =!= "Date")
  .select(
    to_date($"_c0", "dd/MM/yyyy").as("transactiondate"),
    $"_c1".as("description")
  )

// Write Parquet to the location the Hive external table will point at
finalDf.write
  .mode("overwrite")
  .parquet("hdfs://.../parquet/path")
Hive table creation via JDBC:
CREATE EXTERNAL TABLE elayer.bank_123456_stg (
  transactiondate DATE,
  description     STRING
)
STORED AS PARQUET
LOCATION 'hdfs://.../parquet/path';
5. Hive JDBC Example
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver") // hive-jdbc jar on the classpath
val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM elayer.bank_123456_stg")
if (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()
6. Troubleshooting (Based on My Issues)
Problem: Invalid method name: 'get_table'
Cause: Spark 3.x trying to call the Hive 4 metastore API directly (not supported)
Fix: Use the decoupled approach above: Spark writes Parquet to HDFS; Hive DDL and queries go over JDBC

Problem: No data in staging table
Cause: Table already existed and its LOCATION pointed at an old path
Fix: DROP TABLE elayer.bank_123456_stg before recreating it with the new LOCATION

Problem: HDFS write errors (map.xml)
Cause: DataNode capacity / replication issue
Fix: Check DataNode free space and that the replication factor does not exceed the number of live DataNodes
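The drop-and-recreate fix above can be scripted from the driver over JDBC so the staging table is rebuilt idempotently. A sketch, using the connection URL and table from the earlier examples; the LOCATION path is left elided exactly as in section 4:

```scala
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default")
val stmt = conn.createStatement()
try {
  // Dropping an EXTERNAL table removes metadata only;
  // the Parquet files Spark wrote stay on HDFS.
  stmt.execute("DROP TABLE IF EXISTS elayer.bank_123456_stg")
  stmt.execute(
    """CREATE EXTERNAL TABLE elayer.bank_123456_stg (
      |  transactiondate DATE,
      |  description STRING
      |)
      |STORED AS PARQUET
      |LOCATION 'hdfs://.../parquet/path'""".stripMargin)
} finally {
  stmt.close()
  conn.close()
}
```

Running this at the start of each Airflow task means a LOCATION change can never leave the table pointing at a stale path.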
7. Best Practices
- Always use EXTERNAL tables for Spark-written data
- Always control LOCATION explicitly
- Drop & recreate staging tables when path changes
- Keep Spark and Hive responsibilities separate
8. Summary
Spark + Hive 4 integration works reliably when using HDFS as the bridge.
Avoid direct metastore usage from Spark.
HTH
Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics &
Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
Analytics
View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>