soumilshah1995 commented on issue #10287:
URL: https://github.com/apache/hudi/issues/10287#issuecomment-1850094087
I explored two pathways to address this challenge. The first was to run pure Hive and Spark SQL with the Thrift Server; however, when I attempted to run dbt, I hit a specific issue.
The second, more involved and time-consuming route required installing Apache Derby and Spark. Despite my efforts, an odd complication arose: executing `dbt run` crashed both the Thrift Server and Apache Derby.
## Step 1: Create DBT Environment
```bash
# Create virtual environment for DBT
python -m venv dbt-env
source dbt-env/bin/activate
# Install required packages
pip install dbt-core
pip install dbt-spark
pip install 'dbt-spark[PyHive]'
# Navigate to DBT directory
cd ~/.dbt/
# Set Java environment variable
export JAVA_HOME=/opt/homebrew/Cellar/openjdk@11/11.0.21/libexec/openjdk.jdk/Contents/Home
```

## Step 2: Download and Run Apache Derby
```bash
# Install Apache Derby
export DERBY_VERSION=10.14.2.0
curl -O https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz
tar -xf db-derby-$DERBY_VERSION-bin.tar.gz
export DERBY_HOME=/Users/soumilshah/Desktop/soumil/dbt/db-derby-10.14.2.0-bin
echo $DERBY_HOME
rm db-derby-$DERBY_VERSION-bin.tar.gz
# Start the Derby network server
$DERBY_HOME/bin/startNetworkServer -h localhost
```
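A quick way to confirm that the Derby server and the metastore connection string line up (a sketch only: `DERBY_PORT`/`DERBY_DB` are assumptions matching the defaults used in this setup, and the `ping` line requires the server started above):

```bash
# Assumed values: 1527 is Derby's default network port; MyDatabase matches
# the ConnectionURL passed to spark-submit in Step 5.
DERBY_PORT=1527
DERBY_DB=MyDatabase
DERBY_URL="jdbc:derby://localhost:${DERBY_PORT}/${DERBY_DB};create=true"
echo "$DERBY_URL"
# With the network server running, this should report that it is alive:
# "$DERBY_HOME"/bin/NetworkServerControl ping -h localhost -p "$DERBY_PORT"
```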
## Step 3: Install Apache Spark
```bash
# Specify Spark version
export SPARK_VERSION=3.2.3
# Download and extract Spark
curl -O https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz
tar -xf spark-$SPARK_VERSION-bin-hadoop2.7.tgz
# Set Spark home
export SPARK_HOME=/Users/soumilshah/Desktop/soumil/dbt/spark-3.2.3-bin-hadoop2.7
echo $SPARK_HOME
# Clean up downloaded files
rm spark-$SPARK_VERSION-bin-hadoop2.7.tgz
```
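With Spark unpacked, putting its `bin` directory on `PATH` makes `spark-submit` resolvable from anywhere; a minimal sketch (the fallback path is an assumption for machines where `SPARK_HOME` is not already exported):

```bash
# Reuse SPARK_HOME from the step above, with a hypothetical fallback location
SPARK_HOME="${SPARK_HOME:-$HOME/spark-3.2.3-bin-hadoop2.7}"
export PATH="$SPARK_HOME/bin:$PATH"
echo "${PATH%%:*}"   # first PATH entry is now Spark's bin directory
# Once extracted, a sanity check: spark-submit --version
```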
## Step 4: Copy JAR Files
```bash
# Copy JAR files to Spark JARS directory
cp /Users/soumilshah/Desktop/myjar/*.jar $SPARK_HOME/jars/
```
## Step 5: Spark Submit Configuration
```bash
# Start the Spark Thrift Server
spark-submit \
  --master 'local[*]' \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
  --conf spark.eventLog.enabled=false \
  --conf spark.sql.warehouse.dir=file:///Users/soumilshah/Desktop/soumil/dbt \
  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
  --packages 'org.apache.spark:spark-sql_2.12:3.2.3,org.apache.spark:spark-hive_2.12:3.2.3,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0' \
  --name "Thrift JDBC/ODBC Server" \
  --executor-memory 5g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf hive.metastore.warehouse.dir=/Users/soumilshah/Desktop/soumil/dbt \
  --conf hive.metastore.schema.verification=false \
  --conf datanucleus.schema.autoCreateAll=true \
  --conf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
  --conf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/MyDatabase;create=true'
```
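For `dbt run` to reach this Thrift Server, `~/.dbt/profiles.yml` needs a matching `spark` profile. A minimal sketch, assuming the server listens on Thrift's default port 10000 and using a hypothetical project name `my_hudi_project` with schema `default`:

```yaml
# ~/.dbt/profiles.yml (sketch; the profile name must match dbt_project.yml)
my_hudi_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift        # PyHive-based connection, installed in Step 1
      host: localhost
      port: 10000
      schema: default
      threads: 1
```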
Question: why do I have to use Apache Derby at all? When I simply use Spark SQL and the Hive Thrift Server, I can create Hudi tables through beeline, so why does it fail on `dbt run`?
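For reference, the working beeline path mentioned above can be sketched as follows (port 10000 is HiveThriftServer2's default, not overridden in the spark-submit step; `hudi_demo` is a hypothetical table name):

```bash
# JDBC URL for the Spark Thrift Server started in Step 5
THRIFT_URL="jdbc:hive2://localhost:10000"
echo "$THRIFT_URL"
# With the server up, this succeeds outside dbt:
# beeline -u "$THRIFT_URL" -e "CREATE TABLE hudi_demo (id INT, name STRING) USING hudi;"
```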
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]