soumilshah1995 commented on issue #10287:
URL: https://github.com/apache/hudi/issues/10287#issuecomment-1850094087

   I explored two routes to address this challenge. The first was to start with pure Hive and Spark SQL together with the Thrift Server; however, when I attempted to run dbt, I hit a specific issue.
   
   The second, more involved and time-consuming route required installing Apache Derby and Spark. Despite my efforts, an odd complication arose: executing `dbt run` crashed both the Thrift Server and Apache Derby.
   
   ## Step 1: Create DBT Environment
   
   ```bash
   # Create a virtual environment for dbt
   python -m venv dbt-env
   source dbt-env/bin/activate
   
   # Install required packages
   pip install dbt-core
   pip install dbt-spark
   pip install 'dbt-spark[PyHive]'
   
   # Navigate to the dbt directory
   cd ~/.dbt/
   
   # Set the Java environment variable
   export JAVA_HOME=/opt/homebrew/Cellar/openjdk@11/11.0.21/libexec/openjdk.jdk/Contents/Home
   ```
   
   ## Step 2: Download and Run Apache Derby
   
   ```bash
   # Download and extract Apache Derby
   export DERBY_VERSION=10.14.2.0
   curl -O https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz
   tar -xf db-derby-$DERBY_VERSION-bin.tar.gz
   export DERBY_HOME=/Users/soumilshah/Desktop/soumil/dbt/db-derby-10.14.2.0-bin
   echo $DERBY_HOME
   rm db-derby-$DERBY_VERSION-bin.tar.gz
   
   # Start the Derby network server
   $DERBY_HOME/bin/startNetworkServer -h localhost
   ```
   
   
   ## Step 3: Install Apache Spark
   
   ```bash
   # Specify the Spark version
   export SPARK_VERSION=3.2.3
   
   # Download and extract Spark
   curl -O https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz
   tar -xf spark-$SPARK_VERSION-bin-hadoop2.7.tgz
   
   # Set Spark home
   export SPARK_HOME=/Users/soumilshah/Desktop/soumil/dbt/spark-3.2.3-bin-hadoop2.7
   echo $SPARK_HOME
   
   # Clean up the downloaded archive
   rm spark-3.2.3-bin-hadoop2.7.tgz
   ```
   ## Step 4: Copy JAR Files
   
   ```bash
   # Copy JAR files into Spark's jars directory
   cp /Users/soumilshah/Desktop/myjar/*.jar $SPARK_HOME/jars/
   ```
   ## Step 5: Spark Submit Configuration
   
   ```bash
   # Start the Spark Thrift Server with Hudi support
   spark-submit \
     --master 'local[*]' \
     --conf spark.executor.extraJavaOptions=-Duser.timezone=Etc/UTC \
     --conf spark.eventLog.enabled=false \
     --conf spark.sql.warehouse.dir=file:///Users/soumilshah/Desktop/soumil/dbt \
     --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
     --packages 'org.apache.spark:spark-sql_2.12:3.2.3,org.apache.spark:spark-hive_2.12:3.2.3,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0' \
     --name "Thrift JDBC/ODBC Server" \
     --executor-memory 5g \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     --conf hive.metastore.warehouse.dir=/Users/soumilshah/Desktop/soumil/dbt \
     --conf hive.metastore.schema.verification=false \
     --conf datanucleus.schema.autoCreateAll=true \
     --conf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
     --conf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/MyDatabase;create=true'
   ```
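   
   For reference, dbt connects to this Thrift Server through `~/.dbt/profiles.yml`. A minimal profile along these lines should work; this is a sketch, where the profile name and schema are placeholders and the port assumes the Thrift JDBC/ODBC server's default of 10000:
   
   ```yaml
   # ~/.dbt/profiles.yml -- illustrative; profile name and schema are placeholders
   hudi_dbt:
     target: dev
     outputs:
       dev:
         type: spark
         method: thrift        # connect over the Hive Thrift protocol (PyHive)
         host: localhost
         port: 10000           # default Thrift JDBC/ODBC server port
         schema: default
         threads: 1
   ```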
   
   
   Question: why do I have to use Apache Derby at all? When I simply use Spark SQL and the Hive Thrift Server, I am able to create Hudi tables through Beeline, so why does it fail on `dbt run`?
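   
   For context, the models that `dbt run` executes against this setup would look roughly like the sketch below; the model name and columns are placeholders, and `file_format='hudi'` is the dbt-spark config option that routes table creation through Hudi:
   
   ```sql
   -- models/hudi_example.sql (illustrative placeholder model)
   {{ config(
       materialized='incremental',
       file_format='hudi',
       unique_key='id'
   ) }}
   
   select 1 as id, current_timestamp() as ts
   ```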
    
   

