Rongnengwei opened a new pull request #996: KYLIN-4206 Add Kylin support for the AWS Glue catalog MetaStoreClient
URL: https://github.com/apache/kylin/pull/996
 
 
   This change adds support for the AWS Glue catalog to Kylin. The associated JIRA is 
[https://issues.apache.org/jira/browse/KYLIN-4206](https://issues.apache.org/jira/browse/KYLIN-4206).
   1. First, modify the aws-glue-data-catalog-client source code.
   The aws-glue-data-catalog-client-for-apache-hive-metastore repository is at 
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore; see its README.md for setting up the development environment.
   I built against Hive 2.3.7 locally, so after following the steps in the README.md, the Hive version is 2.3.7-SNAPSHOT.
   1) Modify the pom.xml file in the top-level directory:
   ```
   <hive2.version>2.3.7-SNAPSHOT</hive2.version> 
   <spark-hive.version>1.2.1.spark2</spark-hive.version>
   ```
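   If you prefer to script the edit, the property change can be made with sed. This is only a sketch: a stand-in pom fragment is created under /tmp so the commands can be tried anywhere; in the real checkout you would point sed at the repository's top-level pom.xml.
   ```shell
   # Sketch: bump <hive2.version> in a pom.xml via sed.
   # /tmp/glue-pom.xml is a stand-in for the real top-level pom.xml.
   cat > /tmp/glue-pom.xml <<'EOF'
   <properties>
       <hive2.version>1.2.1</hive2.version>
       <spark-hive.version>1.2.1.spark2</spark-hive.version>
   </properties>
   EOF
   sed -i 's|<hive2.version>.*</hive2.version>|<hive2.version>2.3.7-SNAPSHOT</hive2.version>|' /tmp/glue-pom.xml
   grep 'hive2.version' /tmp/glue-pom.xml
   ```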
   2) Modify the class 
aws-glue-datacatalog-hive2-client/com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient:
   
![image](https://user-images.githubusercontent.com/6345909/70910510-d0bdf480-204a-11ea-86b3-aad84842a063.png)
   
   Implement the method:
   ```
   @Override
   public PartitionValuesResponse listPartitionValues(PartitionValuesRequest partitionValuesRequest)
       throws MetaException, TException, NoSuchObjectException {
     return null;
   }
   ```
   
![image](https://user-images.githubusercontent.com/6345909/70910545-e501f180-204a-11ea-9998-a0da6fdf57bf.png)
   
   3) Modify the class 
aws-glue-datacatalog-spark-client/com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.
   The problems are as follows:
   
![image](https://user-images.githubusercontent.com/6345909/70910558-eb906900-204a-11ea-85e8-b4670cda19a1.png)
   
   This method is not available in the parent class, so delete it, then copy the method from 
aws-glue-datacatalog-hive2-client/com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient. Add the following dependency to the aws-glue-datacatalog-spark-client/pom.xml file:
   ```
   <dependency>
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-exec</artifactId>
       <version>${hive2.version}</version>
       <scope>provided</scope>
   </dependency>
   ```
   4) Package. Three projects need to be packaged, as follows.
   
![image](https://user-images.githubusercontent.com/6345909/70910599-fe0aa280-204a-11ea-8efc-83fccb3789f5.png)
   
![image](https://user-images.githubusercontent.com/6345909/70910613-06fb7400-204b-11ea-89c9-08e4abd7e0cc.png)
   
![image](https://user-images.githubusercontent.com/6345909/70910626-0bc02800-204b-11ea-92df-a133327ffe62.png)
   
   
   
   5) Copy the three packages
   ```
   aws-glue-datacatalog-client-common-1.10.0-SNAPSHOT.jar  
   aws-glue-datacatalog-hive2-client-1.10.0-SNAPSHOT.jar  
   aws-glue-datacatalog-spark-client-1.10.0-SNAPSHOT.jar
   ```
   to /kylin/lib
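   The copy can be done as below. A sketch only: dummy jars and directories under /tmp stand in for the actual build output directory and the Kylin lib directory.
   ```shell
   # Sketch: copy the three Glue client jars into Kylin's lib directory.
   # SRC and DEST are stand-ins for the real build output and /kylin/lib.
   SRC=/tmp/glue-build
   DEST=/tmp/kylin/lib
   mkdir -p "$SRC" "$DEST"
   touch "$SRC"/aws-glue-datacatalog-client-common-1.10.0-SNAPSHOT.jar \
         "$SRC"/aws-glue-datacatalog-hive2-client-1.10.0-SNAPSHOT.jar \
         "$SRC"/aws-glue-datacatalog-spark-client-1.10.0-SNAPSHOT.jar
   cp "$SRC"/aws-glue-datacatalog-*-1.10.0-SNAPSHOT.jar "$DEST"/
   ls "$DEST"
   ```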
   2. Modify the Kylin source code; see the changes in this PR.
   1) Add the gluecatalog option to the kylin.properties config:
   ```
   ## The default HiveMetastoreClient is hcatalog. For AWS users with the Glue catalog, it can be set to gluecatalog
   #kylin.source.hive.metadata-type=hcatalog
   ```
   The default is hcatalog. To use Glue, configure kylin.source.hive.metadata-type=gluecatalog.
   If gluecatalog is configured, you also need the following in hive-site.xml:
   ```
   <property>
     <name>hive.metastore.client.factory.class</name>
     <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
   </property>
   ```
   
   
   ### **Kylin install on EMR**
   
   Note: the new cluster stores the HBase metadata on S3, so you need to create the S3 paths:
   ```
   s3://*******/hbase
   s3://*******/kylin
   ```
   1. Upload the .tar.gz package to the server.
   2. Create the directories. As the root user:
   ```
   cd /usr/local
   mkdir -p kylin
   ```
   Copy the Kylin package to /usr/local/kylin/apache-kylin-3.0.0-SNAPSHOT-bin, change the group the folder belongs to, and switch to the hadoop user:
   ```
   chown -R hadoop:hadoop /usr/local/kylin
   sudo su hadoop
   ```
   3. Configure the user's global variables in .bashrc:
   ```
   ##### hadoop spark kylin #########
   export HIVE_HOME=/usr/lib/hive
   export HADOOP_HOME=/usr/lib/hadoop
   export HBASE_HOME=/usr/lib/hbase
   export SPARK_HOME=/usr/lib/spark
   ## kylin
   export KYLIN_HOME=/usr/local/kylin/apache-kylin-3.0.0-SNAPSHOT-bin
   export HCAT_HOME=/usr/lib/hive-hcatalog
   export KYLIN_CONF_HOME=$KYLIN_HOME/conf
   export tomcat_root=$KYLIN_HOME/tomcat
   export hive_dependency=$HIVE_HOME/conf:$HIVE_HOME/lib/*:$HIVE_HOME/lib/hive-hcatalog-core.jar:$SPARK_HOME/jars/*
   export PATH=$KYLIN_HOME/bin:$PATH
   ```
   Then reload it:
   ```
   source .bashrc
   ```
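   A quick sanity check after sourcing .bashrc can catch a missing export early. A sketch only: check_env is a hypothetical helper, not part of Kylin; it just reports which of the variables above are still unset.
   ```shell
   # Sketch: report any of the expected environment variables that are still unset.
   # check_env is a hypothetical helper. Requires bash (uses ${!v} indirect expansion).
   check_env() {
     local v missing=0
     for v in HIVE_HOME HADOOP_HOME HBASE_HOME SPARK_HOME KYLIN_HOME HCAT_HOME; do
       if [ -z "${!v:-}" ]; then
         echo "missing: $v"
         missing=1
       fi
     done
     return $missing
   }
   ```
   For example, run `check_env || echo 'fix .bashrc first'` right after `source .bashrc`.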
   4. Configure kylin.properties.
   The jackson-datatype-joda-2.4.6.jar that ships with Hive conflicts with jackson-datatype-joda-2.9.5.jar in kylin/tomcat/webapps/kylin/WEB-INF/lib, so rename the Hive jar:
   ```
   mv $HIVE_HOME/lib/jackson-datatype-joda-2.4.6.jar $HIVE_HOME/lib/jackson-datatype-joda-2.4.6.jar.back
   ```
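   The rename pattern generalizes to any conflicting jar. A sketch, with a dummy directory standing in for $HIVE_HOME/lib:
   ```shell
   # Sketch: sideline a conflicting jar by renaming it with a .back suffix,
   # mirroring the jackson-datatype-joda move above. HIVE_LIB is a stand-in path.
   HIVE_LIB=/tmp/hive/lib
   mkdir -p "$HIVE_LIB"
   touch "$HIVE_LIB"/jackson-datatype-joda-2.4.6.jar
   for jar in "$HIVE_LIB"/jackson-datatype-joda-*.jar; do
     mv "$jar" "$jar.back"
   done
   ls "$HIVE_LIB"
   ```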
   1) Configure the jar package run by the Spark engine. This jar needs to be generated and uploaded to HDFS.
   Create /usr/local/kylin/apache-kylin-3.0.0-SNAPSHOT-bin/spark_jars and copy the required jars into it:
   ```
   cp /usr/lib/spark/jars/*.jar /usr/local/kylin/spark_jars
   cp /usr/lib/hbase/lib/*.jar /usr/local/kylin/spark_jars
   ```
   ##### Package conflict
   Remove the lower-version netty-all-4.1.17.Final.jar from spark_jars, then package:
   ```
   jar cv0f spark-libs.jar -C /usr/local/kylin/spark_jars/ ./
   ```
   Configure the MR dependency packages:
   ```
   cp /usr/lib/hive/lib/*.jar /usr/local/kylin/mr_lib
   cp /usr/lib/hbase/lib/*.jar /usr/local/kylin/mr_lib
   ```
   Remove the lower-version netty-all-4.1.17.Final.jar from mr_lib as well, and set:
   ```
   kylin.job.mr.lib.dir=/usr/local/kylin/mr_lib
   ```
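   The conflict removal can be sketched as follows. The directory is a stand-in for spark_jars or mr_lib, and the 4.1.42 jar is an illustrative stand-in for whichever newer netty copy is present:
   ```shell
   # Sketch: delete the older netty-all jar so only the newer copy remains.
   # LIBDIR and the 4.1.42 version are stand-ins for illustration.
   LIBDIR=/tmp/mr_lib
   mkdir -p "$LIBDIR"
   touch "$LIBDIR"/netty-all-4.1.17.Final.jar "$LIBDIR"/netty-all-4.1.42.Final.jar
   find "$LIBDIR" -name 'netty-all-4.1.17.Final.jar' -delete
   ls "$LIBDIR"
   ```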
   2) Upload spark-libs.jar to HDFS:
   ```
   hdfs dfs -mkdir -p /kylin/spark/
   hdfs dfs -put spark-libs.jar /kylin/spark
   ```
   The resulting kylin.properties:
   ```
   ## The metadata store in hbase
   kylin.metadata.url=kylin_metadata@hbase
   ## Working folder in HDFS, better be qualified absolute path, make sure user has the right permission to this directory
   kylin.env.hdfs-working-dir=s3://*******/kylin
   ## DEV|QA|PROD. DEV will turn on some dev features, QA and PROD has no difference in terms of functions.
   kylin.env=DEV
   ## The default HiveMetastoreClient is hcatalog. For AWS users with the Glue catalog, it can be set to gluecatalog
   kylin.source.hive.metadata-type=hcatalog
   ## kylin zk base path
   kylin.env.zookeeper-base-path=/kylin
   ## Kylin server mode, valid value [all, query, job]
   kylin.server.mode=all
   ## Display timezone on UI, format like [GMT+N or GMT-N]
   kylin.web.timezone=GMT+8
   ## Hive database name for putting the intermediate flat tables
   kylin.source.hive.database-for-flat-table=kylin_flat_db_test1
   ## Whether redistribute the intermediate flat table before building
   kylin.source.hive.redistribute-flat-table=false
   ## The storage for final cube file in hbase
   kylin.storage.url=hbase
   ## The prefix of hbase table
   kylin.storage.hbase.table-name-prefix=KYLIN_
   ## HBase Cluster FileSystem, which serving hbase, format as hdfs://hbase-cluster:8020
   ## Leave empty if hbase running on same cluster with hive and mapreduce
   kylin.storage.hbase.cluster-fs=s3://*************/hbase
   #### SPARK ENGINE CONFIGS ###
   #
   ## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
   ## This must contain site xmls of core, yarn, hive, and hbase in one folder
   kylin.env.hadoop-conf-dir=/etc/hadoop/conf
   #kylin.engine.spark.max-partition=5000
   #
   ## Spark conf (default is in spark/conf/spark-defaults.conf)
   kylin.engine.spark-conf.spark.master=yarn
   kylin.engine.spark-conf.spark.submit.deployMode=cluster
   kylin.engine.spark-conf.spark.yarn.queue=default
   kylin.engine.spark-conf.spark.driver.memory=2G
   kylin.engine.spark-conf.spark.executor.memory=18G
   kylin.engine.spark-conf.spark.executor.instances=6
   kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
   kylin.engine.spark-conf.spark.shuffle.service.enabled=true
   kylin.engine.spark-conf.spark.eventLog.enabled=true
   ## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar at runtime
   kylin.engine.spark-conf.spark.yarn.archive=hdfs://ip-172-*****:8020/kylin/spark/spark-libs.jar
   kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
   ############ define ###########
   kylin.job.jar=/usr/local/kylin/apache-kylin-3.0.0-SNAPSHOT-bin/lib/kylin-job-3.0.0-SNAPSHOT.jar
   kylin.coprocessor.local.jar=/usr/local/kylin/apache-kylin-3.0.0-SNAPSHOT-bin/lib/kylin-coprocessor-3.0.0-SNAPSHOT.jar
   kylin.job.mr.lib.dir=/usr/local/kylin/mr_lib
   kylin.query.max-return-rows=10000000
   ```
   5. Configure the *.xml files.
   In kylin_job_conf.xml, configure the ZooKeeper address (obtained from hbase/conf/hbase-site.xml):
   ```
   <property>
     <name>hbase.zookeeper.quorum</name>
     <value>ip-172-40-***.ec2.internal</value>
   </property>
   <property>
     <name>mapreduce.task.timeout</name>
     <value>3600000</value>
     <description>Set task timeout to 1 hour</description>
   </property>
   ```
   In kylin_hive_conf.xml, configure the metastore address:
   ```
   <property>
     <name>hive.metastore.uris</name>
     <value>thrift://ip-17*******.internal:9083</value>
   </property>
   ```
   Add the following at the beginning of the kylin.sh file:
   ```
   export HBASE_CLASSPATH_PREFIX=${tomcat_root}/bin/bootstrap.jar:${tomcat_root}/bin/tomcat-juli.jar:${tomcat_root}/lib/*:$hive_dependency:$HBASE_CLASSPATH_PREFIX
   ```
 
   In /usr/lib/hive/conf/hive-site.xml, configure the use of AWS Glue as the shared data catalog:
   ```
   <property>
     <name>hive.metastore.client.factory.class</name>
     <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
   </property>
   ```
   6. Configure the HBase RPC timeout.
   Add to the hbase-site.xml file:
   ```
   <property>
     <name>hbase.rpc.timeout</name>
     <value>3600000</value>
   </property>
   ```
   7. Start the Hadoop history server:
   ```
   ./mr-jobhistory-daemon.sh start historyserver
   ```
   8. Start the Kylin service:
   ```
   ./kylin.sh start
   ```
   
![image](https://user-images.githubusercontent.com/6345909/70910788-68234780-204b-11ea-866c-80a8dbc84422.png)