Re: Griffin on Docker - modify Hive metastore Uris

Lionel Liu Wed, 09 May 2018 18:46:37 -0700

Hi Enrico,

At current, UI can only render accuracy metrics as a chart, for profiling
metrics, the metrics could be displayed as a table.
However, only if you created the measure and scheduled a job through API,
you can get the profiling metric table on "Jobs" page, by clicking the
"metric" button of the job you've scheduled.
Furthermore, you can also create your own UI or just leverage Kibana to
show your metrics as you wish.


Thanks,
Lionel

On Wed, May 9, 2018 at 11:43 PM, Enrico D'Urso <[email protected]> wrote:

> Hi Lionel,
>
> Thanks for the suggestions they worked.
> The UI suffers a bit to visualize the data, but I think it depends on the
> fact I am running everything on Docker!
>
> To sum up:
> So far, I have been able to connect to S3 (data are there no HDFS in my
> case). Also, I have been able to create some ‘complex’ profiling job,
> creating manually the conf.json using SparkSql as dialect.
> The results are indeed available in ES and HDFS as defined in env.json.
> Cool.
>
> Now, my question is: As I have the new metrics in ES, can I force somehow
> the UI to visualize the metrics I created?
>
> Thanks,
>
> Enrico
>
> On 5/4/18, 2:43 AM, "Lionel Liu" <[email protected]> wrote:
>
>     Hi Enrico,
>
>     If you've modified hive-site.xml, you need to update it in hdfs by
> this,
>     because we've set "spark.yarn.dist.files
>     hdfs:///home/spark_conf/hive-site.xml" in spark-defaults.conf:
>     hadoop fs -rm /home/spark_conf/hive-site.xml
>     hadoop fs -put $HIVE_HOME/conf/hive-site.xml /home/spark_conf/
>
>     Then, you need to restart the griffin service if you modified
>     application.properties
>     of griffin, then it will re-read the application.properties.
>     1. Get pid of griffin service:
>     ps -ef | grep "service.jar"
>     2. kill pid of griffin service:
>     kill -9 <pid>
>     3. start griffin service:
>     cd ~/service/
>     nohup java -jar service.jar > service.log &
>
>     After about 2 minutes, the service starts up, you can refresh UI.
>
>     Thanks,
>     Lionel
>
>     On Thu, May 3, 2018 at 7:00 PM, Enrico D'Urso <[email protected]>
> wrote:
>
>     > Hi,
>     >
>     > I think I fixed the S3 issue,
>     > Basically, I added the following line in
>     > /apache/hadoop-2.6.5/etc/hadoop/core-site.xml :
>     >
>     > <property>
>     >   <name>fs.s3.impl</name>
>     >   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>     > </property>
>     >
>     > Now, the NOT FOUND error is gone!
>     >
>     > I think this should be enough to start playing with Griffin using
> our real
>     > data.
>     > Fix me If I wrong, but I think I can create my own measure config
> file and
>     > then submit the job to Spark.
>     > My question here is:
>     > I modified the metastore URI in hive-site.xml (hive directory),
>     > hive-site.xml (Spark conf directory), and finally also in
>     > /root/service/config/application.properties
>     > but in the UI I still see the old data. Do I need to restart some
> griffin
>     > services to force it to re-read the above config files?
>     >
>     > Thanks,
>     >
>     > Enrico
>     >
>     > On 4/26/18, 9:41 AM, "Enrico D'Urso" <[email protected]> wrote:
>     >
>     >     Hi,
>     >
>     >     That is ok, no problem.
>     >     I will update you if I am able to fix the issue.
>     >
>     >     Thanks,
>     >
>     >     Enrico
>     >
>     >     On 4/26/18, 2:43 AM, "William Guo" <[email protected]> wrote:
>     >
>     >         hi Enrico,
>     >
>     >         Honestly, It is a little difficult for us to setup the
> environment
>     > on aws
>     >         for now, since we are using hdfs.
>     >         But we will figure out how to support aws and post the
> status here.
>     >
>     >         For now, we are busing with release license issue.
>     >         After we have released 0.2.0. we will create a task for AWS.
>     >
>     >         BTW,
>     >
>     >         I am not sure whether this 'application/xml' is right or not
>     >
>     >         '''
>     >         18/04/25 14:13:02 DEBUG http.wire:  << "Content-Type:
>     >         application/xml[\r][\n]"
>     >         '''
>     >
>     >         Thanks,
>     >         William
>     >
>     >
>     >
>     >
>     >         On Wed, Apr 25, 2018 at 10:23 PM, Enrico D'Urso <
>     > [email protected]> wrote:
>     >
>     >         > Hi guys,
>     >         >
>     >         > Thank you for your email.
>     >         > My company is pretty interested in using Griffin (and maybe
>     > contribute to
>     >         > the code), but being able to use it with S3 (Aws in
> general)
>     > instead of
>     >         > HDFS is a crucial point.
>     >         > Let me share my configuration with you, I hope this can
> help to
>     > trouble
>     >         > shoot the issue. I believe that in case it does not, we can
>     > organize a call
>     >         > where I can share my screen.
>     >         >
>     >         > Let’s start with core-site.xml in the following directory:
>     >         > root@griffin:/apache/spark/conf#
>     >         >
>     >         > So it is the one that Spark uses. Here the complete xml:
>     >         > https://paste.ofcode.org/cfZFkRcGPsPshhew6X6HPL
>     >         > However, the important item is:
>     >         >
>     >         > <property>
>     >         >     <name>hive.metastore.uris</name>
>     >         >     <value>thrift://shared-XXXXX-
> dance.us-west-2.hcom-sandbox-
>     >         > aws.aws.hcom:48869</value>
>     >         >     <description>Thrift URI for the remote metastore. Used
> by
>     > metastore
>     >         > client to connect to remote metastore.</description>
>     >         >   </property>
>     >         >
>     >         > which works fine as I can see DBs and tables when using
>     > spark-shell.
>     >         >
>     >         > The second file I modified is core-site.xml here:
>     >         > root@griffin:/apache/hadoop-2.6.5/etc/hadoop#
>     >         > Complete file is here: https://paste.ofcode.org/
>     > k3HZqb6gEDhJd8XM9Pv45u
>     >         > But the important point is:
>     >         > <property>
>     >         > <name>fs.s3.awsAccessKeyId</name>
>     >         > <value>XXXXX</value>
>     >         > </property>
>     >         > <property>
>     >         > <name>fs.s3.awsSecretAccessKey</name>
>     >         > <value>XXXXXX</value>
>     >         > </property>
>     >         >
>     >         > the values are masked, but I can confirm that the values
> are
>     > correct, as
>     >         > it is able to authenticate with AWS.
>     >         >
>     >         > Finally, this is the way I run Spark-shell:
>     >         > spark-shell --deploy-mode client --master yarn
>     >         > --packages=org.apache.hadoop:hadoop-aws:2.6.5,
>     >         > com.amazonaws:aws-java-sdk:1.7.4
>     >         > please note the packages flag, which downloads the required
>     > packages to
>     >         > connect with AWS.
>     >         >
>     >         > Once that the spark-shell is opened I have no problem in
> viewing
>     > the DBs:
>     >         > sqlContext.sql("show databases").collect().foreach(
> println(_))
>     >         > It works and the result is correct.
>     >         > Then when I try to select any table:
>     >         > sqlContext.sql("Select * from XX.YY").take(2)
>     >         >
>     >         > I get the error:
>     >         > Caused by: java.util.concurrent.ExecutionException:
>     >         > java.io.FileNotFoundException: File s3://bucketName/XX/YY/
>     > sentdate=2018-01-14
>     >         > does not exist.
>     >         >         at java.util.concurrent.
> FutureTask.report(FutureTask.
>     > java:122)
>     >         >         at java.util.concurrent.FutureTask.get(FutureTask.
>     > java:192)
>     >         >         at org.apache.hadoop.hive.ql.io.
> orc.OrcInputFormat.
>     >         > generateSplitsInfo(OrcInputFormat.java:998)
>     >         >         ... 93 more
>     >         > Caused by: java.io.FileNotFoundException: File
>     > s3://hcom-data-prod-users/
>     >         > user_tech/email_testing/sentdate=2018-01-14 does not
> exist.
>     >         >         at org.apache.hadoop.fs.s3.
> S3FileSystem.listStatus(
>     >         > S3FileSystem.java:195)
>     >         >         at org.apache.hadoop.fs.FileSystem.listStatus(
>     >         > FileSystem.java:1485)
>     >         >         at org.apache.hadoop.fs.FileSystem.listStatus(
>     >         > FileSystem.java:1525)
>     >         >         at org.apache.hadoop.fs.FileSystem$4.<init>(
>     > FileSystem.java:1682)
>     >         >         at org.apache.hadoop.fs.
> FileSystem.listLocatedStatus(
>     >         > FileSystem.java:1681)
>     >         >         at org.apache.hadoop.fs.
> FileSystem.listLocatedStatus(
>     >         > FileSystem.java:1664)
>     >         >         at org.apache.hadoop.hive.shims.Hadoop23Shims.
>     > listLocatedStatus(
>     >         > Hadoop23Shims.java:667)
>     >         >         at org.apache.hadoop.hive.ql.io.
> AcidUtils.getAcidState(
>     >         > AcidUtils.java:361)
>     >         >         at org.apache.hadoop.hive.ql.io.
> orc.OrcInputFormat$
>     >         > FileGenerator.call(OrcInputFormat.java:634)
>     >         >         at org.apache.hadoop.hive.ql.io.
> orc.OrcInputFormat$
>     >         > FileGenerator.call(OrcInputFormat.java:620)
>     >         >         at java.util.concurrent.FutureTask.run(FutureTask.
>     > java:266)
>     >         >         at java.util.concurrent.
> ThreadPoolExecutor.runWorker(
>     >         > ThreadPoolExecutor.java:1142)
>     >         >         at java.util.concurrent.
> ThreadPoolExecutor$Worker.run(
>     >         > ThreadPoolExecutor.java:617)
>     >         >         at java.lang.Thread.run(Thread.java:748)
>     >         >
>     >         > in fact, enabling debug mode, I see the HTTP-header
> request and
>     > response:
>     >         >
>     >         >  18/04/25 14:13:02 DEBUG conn.DefaultClientConnection:
> Sending
>     > request:
>     >         > GET /XXX/YYY%2Fsentdate%3D2018-01-14 HTTP/1.1
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "GET
>     > /%2Fuser_tech%2Femail_testing%2Fsentdate%3D2018-01-14
>     >         > HTTP/1.1[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "Date: Wed, 25 Apr
> 2018
>     > 14:13:02
>     >         > GMT[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "Host:
>     > hcom-MASK-users.s3.amazonaws.
>     >         > com:443[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "Connection:
>     > Keep-Alive[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "User-Agent:
> JetS3t/0.9.3
>     >         > (Linux/4.9.81-35.56.amzn1.x86_64; amd64; en; JVM
>     > 1.8.0_131)[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  >> "[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.headers: >> GET
>     > /XXX/YYY%2Fsentdate%3D2018-01-14
>     >         > HTTP/1.1
>     >         > 18/04/25 14:13:02 DEBUG http.headers: >> Date: Wed, 25 Apr
> 2018
>     > 14:13:02
>     >         > GMT
>     >         > 18/04/25 14:13:02 DEBUG http.headers: >> Host:
>     > hcom-MASK-prod-users.s3.
>     >         > amazonaws.com:443
>     >         > 18/04/25 14:13:02 DEBUG http.headers: >> Connection:
> Keep-Alive
>     >         > 18/04/25 14:13:02 DEBUG http.headers: >> User-Agent:
> JetS3t/0.9.3
>     >         > (Linux/4.9.81-35.56.amzn1.x86_64; amd64; en; JVM
> 1.8.0_131)
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "HTTP/1.1 404 Not
>     > Found[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "Content-Type:
>     >         > application/xml[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "Transfer-Encoding:
>     > chunked[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "Date: Wed, 25 Apr
> 2018
>     > 14:13:01
>     >         > GMT[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "Server:
> AmazonS3[\r][\n]"
>     >         > 18/04/25 14:13:02 DEBUG http.wire:  << "[\r][\n]"
>     >         >
>     >         > I hope it can help.
>     >         >
>     >         > Thanks,
>     >         >
>     >         > Enrico
>     >         >
>     >         >
>     >         > On 4/25/18, 2:32 AM, "William Guo" <[email protected]>
> wrote:
>     >         >
>     >         >     hi Enrico,
>     >         >
>     >         >     We don't know why aws response 404, could you share
> your log
>     > for us to
>     >         >     trouble shooting?
>     >         >     BTW, can we access your aws instance? that will help
> us find
>     > the issue.
>     >         >
>     >         >
>     >         >     Thanks,
>     >         >     William
>     >         >
>     >         >     On Wed, Apr 25, 2018 at 12:08 AM, Enrico D'Urso <
>     > [email protected]>
>     >         > wrote:
>     >         >
>     >         >     > Hi,
>     >         >     >
>     >         >     > Yes, we did all of those things.
>     >         >     > Spark has the correct Hive metastore URI set and
> also has
>     > the right
>     >         >     > credentials for S3 (where the data is actually
> stored).
>     >         >     > The main problem is that when trying to fetch data
> from
>     > any table/
>     >         > any DB
>     >         >     > we get a File not found exception:
>     >         >     >
>     >         >     > Caused by: java.util.concurrent.ExecutionException:
>     >         >     > java.io.FileNotFoundException: File
>     > s3://XXXXXX-common/XXXX_dm/
>     >         >     > XXXX_trip_details/ctp-20180423t221106.941z-58moytj7/
>     >         > bk_date=2016-12-13
>     >         >     > does not exist.
>     >         >     >
>     >         >     > I checked on s3 and it does exists, although there
> is an
>     > additional
>     >         > level
>     >         >     > after ‘bk_date=2016-12-13’ . The complete path is as
>     > follows:
>     >         >     > s3://XXXXXX-common/XXXX_dm/XXXX_trip_details/ctp-
>     >         >     > 20180423t221106.941z-58moytj7/bk_date=2016-12-13/xyz
>     >         >     >
>     >         >     > Anyone has tested the Docker image to work with S3
> instead
>     > of HDFS?
>     >         >     >
>     >         >     >
>     >         >     > Thanks,
>     >         >     >
>     >         >     > Enrico
>     >         >     > From: Lionel Liu <[email protected]>
>     >         >     > Date: Friday, April 13, 2018 at 10:20 AM
>     >         >     > To: "[email protected]" <
>     >         > [email protected]>,
>     >         >     > Enrico D'Urso <[email protected]>
>     >         >     > Subject: Re: Griffin on Docker - modify Hive
> metastore Uris
>     >         >     >
>     >         >     > Hi Enrico,
>     >         >     >
>     >         >     >
>     >         >     > I think you need to copy hive-site.xml into spark
> config
>     > directory,
>     >         > or
>     >         >     > explicitly set hive-site.xml in spark-shell command
> line.
>     >         >     > Because spark shell creates its sqlContext when
> start up,
>     > after then,
>     >         >     > setConf will not work.
>     >         >     >
>     >         >     >
>     >         >     >
>     >         >     > Thanks,
>     >         >     > Lionel
>     >         >     >
>     >         >     > On Thu, Apr 12, 2018 at 6:04 PM, Enrico D'Urso <
>     > [email protected]
>     >         >     > <mailto:[email protected]>> wrote:
>     >         >     > Hi,
>     >         >     >
>     >         >     > After further investigation, I noticed that Spark is
>     > pointing to the
>     >         > east
>     >         >     > Aws region, by default.
>     >         >     > Any suggestion to force it to use us-west2?
>     >         >     >
>     >         >     > Thanks
>     >         >     >
>     >         >     > From: Enrico D'Urso <[email protected]<mailto:a-
>     > [email protected]
>     >         > >>
>     >         >     > Date: Wednesday, April 11, 2018 at 3:55 PM
>     >         >     > To: Lionel Liu <[email protected]<mailto:l
>     > [email protected]>>,
>     >         > "
>     >         >     > [email protected]<mailto:dev@griffin.
>     >         > incubator.apache.org>"
>     >         >     > <[email protected]<mailto:dev@griffin
> .
>     >         > incubator.apache.org
>     >         >     > >>
>     >         >     > Subject: Re: Griffin on Docker - modify Hive
> metastore Uris
>     >         >     >
>     >         >     > Hi Lionel,
>     >         >     >
>     >         >     > Thank you for your email.
>     >         >     >
>     >         >     > Right now, I am testing Spark cluster using the
> Spark-shell
>     >         > available on
>     >         >     > your Docker image. I just wanted to test it before
> running
>     > any
>     >         > ‘measure
>     >         >     > job’ to tackle any configuration issue.
>     >         >     > I start the shell as follows:
>     >         >     > spark-shell --deploy-mode client --master yarn
>     >         >     > --packages=org.apache.hadoop:hadoop-aws:2.6.5
>     >         >     >
>     >         >     > I am fetching Hadoop-aws:2.6.5 as 2.6.5 is the Hadoop
>     > version that is
>     >         >     > included in the Docker image.
>     >         >     > So far, so good, then I also set the right Hive
> metastore
>     > URI:
>     >         >     > sqlContext.setConf("hive.metastore.uris",
> metastoreURI)
>     >         >     >
>     >         >     > the problem arises when I try to fetch any table for
>     > instance:
>     >         >     > sqlContext.sql("Select * from
> hcom_data_prod_.testtable").
>     > take(2)
>     >         >     >
>     >         >     > the table does exist, but I get an error back saying
> that:
>     >         >     >
>     >         >     > Caused by: java.io.FileNotFoundException: File
>     > s3://hcom-xxXXXxx/yyy
>     >         >     > /testtable/sentdate=2017-10-13 does not exist.
>     >         >     >
>     >         >     > But it does exist, basically AWS is responding with
> 404
>     > http message.
>     >         >     > I think I would get the same error if I try to run
> any
>     > ‘measure
>     >         > job’, so I
>     >         >     > prefer to tackle this earlier.
>     >         >     >
>     >         >     > Are you aware of any S3 endpoint misconfiguration
> with old
>     > version of
>     >         >     > Hadoop-aws?
>     >         >     >
>     >         >     > Many thanks,
>     >         >     >
>     >         >     > Enrico
>     >         >     >
>     >         >     >
>     >         >     > From: Lionel Liu <[email protected]<mailto:l
>     > [email protected]>>
>     >         >     > Date: Wednesday, April 11, 2018 at 3:34 AM
>     >         >     > To: "[email protected]<mailto:
> dev@griffin.
>     >         >     > incubator.apache.org>" <
> [email protected]
>     > <mailto:
>     >         >     > [email protected]>>, Enrico D'Urso <
>     >         > [email protected]
>     >         >     > <mailto:[email protected]>>
>     >         >     > Subject: Re: Griffin on Docker - modify Hive
> metastore Uris
>     >         >     >
>     >         >     > Hi Enrico,
>     >         >     >
>     >         >     > Griffin service only need to get metadata from hive
>     > metastore
>     >         > service, it
>     >         >     > doesn't fetch hive table data actually.
>     >         >     > Griffin measure, which runs on spark cluster, needs
> to
>     > fetch hive
>     >         > table
>     >         >     > data, you need to pass the AWS credentials to it when
>     > submit. I
>     >         > recommend
>     >         >     > you try the shell-submit way to submit the measure
> module
>     > first.
>     >         >     >
>     >         >     >
>     >         >     >
>     >         >     > Thanks,
>     >         >     > Lionel
>     >         >     >
>     >         >     > On Tue, Apr 10, 2018 at 9:48 PM, Enrico D'Urso <
>     > [email protected]
>     >         >     > <mailto:[email protected]><mailto:
> [email protected]<
>     > mailto:a-
>     >         >     > [email protected]>>> wrote:
>     >         >     > Hi,
>     >         >     >
>     >         >     > I have just set up the Griffin Docker image and it
> seems
>     > to work ok,
>     >         > I am
>     >         >     > able to view the sample data that comes by default.
>     >         >     >
>     >         >     > Now, I would like to test a bit the metrics things
> against
>     > a subset
>     >         > of a
>     >         >     > table that I have in our Hive instance;
>     >         >     > In particular the configuration is as follows:
>     >         >     > - Hive Metastore on RDS (Mysql on Amazon)
>     >         >     > -Actual data on  Amazon S3
>     >         >     >
>     >         >     > The machine in which Docker is running has access to
> the
>     > metastore
>     >         > and
>     >         >     > also can potentially fetch data from S3.
>     >         >     >
>     >         >     > I connected into the Docker image and now I am
> checking the
>     >         > following file:
>     >         >     > /root/service/config/application.properties
>     >         >     >
>     >         >     > in which I see the hive.metastore.uris that I can
>     > potentially modify.
>     >         >     > I would also need to pass to Griffin the AWS
> credentials
>     > to be able
>     >         > to
>     >         >     > fetch data from S3.
>     >         >     >
>     >         >     > Anyone has experience on this?
>     >         >     >
>     >         >     > Thanks,
>     >         >     >
>     >         >     > E.
>     >         >     >
>     >         >     >
>     >         >
>     >         >
>     >         >
>     >
>     >
>     >
>     >
>     >
>
>
>

Re: Griffin on Docker - modify Hive metastore Uris

Reply via email to