Hi guys,

Thank you for your email.
My company is quite interested in using Griffin (and maybe contributing to the 
code), but being able to use it with S3 (AWS in general) instead of HDFS is a 
crucial point.
Let me share my configuration with you; I hope it helps troubleshoot the issue. 
If it does not, we can organize a call where I can share my screen.

Let’s start with core-site.xml in the following directory:
root@griffin:/apache/spark/conf# 

So it is the one that Spark uses. Here is the complete XML: 
https://paste.ofcode.org/cfZFkRcGPsPshhew6X6HPL
However, the important item is:

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://shared-XXXXX-dance.us-west-2.hcom-sandbox-aws.aws.hcom:48869</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

which works fine as I can see DBs and tables when using spark-shell.
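
For completeness, this is roughly how I sanity-check it from inside spark-shell. 
I am assuming the property also shows up in the Hadoop configuration since it 
lives in core-site.xml, and "some_db" below is just a placeholder database name:

// Check the metastore URI picked up from core-site.xml, then list what is visible.
// "some_db" is a placeholder database name.
sc.hadoopConfiguration.get("hive.metastore.uris")
sqlContext.sql("show databases").collect().foreach(println(_))
sqlContext.sql("show tables in some_db").collect().foreach(println(_))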

The second file I modified is core-site.xml here:
root@griffin:/apache/hadoop-2.6.5/etc/hadoop# 
The complete file is here: https://paste.ofcode.org/k3HZqb6gEDhJd8XM9Pv45u
The important properties are:
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>XXXXX</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>XXXXXX</value>
  </property>

The values are masked, but I can confirm they are correct, as Spark is able to 
authenticate with AWS.
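
Just in case it matters, the same credentials can also be set at runtime from 
spark-shell through the Hadoop configuration. A minimal sketch, with placeholder 
values and only the fs.s3.* keys that match the s3:// scheme in our table 
locations:

// Runtime equivalent of the core-site.xml entries above (placeholder values).
// The s3n:// and s3a:// filesystems use their own analogous key names.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "XXXXX")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "XXXXXX")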

Finally, this is how I run spark-shell:
spark-shell --deploy-mode client --master yarn --packages=org.apache.hadoop:hadoop-aws:2.6.5,com.amazonaws:aws-java-sdk:1.7.4
Please note the --packages flag, which downloads the packages required to connect 
to AWS.
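
As far as I know, the same dependencies could also be pinned in 
conf/spark-defaults.conf instead of on the command line, if that is easier to 
reproduce on your side:

# spark-defaults.conf equivalent of the --packages flag above
spark.jars.packages  org.apache.hadoop:hadoop-aws:2.6.5,com.amazonaws:aws-java-sdk:1.7.4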

Once spark-shell is open, I have no problem viewing the databases:
sqlContext.sql("show databases").collect().foreach(println(_))
It works and the result is correct.
Then when I try to select any table:
sqlContext.sql("Select * from XX.YY").take(2)

I get the error:
Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File s3://bucketName/XX/YY/sentdate=2018-01-14 does not exist.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
        ... 93 more
Caused by: java.io.FileNotFoundException: File s3://hcom-data-prod-users/user_tech/email_testing/sentdate=2018-01-14 does not exist.
        at org.apache.hadoop.fs.s3.S3FileSystem.listStatus(S3FileSystem.java:195)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
        at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:1682)
        at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1681)
        at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1664)
        at org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedStatus(Hadoop23Shims.java:667)
        at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:361)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

In fact, with debug mode enabled, I can see the HTTP request and response headers:

18/04/25 14:13:02 DEBUG conn.DefaultClientConnection: Sending request: GET /XXX/YYY%2Fsentdate%3D2018-01-14 HTTP/1.1
18/04/25 14:13:02 DEBUG http.wire:  >> "GET /%2Fuser_tech%2Femail_testing%2Fsentdate%3D2018-01-14 HTTP/1.1[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  >> "Date: Wed, 25 Apr 2018 14:13:02 GMT[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  >> "Host: hcom-MASK-users.s3.amazonaws.com:443[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  >> "Connection: Keep-Alive[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  >> "User-Agent: JetS3t/0.9.3 (Linux/4.9.81-35.56.amzn1.x86_64; amd64; en; JVM 1.8.0_131)[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  >> "[\r][\n]"
18/04/25 14:13:02 DEBUG http.headers: >> GET /XXX/YYY%2Fsentdate%3D2018-01-14 HTTP/1.1
18/04/25 14:13:02 DEBUG http.headers: >> Date: Wed, 25 Apr 2018 14:13:02 GMT
18/04/25 14:13:02 DEBUG http.headers: >> Host: hcom-MASK-prod-users.s3.amazonaws.com:443
18/04/25 14:13:02 DEBUG http.headers: >> Connection: Keep-Alive
18/04/25 14:13:02 DEBUG http.headers: >> User-Agent: JetS3t/0.9.3 (Linux/4.9.81-35.56.amzn1.x86_64; amd64; en; JVM 1.8.0_131)
18/04/25 14:13:02 DEBUG http.wire:  << "HTTP/1.1 404 Not Found[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  << "Content-Type: application/xml[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  << "Transfer-Encoding: chunked[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  << "Date: Wed, 25 Apr 2018 14:13:01 GMT[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  << "Server: AmazonS3[\r][\n]"
18/04/25 14:13:02 DEBUG http.wire:  << "[\r][\n]"
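
If it helps to narrow things down, the same listing can be reproduced directly 
from spark-shell, bypassing Hive entirely. A minimal probe would look roughly 
like this, where the bucket and path are the same masked placeholders as in the 
error above:

// Directly exercise the listStatus call that fails inside OrcInputFormat.
// "s3://bucketName/XX/YY/sentdate=2018-01-14" is the masked placeholder path.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3://bucketName"), sc.hadoopConfiguration)
fs.listStatus(new Path("s3://bucketName/XX/YY/sentdate=2018-01-14")).foreach(s => println(s.getPath))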

I hope this helps.

Thanks,

Enrico


On 4/25/18, 2:32 AM, "William Guo" <[email protected]> wrote:

    hi Enrico,
    
    We don't know why AWS responds with 404; could you share your log with us for
    troubleshooting?
    BTW, can we access your AWS instance? That would help us find the issue.
    
    
    Thanks,
    William
    
    On Wed, Apr 25, 2018 at 12:08 AM, Enrico D'Urso <[email protected]> wrote:
    
    > Hi,
    >
    > Yes, we did all of those things.
    > Spark has the correct Hive metastore URI set and also has the right
    > credentials for S3 (where the data is actually stored).
    > The main problem is that when trying to fetch data from any table / any DB
    > we get a FileNotFoundException:
    >
    > Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException:
    > File s3://XXXXXX-common/XXXX_dm/XXXX_trip_details/ctp-20180423t221106.941z-58moytj7/bk_date=2016-12-13 does not exist.
    >
    > I checked on S3 and it does exist, although there is an additional level
    > after ‘bk_date=2016-12-13’. The complete path is as follows:
    > s3://XXXXXX-common/XXXX_dm/XXXX_trip_details/ctp-20180423t221106.941z-58moytj7/bk_date=2016-12-13/xyz
    >
    > Has anyone tested the Docker image with S3 instead of HDFS?
    >
    >
    > Thanks,
    >
    > Enrico
    > From: Lionel Liu <[email protected]>
    > Date: Friday, April 13, 2018 at 10:20 AM
    > To: "[email protected]" <[email protected]>,
    > Enrico D'Urso <[email protected]>
    > Subject: Re: Griffin on Docker - modify Hive metastore Uris
    >
    > Hi Enrico,
    >
    >
    > I think you need to copy hive-site.xml into the Spark config directory, or
    > explicitly set hive-site.xml on the spark-shell command line, because
    > spark-shell creates its sqlContext at startup; after that, setConf will not
    > work.
    >
    >
    >
    > Thanks,
    > Lionel
    >
    > On Thu, Apr 12, 2018 at 6:04 PM, Enrico D'Urso <[email protected]
    > <mailto:[email protected]>> wrote:
    > Hi,
    >
    > After further investigation, I noticed that Spark is pointing to the east
    > AWS region by default.
    > Any suggestion on how to force it to use us-west-2?
    >
    > Thanks
    >
    > From: Enrico D'Urso <[email protected]<mailto:[email protected]>>
    > Date: Wednesday, April 11, 2018 at 3:55 PM
    > To: Lionel Liu <[email protected]<mailto:[email protected]>>, "
    > [email protected]<mailto:[email protected]>"
    > <[email protected]<mailto:[email protected]
    > >>
    > Subject: Re: Griffin on Docker - modify Hive metastore Uris
    >
    > Hi Lionel,
    >
    > Thank you for your email.
    >
    > Right now, I am testing the Spark cluster using the spark-shell available in
    > your Docker image. I just wanted to test it before running any ‘measure
    > job’, in order to tackle any configuration issue first.
    > I start the shell as follows:
    > spark-shell --deploy-mode client --master yarn
    > --packages=org.apache.hadoop:hadoop-aws:2.6.5
    >
    > I am fetching hadoop-aws:2.6.5, as 2.6.5 is the Hadoop version that is
    > included in the Docker image.
    > So far, so good, then I also set the right Hive metastore URI:
    > sqlContext.setConf("hive.metastore.uris", metastoreURI)
    >
    > The problem arises when I try to fetch any table, for instance:
    > sqlContext.sql("Select * from hcom_data_prod_.testtable").take(2)
    >
    > The table does exist, but I get an error back saying:
    >
    > Caused by: java.io.FileNotFoundException: File s3://hcom-xxXXXxx/yyy
    > /testtable/sentdate=2017-10-13 does not exist.
    >
    > But it does exist; basically AWS is responding with an HTTP 404.
    > I think I would get the same error if I tried to run any ‘measure job’, so I
    > prefer to tackle this earlier.
    >
    > Are you aware of any S3 endpoint misconfiguration with old versions of
    > hadoop-aws?
    >
    > Many thanks,
    >
    > Enrico
    >
    >
    > From: Lionel Liu <[email protected]<mailto:[email protected]>>
    > Date: Wednesday, April 11, 2018 at 3:34 AM
    > To: "[email protected]<mailto:dev@griffin.
    > incubator.apache.org>" <[email protected]<mailto:
    > [email protected]>>, Enrico D'Urso <[email protected]
    > <mailto:[email protected]>>
    > Subject: Re: Griffin on Docker - modify Hive metastore Uris
    >
    > Hi Enrico,
    >
    > The Griffin service only needs to get metadata from the Hive metastore
    > service; it doesn't actually fetch Hive table data.
    > Griffin measure, which runs on the Spark cluster, needs to fetch Hive table
    > data, so you need to pass the AWS credentials to it when submitting. I
    > recommend you try the shell-submit way to submit the measure module first.
    >
    >
    >
    > Thanks,
    > Lionel
    >
    > On Tue, Apr 10, 2018 at 9:48 PM, Enrico D'Urso <[email protected]
    > <mailto:[email protected]><mailto:[email protected]<mailto:a-
    > [email protected]>>> wrote:
    > Hi,
    >
    > I have just set up the Griffin Docker image and it seems to work OK; I am
    > able to view the sample data that comes by default.
    >
    > Now, I would like to test the metrics against a subset of a table that I
    > have in our Hive instance.
    > In particular, the configuration is as follows:
    > - Hive metastore on RDS (MySQL on Amazon)
    > - Actual data on Amazon S3
    >
    > The machine in which Docker is running has access to the metastore and
    > also can potentially fetch data from S3.
    >
    > I connected into the Docker container and now I am checking the following file:
    > /root/service/config/application.properties
    >
    > in which I see the hive.metastore.uris property that I can potentially modify.
    > I would also need to pass Griffin the AWS credentials to be able to
    > fetch data from S3.
    >
    > Does anyone have experience with this?
    >
    > Thanks,
    >
    > E.
    >
    >
    
