[ 
https://issues.apache.org/jira/browse/BEAM-4260?focusedWorklogId=130878&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-130878
 ]

ASF GitHub Bot logged work on BEAM-4260:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Aug/18 14:23
            Start Date: 03/Aug/18 14:23
    Worklog Time Spent: 10m 
      Work Description: asfgit closed pull request #512: [BEAM-4260] Document 
HCatalogIO use with Hive 1.1
URL: https://github.com/apache/beam-site/pull/512
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/documentation/io/built-in-hcatalog.md 
b/src/documentation/io/built-in-hcatalog.md
new file mode 100644
index 0000000000..88e10086ef
--- /dev/null
+++ b/src/documentation/io/built-in-hcatalog.md
@@ -0,0 +1,147 @@
+---
+layout: section
+title: "Apache HCatalog InputFormat IO"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/built-in/hcatalog/
+---
+
+[Pipeline I/O Table of Contents]({{site.baseurl}}/documentation/io/io-toc/)
+
+# HCatalog IO
+
+An `HCatalogIO` is a transform for reading and writing data to an HCatalog 
managed source.
+
+### Reading using HCatalogIO
+
+To configure an HCatalog source, you must specify a metastore URI and a table 
name. Other optional parameters are database and filter.
+
+For example:
+```java
+Map<String, String> configProperties = new HashMap<String, String>();
+configProperties.put("hive.metastore.uris","thrift://metastore-host:port"); 
+pipeline
+  .apply(HCatalogIO.read()
+  .withConfigProperties(configProperties)
+  .withDatabase("default") //optional, assumes default if none specified
+  .withTable("employee")
+  .withFilter(filterString)); //optional, may be specified if the table is partitioned
+```
+```py
+  # The Beam SDK for Python does not support HCatalogIO.
+```
+
+### Writing using HCatalogIO
+
+To configure an `HCatalog` sink, you must specify a metastore URI and a table 
name. Other
+optional parameters are database, partition and batchsize.
+The destination table should exist beforehand as the transform will not create 
a new table if missing.
+
+For example:
+```java
+Map<String, String> configProperties = new HashMap<String, String>();
+configProperties.put("hive.metastore.uris","thrift://metastore-host:port");
+
+pipeline
+  .apply(...)
+  .apply(HCatalogIO.write()
+    .withConfigProperties(configProperties)
+    .withDatabase("default") //optional, assumes default if none specified
+    .withTable("employee")
+    .withPartition(partitionValues) //optional, may be specified if the table is partitioned
+    .withBatchSize(1024L)); //optional, assumes a default batch size of 1024 if none specified
+```
+```py
+  # The Beam SDK for Python does not support HCatalogIO.
+```
+
+### Using older versions of HCatalog (1.x)
+
+`HCatalogIO` is built for Apache HCatalog versions 2 and up and will not work out of the box with older versions of HCatalog.
+The following illustrates a workaround for Hive 1.1.
+
+Include the following Hive 1.2 jars in the über jar you build.
+The 1.2 jars provide the methods Beam needs while remaining compatible with Hive 1.1.
+ 
+```
+<dependency>
+    <groupId>org.apache.beam</groupId>
+    <artifactId>beam-sdks-java-io-hcatalog</artifactId>
+    <version>${beam.version}</version>
+</dependency>
+<dependency>
+    <groupId>org.apache.hive.hcatalog</groupId>
+    <artifactId>hive-hcatalog-core</artifactId>
+    <version>1.2</version>
+</dependency>
+<dependency>
+    <groupId>org.apache.hive</groupId>
+    <artifactId>hive-metastore</artifactId>
+    <version>1.2</version>
+</dependency>
+<dependency>
+    <groupId>org.apache.hive</groupId>
+    <artifactId>hive-exec</artifactId>
+    <version>1.2</version>
+</dependency>
+<dependency>
+    <groupId>org.apache.hive</groupId>
+    <artifactId>hive-common</artifactId>
+    <version>1.2</version>
+</dependency>
+```
+ 
+Relocate _only_ the following hive packages:
+
+```
+<plugin>
+    <groupId>org.apache.maven.plugins</groupId>
+    <artifactId>maven-shade-plugin</artifactId>
+    <version>${maven-shade-plugin.version}</version>
+    <configuration>
+        <createDependencyReducedPom>false</createDependencyReducedPom>
+        <filters>
+            <filter>
+                <artifact>*:*</artifact>
+                <excludes>
+                    <exclude>META-INF/*.SF</exclude>
+                    <exclude>META-INF/*.DSA</exclude>
+                    <exclude>META-INF/*.RSA</exclude>
+                </excludes>
+            </filter>
+        </filters>
+    </configuration>
+    <executions>
+        <execution>
+            <phase>package</phase>
+            <goals>
+                <goal>shade</goal>
+            </goals>
+            <configuration>
+                <shadedArtifactAttached>true</shadedArtifactAttached>
+                <shadedClassifierName>shaded</shadedClassifierName>
+                <transformers>
+                    <transformer 
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
+                </transformers>
+                <relocations>
+                    <!-- Important: Do not relocate org.apache.hadoop.hive -->
+                    <relocation>
+                        <pattern>org.apache.hadoop.hive.conf</pattern>
+                        
<shadedPattern>h12.org.apache.hadoop.hive.conf</shadedPattern>
+                    </relocation>
+                    <relocation>
+                        <pattern>org.apache.hadoop.hive.ql</pattern>
+                        
<shadedPattern>h12.org.apache.hadoop.hive.ql</shadedPattern>
+                    </relocation>
+                    <relocation>
+                        <pattern>org.apache.hadoop.hive.metastore</pattern>
+                        
<shadedPattern>h12.org.apache.hadoop.hive.metastore</shadedPattern>
+                    </relocation>
+                </relocations>
+            </configuration>
+        </execution>
+    </executions>
+</plugin>
+```
+
+This has been tested to read SequenceFile- and ORCFile-backed tables running with
+Beam 2.4.0 on Spark 2.3 / YARN in a Cloudera CDH 5.12.2 managed environment.
\ No newline at end of file
diff --git a/src/documentation/io/built-in.md b/src/documentation/io/built-in.md
index 186128465f..0cda2b6bc7 100644
--- a/src/documentation/io/built-in.md
+++ b/src/documentation/io/built-in.md
@@ -58,7 +58,7 @@ Consult the [Programming Guide I/O section]({{site.baseurl 
}}/documentation/prog
     <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/cassandra">Apache 
Cassandra</a></p>
     <p><a href="{{site.baseurl}}/documentation/io/built-in/hadoop/">Apache 
Hadoop InputFormat</a></p>
     <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/hbase">Apache 
HBase</a></p>
-    <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/hcatalog">Apache 
Hive (HCatalog)</a></p>
+    <p><a href="{{site.baseurl}}/documentation/io/built-in/hcatalog">Apache 
Hive (HCatalog)</a></p>
     <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/kudu">Apache 
Kudu</a></p>
     <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/solr">Apache 
Solr</a></p>
     <p><a 
href="https://github.com/apache/beam/tree/master/sdks/java/io/elasticsearch">Elasticsearch
 (v2.x and v5.x)</a></p>
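
For readers applying the shade configuration in the diff above, the following is a minimal sketch of how the three relocation rules are expected to rewrite class names. It is a simplified model, not the maven-shade-plugin's actual implementation: the `RelocationSketch` class and its `relocate` helper are hypothetical, matching is assumed to happen on whole package prefixes, and the class names passed in are only illustrative inputs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that models the <relocations> block as
// "package prefix -> shaded prefix" rules. Not part of Beam or Maven.
public class RelocationSketch {

    // The three pattern/shadedPattern pairs from the shade configuration.
    private static final Map<String, String> RELOCATIONS = new LinkedHashMap<>();
    static {
        RELOCATIONS.put("org.apache.hadoop.hive.conf", "h12.org.apache.hadoop.hive.conf");
        RELOCATIONS.put("org.apache.hadoop.hive.ql", "h12.org.apache.hadoop.hive.ql");
        RELOCATIONS.put("org.apache.hadoop.hive.metastore", "h12.org.apache.hadoop.hive.metastore");
    }

    // Returns the relocated name if the class sits in one of the listed
    // packages (or a subpackage), otherwise returns the name unchanged.
    static String relocate(String className) {
        for (Map.Entry<String, String> rule : RELOCATIONS.entrySet()) {
            String pattern = rule.getKey();
            if (className.equals(pattern) || className.startsWith(pattern + ".")) {
                return rule.getValue() + className.substring(pattern.length());
            }
        }
        return className;
    }

    public static void main(String[] args) {
        // Falls under a relocated package, so it gains the h12 prefix.
        System.out.println(relocate("org.apache.hadoop.hive.conf.HiveConf"));
        // Falls under org.apache.hadoop.hive but not under a listed
        // subpackage, so it keeps its original name.
        System.out.println(relocate("org.apache.hadoop.hive.serde2.SerDe"));
    }
}
```

Under this model, only classes in the `conf`, `ql` and `metastore` subpackages move under the `h12` prefix, while everything else in `org.apache.hadoop.hive` keeps its original name, which is the behavior the "Do not relocate org.apache.hadoop.hive" comment asks for.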


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 130878)
    Time Spent: 2h 50m  (was: 2h 40m)

> Document usage for hcatalog 1.1
> -------------------------------
>
>                 Key: BEAM-4260
>                 URL: https://issues.apache.org/jira/browse/BEAM-4260
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-hcatalog, website
>    Affects Versions: 2.4.0
>            Reporter: Tim Robertson
>            Assignee: Tim Robertson
>            Priority: Minor
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The {{HCatalogIO}} does not work with environments providing Hive Server 1.x 
> which is in widespread use - as an example the latest Cloudera (5.14.2) 
> provides 1.1.x
>  
> The {{HCatalogIO}} marks its Hive dependencies as provided, so I believe the 
> intention was to be open to multiple versions.
>  
> The issues come from the following:  
>  - use of {{HCatUtil.getHiveMetastoreClient(hiveConf)}} while previous 
> versions used the [now 
> deprecated|https://github.com/apache/hive/blob/master/hcatalog/core/src/main/java/org/apache/hive/hcatalog/common/HCatUtil.java#L586]
>  {{getHiveClient(HiveConf hiveConf)}}  
>  - Changes to the signature of {{RetryingMetaStoreClient.getProxy(...)}}
>  
> Given this doesn't work in a major Hadoop distro, and will not until the next 
> CDH release later in 2018 (i.e. widespread adoption only expected in 2019) I 
> think it would be worthwhile providing a fix/workaround.
> I _think_ building for 2.3 and relocating in your own app might be a 
> workaround although I'm still testing it.  If that is successful I'd propose 
> adding it to the project README or in a separate markdown file linked from 
> the README.
> Does that sound like a reasonable approach please?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
