sjwiesman commented on a change in pull request #14145:
URL: https://github.com/apache/flink/pull/14145#discussion_r527904542
##########
File path: docs/dev/table/hive/hive_read_write.md
##########
@@ -22,119 +22,199 @@ specific language governing permissions and limitations
under the License.
-->
-Using the `HiveCatalog` and Flink's connector to Hive, Flink can read and write from Hive data as an alternative to Hive's batch engine.
-Be sure to follow the instructions to include the correct [dependencies]({{ site.baseurl }}/dev/table/hive/#depedencies) in your application.
-And please also note that Hive connector only works with blink planner.
+Using the HiveCatalog, Apache Flink can be used for unified `BATCH` and `STREAMING` processing of Apache
+Hive Tables. This means Flink can be used as a more performant alternative to Hive’s batch engine,
+or to continuously read and write data into and out of Hive tables to power real-time data
+warehousing applications.
+
+<div class="alert alert-info">
+ <b>IMPORTANT:</b> Reading and writing to and from Apache Hive is only supported by the Blink table planner.
+</div>
* This will be replaced by the TOC
{:toc}
-## Reading From Hive
+## Reading
-Assume Hive contains a single table in its `default` database, named people that contains several rows.
+Flink supports reading data from Hive in both `BATCH` and `STREAMING` modes. When run as a `BATCH`
+application, Flink will execute its query over the state of the table at the point in time when the
+query is executed. `STREAMING` reads will continuously monitor the table and incrementally fetch
+new data as it is made available. Flink will read tables as bounded by default.
+
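+A minimal sketch of the default bounded read, assuming a Hive table named `hive_table` already
+exists in the current catalog (the name is illustrative):
+
+{% highlight sql %}
+-- With no extra options, the table is read as a bounded source:
+-- the query sees the table's state at the point in time it is executed.
+SELECT * FROM hive_table;
+{% endhighlight %}
+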
+`STREAMING` reads support consuming both partitioned and non-partitioned tables.
+For partitioned tables, Flink will monitor the generation of new partitions, and read
+them incrementally when available. For non-partitioned tables, Flink will monitor the generation
+of new files in the folder and read new files incrementally.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Key</th>
+ <th class="text-left" style="width: 15%">Default</th>
+ <th class="text-left" style="width: 10%">Type</th>
+ <th class="text-left" style="width: 55%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>streaming-source.enable</h5></td>
+ <td style="word-wrap: break-word;">false</td>
+ <td>Boolean</td>
+ <td>Enable streaming source or not. NOTE: Please make sure that each partition/file is written atomically; otherwise the reader may get incomplete data.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.monitor-interval</h5></td>
+ <td style="word-wrap: break-word;">1 m</td>
+ <td>Duration</td>
+ <td>Time interval for consecutively monitoring partition/file.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.consume-order</h5></td>
+ <td style="word-wrap: break-word;">create-time</td>
+ <td>String</td>
+ <td>The consume order of the streaming source. Supports create-time and partition-time. create-time compares the partition/file creation time; this is not the partition creation time in the Hive metastore, but the folder/file modification time in the filesystem. partition-time compares the time represented by the partition name; if the partition folder is somehow updated, e.g. a new file is added into the folder, it can affect how the data is consumed. For non-partitioned tables, this value should always be 'create-time'.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.consume-start-offset</h5></td>
+ <td style="word-wrap: break-word;">1970-00-00</td>
+ <td>String</td>
+ <td>Start offset for streaming consuming. How to parse and compare offsets depends on the consume order. For create-time and partition-time, the value should be a timestamp string (yyyy-[m]m-[d]d [hh:mm:ss]). For partition-time, the partition time extractor is used to extract the time from the partition name.</td>
+ </tr>
+ </tbody>
+</table>
+
+[SQL Hints]({% link dev/table/sql/hints.md %}) can be used to apply configurations to a Hive table
+without changing its definition in the Hive metastore.
+
+{% highlight sql %}
+
+SELECT *
+FROM hive_table
+/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2020-05-20') */;
-{% highlight bash %}
-hive> show databases;
-OK
-default
-Time taken: 0.841 seconds, Fetched: 1 row(s)
-
-hive> show tables;
-OK
-Time taken: 0.087 seconds
-
-hive> CREATE TABLE mytable(name string, value double);
-OK
-Time taken: 0.127 seconds
-
-hive> SELECT * FROM mytable;
-OK
-Tom 4.72
-John 8.0
-Tom 24.2
-Bob 3.14
-Bob 4.72
-Tom 34.9
-Mary 4.79
-Tiff 2.72
-Bill 4.33
-Mary 77.7
-Time taken: 0.097 seconds, Fetched: 10 row(s)
{% endhighlight %}
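+
+The other options from the table above can be combined in the same way. A hedged sketch, again
+assuming the illustrative `hive_table`, that also overrides the consume order and monitor interval:
+
+{% highlight sql %}
+SELECT *
+FROM hive_table
+/*+ OPTIONS('streaming-source.enable'='true',
+            'streaming-source.consume-order'='partition-time',
+            'streaming-source.monitor-interval'='5 m') */;
+{% endhighlight %}
+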
-With the data ready your can connect to Hive [connect to an existing Hive installation]({{ site.baseurl }}/dev/table/hive/#connecting-to-hive) and begin querying.
+**Notes**
-{% highlight bash %}
+- The monitor strategy is to scan all directories/files currently in the location path. Many partitions may cause performance degradation.
+- Streaming reads for non-partitioned tables require that each file be written atomically into the target directory.
+- Streaming reads for partitioned tables require that each partition be added atomically, as seen by the Hive metastore. If not, new data added to an existing partition will be consumed.
+- Streaming reads do not support watermark grammar in Flink DDL. These tables cannot be used for window operators.
-Flink SQL> show catalogs;
-myhive
-default_catalog
+## Reading Hive Views
-# ------ Set the current catalog to be 'myhive' catalog if you haven't set it in the yaml file ------
+Flink is able to read from Hive-defined views, but some limitations apply:
-Flink SQL> use catalog myhive;
+1) The Hive catalog must be set as the current catalog before you can query the view.
+This can be done by either `tableEnv.useCatalog(...)` in Table API or `USE CATALOG ...` in SQL Client, as shown in the sketch after this list.
-# ------ See all registered database in catalog 'mytable' ------
+2) Hive and Flink SQL have different syntax, e.g. different reserved keywords and literals.
+Make sure the view’s query is compatible with Flink grammar.
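+
+A minimal sketch in SQL, assuming a Hive catalog registered as `myhive` and a view named `hive_view`
+(both names are illustrative):
+
+{% highlight sql %}
+-- make the Hive catalog the current catalog, then query the view
+USE CATALOG myhive;
+
+SELECT * FROM hive_view;
+{% endhighlight %}
+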
-Flink SQL> show databases;
-default
+### Temporal Table Join
Review comment:
       There's an open PR (#14152) to expand the temporal table join documentation; why don't you comment there?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]