sjwiesman commented on a change in pull request #14145:
URL: https://github.com/apache/flink/pull/14145#discussion_r527904542
##########
File path: docs/dev/table/hive/hive_read_write.md
##########
@@ -22,119 +22,199 @@ specific language governing permissions and limitations
under the License.
-->
-Using the `HiveCatalog` and Flink's connector to Hive, Flink can read and write from Hive data as an alternative to Hive's batch engine.
-Be sure to follow the instructions to include the correct [dependencies]({{ site.baseurl }}/dev/table/hive/#depedencies) in your application.
-And please also note that Hive connector only works with blink planner.
+Using the HiveCatalog, Apache Flink can be used for unified `BATCH` and `STREAMING` processing of Apache
+Hive Tables. This means Flink can be used as a more performant alternative to Hive’s batch engine,
+or to continuously read and write data into and out of Hive tables to power real-time data
+warehousing applications.
+
+<div class="alert alert-info">
+ <b>IMPORTANT:</b> Reading and writing to and from Apache Hive is only supported by the Blink table planner.
+</div>
* This will be replaced by the TOC
{:toc}
-## Reading From Hive
+## Reading
-Assume Hive contains a single table in its `default` database, named people that contains several rows.
+Flink supports reading data from Hive in both `BATCH` and `STREAMING` modes. When run as a `BATCH`
+application, Flink will execute its query over the state of the table at the point in time when the
+query is executed. `STREAMING` reads will continuously monitor the table and incrementally fetch
+new data as it is made available. Flink will read tables as bounded by default.
+
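+A minimal sketch of the default bounded read, assuming a Hive table named `hive_table` already
+exists in the current catalog (the name is illustrative):
+
+{% highlight sql %}
+-- With no extra options, the table is read as a bounded source:
+-- the query sees the table's state at the point in time it is executed.
+SELECT * FROM hive_table;
+{% endhighlight %}
+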
+`STREAMING` reads support consuming both partitioned and non-partitioned tables.
+For partitioned tables, Flink will monitor the generation of new partitions, and read
+them incrementally when available. For non-partitioned tables, Flink will monitor the generation
+of new files in the folder and read new files incrementally.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Key</th>
+ <th class="text-left" style="width: 15%">Default</th>
+ <th class="text-left" style="width: 10%">Type</th>
+ <th class="text-left" style="width: 55%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>streaming-source.enable</h5></td>
+ <td style="word-wrap: break-word;">false</td>
+ <td>Boolean</td>
+ <td>Enable streaming source or not. NOTE: Please make sure that each partition/file is written atomically; otherwise the reader may get incomplete data.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.monitor-interval</h5></td>
+ <td style="word-wrap: break-word;">1 m</td>
+ <td>Duration</td>
+ <td>Time interval for consecutively monitoring partition/file.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.consume-order</h5></td>
+ <td style="word-wrap: break-word;">create-time</td>
+ <td>String</td>
+ <td>The consume order of the streaming source. Supports create-time and partition-time. create-time compares the partition/file creation time; this is not the partition creation time in the Hive metastore, but the folder/file modification time in the filesystem. partition-time compares the time represented by the partition name; if the partition folder is somehow updated, e.g. a new file is added into the folder, it can affect how the data is consumed. For non-partitioned tables, this value should always be 'create-time'.</td>
+ </tr>
+ <tr>
+ <td><h5>streaming-source.consume-start-offset</h5></td>
+ <td style="word-wrap: break-word;">1970-00-00</td>
+ <td>String</td>
+ <td>Start offset for streaming consuming. How to parse and compare offsets depends on the consume order. For create-time and partition-time, the value should be a timestamp string (yyyy-[m]m-[d]d [hh:mm:ss]). For partition-time, the partition time extractor is used to extract the time from the partition name.</td>
+ </tr>
+ </tbody>
+</table>
+
+[SQL Hints]({% link dev/table/sql/hints.md %}) can be used to apply configurations to a Hive table
+without changing its definition in the Hive metastore.
+
+{% highlight sql %}
+
+SELECT *
+FROM hive_table
+/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2020-05-20') */;
-{% highlight bash %}
-hive> show databases;
-OK
-default
-Time taken: 0.841 seconds, Fetched: 1 row(s)
-
-hive> show tables;
-OK
-Time taken: 0.087 seconds
-
-hive> CREATE TABLE mytable(name string, value double);
-OK
-Time taken: 0.127 seconds
-
-hive> SELECT * FROM mytable;
-OK
-Tom 4.72
-John 8.0
-Tom 24.2
-Bob 3.14
-Bob 4.72
-Tom 34.9
-Mary 4.79
-Tiff 2.72
-Bill 4.33
-Mary 77.7
-Time taken: 0.097 seconds, Fetched: 10 row(s)
{% endhighlight %}
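+
+The other options from the table above can be combined in the same way. A hedged sketch, again
+assuming the illustrative `hive_table`, that also overrides the consume order and monitor interval:
+
+{% highlight sql %}
+SELECT *
+FROM hive_table
+/*+ OPTIONS('streaming-source.enable'='true',
+            'streaming-source.consume-order'='partition-time',
+            'streaming-source.monitor-interval'='5 m') */;
+{% endhighlight %}
+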
-With the data ready your can connect to Hive [connect to an existing Hive installation]({{ site.baseurl }}/dev/table/hive/#connecting-to-hive) and begin querying.
+**Notes**
-{% highlight bash %}
+- The monitor strategy is to scan all directories/files currently in the location path. Many partitions may cause performance degradation.
+- Streaming reads for non-partitioned tables require that each file be written atomically into the target directory.
+- Streaming reads for partitioned tables require that each partition be added atomically, as seen by the Hive metastore. If not, new data added to an existing partition will be consumed.
+- Streaming reads do not support watermark grammar in Flink DDL. These tables cannot be used for window operators.
-Flink SQL> show catalogs;
-myhive
-default_catalog
+## Reading Hive Views
-# ------ Set the current catalog to be 'myhive' catalog if you haven't set it in the yaml file ------
+Flink is able to read from Hive-defined views, but some limitations apply:
-Flink SQL> use catalog myhive;
+1) The Hive catalog must be set as the current catalog before you can query the view.
+This can be done by either `tableEnv.useCatalog(...)` in Table API or `USE CATALOG ...` in SQL Client, as shown in the sketch after this list.
-# ------ See all registered database in catalog 'mytable' ------
+2) Hive and Flink SQL have different syntax, e.g. different reserved keywords and literals.
+Make sure the view’s query is compatible with Flink grammar.
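+
+A minimal sketch in SQL, assuming a Hive catalog registered as `myhive` and a view named `hive_view`
+(both names are illustrative):
+
+{% highlight sql %}
+-- make the Hive catalog the current catalog, then query the view
+USE CATALOG myhive;
+
+SELECT * FROM hive_view;
+{% endhighlight %}
+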
-Flink SQL> show databases;
-default
+### Temporal Table Join
Review comment:
       There's an open PR (#14152) to expand the temporal table join documentation; why don't you comment there?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]