dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458029793



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most 
valuable asset in many companies: it's always the base for — and product of — 
any analysis or business logic. With an ever-growing number of people working 
with data, it's a common practice for companies to build self-service platforms 
with the goal of democratizing data access across different teams and, especially, to enable users from any background to be independent in their
data needs. In such environments, metadata management becomes a crucial aspect. 
Without it, users often work blindly, spending too much time searching for 
datasets and their location, figuring out data formats and similar cumbersome 
tasks.
+
+Frequently, companies start building a data platform with a metastore, 
catalog, or schema registry of some sort already in place. Those let you 
clearly separate making the data available from consuming it. That separation 
has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable shifts the focus to building new models/pipelines rather than to data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. 
For example, you can make the schema of the dataset publicly available, but 
limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much 
easier to ensure compliance with GDPR and similar regulations and legal 
requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known 
in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data: what columns it has, what constraints (e.g. keys) updates should be performed on, which fields can act as time attributes, what the rules for watermark generation are, and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How 
do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan for your query. For example, the planner can choose the best join algorithm based on the number of rows in the joined datasets.
+
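+In Flink SQL, these properties typically come together in a single table definition that a catalog can store. Below is only an illustrative sketch: the table name, Kafka topic and broker address are made up, and the exact option keys depend on the Flink version and the connector in use.
+
+```sql
+-- schema: columns, a time attribute and its watermark strategy
+CREATE TABLE orders (
+    order_id   BIGINT,
+    product    STRING,
+    amount     DECIMAL(10, 2),
+    order_time TIMESTAMP(3),
+    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
+) WITH (
+    -- location: which external system and which topic to read from
+    'connector' = 'kafka',
+    'topic' = 'orders',
+    'properties.bootstrap.servers' = 'localhost:9092',
+    -- format: how the records are serialized
+    'format' = 'json'
+);
+```
+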
+Catalogs don’t have to be limited to the metadata of datasets. You can usually 
store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can 
be helpful in different use cases. Instead of having to create them in each 
place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, 
but want to provide a recipe for creating it from other sources instead.
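+
+For illustration only, a sketch of what these can look like in Flink SQL (the function class and the view below are made up, and reuse the hypothetical `orders` table from the earlier sketch):
+
+```sql
+-- a reusable, domain specific function backed by a JVM class on the classpath
+CREATE FUNCTION normalize_currency AS 'com.example.udf.NormalizeCurrency' LANGUAGE JAVA;
+
+-- a stored query: a recipe for deriving a data set from other sources
+CREATE VIEW large_orders AS
+SELECT order_id, product, normalize_currency(amount) AS amount_eur
+FROM orders
+WHERE amount > 100;
+```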
+
+## Catalog support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow you to integrate Flink with various catalog implementations. With the help of those
APIs, you can query tables in Flink that were created in your external catalogs 
(e.g. Hive Metastore). Additionally, depending on the catalog implementation, 
you can create new objects such as tables or views from Flink, reuse them 
across different jobs, and possibly even use them in other tools compatible 
with that catalog. As of Flink 1.11, there are two catalog implementations 
supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)
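+
+As a rough sketch of how such catalogs can be registered from SQL (the Hive configuration directory and the Postgres connection settings below are placeholders, and the exact properties may differ between Flink versions):
+
+```sql
+CREATE CATALOG hive WITH (
+    'type' = 'hive',
+    'hive-conf-dir' = '/opt/hive-conf'
+);
+
+CREATE CATALOG postgres WITH (
+    'type' = 'jdbc',
+    'property-version' = '1',
+    'base-url' = 'jdbc:postgresql://localhost:5432/',
+    'default-database' = 'postgres',
+    'username' = 'postgres',
+    'password' = 'example'
+);
+```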

Review comment:
   Background: I am not the author of the Postgres catalog and I did not directly participate in the design.
   
   The idea of catalogs is that you could do both:
   1. store Flink specific metadata
   2. query the non-specific external data
   
   Postgres Catalog implements only the latter. In Flink, a connector is imo rather well defined: it is either a source or a sink that you can use for reading/writing data, and thus I don't think it fits here. Integration, in my opinion, is too broad and not well defined. The purpose of a catalog is to read/write/make use of metadata.
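   
   A quick sketch of what that looks like in practice (catalog and table names are made up, assuming Flink 1.11 behaviour): once a Postgres catalog is registered, you can browse and query the tables that already exist in Postgres, but you cannot create new Flink-specific objects in it, since it is read-only for now.
   
   ```sql
   -- assumes a catalog named `postgres` has been registered, e.g. via CREATE CATALOG
   USE CATALOG postgres;
   SHOW TABLES;            -- lists the existing Postgres tables
   SELECT * FROM orders;   -- reads an existing table in the default/public schema (hypothetical name)
   -- CREATE TABLE ... would fail here, because the catalog is read-only
   ```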
   
   It is Postgres-only simply because it was implemented that way. I was also surprised when I was writing the blogpost and preparing the demo. I think there is a lot of potential for better unifying the current implementation across different DBs.
   
   In the post I tried to give a high-level idea of why you should think of catalogs when working with SQL, and to give an overview, in the form of an e2e example, of what you can achieve in Flink. My intention was not to give a comprehensive overview of all available features. Nevertheless, I am open to suggestions if you think a differently oriented post would make more sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

