MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457513839
##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+ name: "Dawid Wysakowicz"
+ twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most
valuable asset in many companies: it's always the basis for, and the product
of, any analysis or business logic. With an ever-growing number of people
working with data, it's common practice for companies to build self-service
platforms with the goal of democratising access to data across different
teams and — especially — enabling users from any background to be independent
in their data needs. In such environments, metadata management becomes a
crucial aspect. Without it, users often work blindly, spending too much time
searching for datasets and their location, figuring out data formats, and
performing similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with
a metastore, catalog, or schema registry of some sort in place. These let you
clearly separate making data available from consuming it. That separation has
a few benefits:
+* improved productivity - The most obvious one. Making data reusable shifts
the focus to building new models/pipelines rather than data cleansing and
discovery.
+* security - You can control access to certain features of the data. For
example, you can make the schema of a dataset publicly available, but limit
actual access to the underlying data to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much
easier to ensure compliance with GDPR and similar regulations.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be
known in order to consume them. Those include:
+* schema - It describes the actual contents of the data: what columns it
has, which constraints (e.g. keys) updates should be performed on, which
fields can act as time attributes, what the rules for watermark generation
are, and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do
you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - We can also store additional information that can be useful
when creating an execution plan for a query. For example, the best join
algorithm can be chosen based on the number of rows in the joined datasets. A
short sketch of how these properties come together follows the list.
+
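+Here is that sketch: a minimal Flink SQL table definition that carries all
of these properties except statistics. It is an illustration only; the table
name, columns, topic, and broker address are made-up placeholders.
+```sql
+CREATE TABLE orders (
+  order_id BIGINT,
+  amount   DECIMAL(10, 2),
+  ts       TIMESTAMP(3),
+  -- schema: a rule for watermark generation on a time attribute
+  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
+) WITH (
+  'connector' = 'kafka',                          -- location: the external system
+  'topic' = 'orders',                             -- location: which topic to read
+  'properties.bootstrap.servers' = 'kafka:9092',  -- location: how to connect
+  'format' = 'json'                               -- format: serialization of records
+);
+```
+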
+Catalogs don’t have to be limited to the metadata of datasets. You can usually
store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain-specific functions that can be
helpful in different use cases. Instead of having to create them in each
place separately, you can create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set,
but want to provide a recipe for creating it from other sources instead, as
sketched below.
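+
+As a hedged sketch of both (the function name, implementing class, and view
below are hypothetical):
+```sql
+-- Register a domain-specific function once so that others can reuse it.
+CREATE FUNCTION normalize_email AS 'com.example.udf.NormalizeEmail';
+
+-- Store a query as a view: a recipe for deriving a data set from other
+-- sources without persisting its results.
+CREATE VIEW large_orders AS
+SELECT * FROM orders WHERE amount > 100;
+```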
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow you to
integrate it with various catalog implementations. With the help of those
APIs, you can query tables in Flink that were created in your external
catalogs (e.g. the Hive Metastore). Additionally, depending on the catalog
implementation, you can create new objects such as tables or views from
Flink, reuse them across different jobs, and possibly even use them in other
tools compatible with that catalog. As of Flink 1.11, two catalog
implementations are supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine
that requires other systems to consume input from and write its output to.
This means that Flink does not own the lifecycle of the data. Integration
with catalogs does not change that. Flink uses catalogs for metadata
management only.
+
+All you need to do to start querying your tables defined in either of these
metastores is to create corresponding catalogs with connection parameters. Once
this is done, you can use them the way you would in any relational database
management system.
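+
+As a rough sketch, assuming placeholder connection parameters (the Hive
configuration directory, host, and credentials below are made up), this can
look as follows:
+```sql
+-- A catalog backed by an existing Hive Metastore.
+CREATE CATALOG hive WITH (
+  'type' = 'hive',
+  'hive-conf-dir' = '/opt/hive-conf'
+);
+
+-- A Postgres catalog (read-only preview).
+CREATE CATALOG postgres WITH (
+  'type' = 'jdbc',
+  'base-url' = 'jdbc:postgresql://localhost:5432/',
+  'default-database' = 'postgres',
+  'username' = 'postgres',
+  'password' = 'example'
+);
+
+-- From here on, catalogs can be browsed much like databases in an RDBMS.
+USE CATALOG hive;
+SHOW TABLES;
+```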
Review comment:
```suggestion
All you need to do to start querying your tables defined in either of these
metastores is to create the corresponding catalogs with connection parameters.
Once this is done, you can use them the way you would in any relational
database management system.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]