[
https://issues.apache.org/jira/browse/FLINK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295353#comment-16295353
]
Timo Walther commented on FLINK-8240:
-------------------------------------
Hi everyone,
I think we don't need a design document for it, but it would be great to hear
some opinions. I introduced descriptors that describe connectors, encodings,
and time attributes.
My current API design looks like this:
{code}
tableEnv
  .from(
    FileSystem()
      .path("/path/to/csv"))
  .withEncoding(
    CSV()
      .field("myfield", Types.STRING)
      .field("myfield2", Types.INT)
      .quoteCharacter(';')
      .fieldDelimiter("#")
      .lineDelimiter("\r\n")
      .commentPrefix("%%")
      .ignoreFirstLine()
      .ignoreParseErrors())
  .withRowtime(
    Rowtime()
      .onField("rowtime")
      .withTimestampFromDataStream()
      .withWatermarkFromDataStream())
  .withProctime(
    Proctime()
      .onField("myproctime"))
  .toTableSource()
{code}
These descriptors are converted into pure key-value properties, such as:
{code}
"connector.filesystem.path" -> "/myfile"
"encoding.csv.fields.0.name" -> "field1",
"encoding.csv.fields.0.type" -> "STRING",
"encoding.csv.fields.1.name" -> "field2",
"encoding.csv.fields.1.type" -> "TIMESTAMP",
"encoding.csv.fields.2.name" -> "field3",
"encoding.csv.fields.2.type" -> "ANY(java.lang.Class)",
"encoding.csv.fields.3.name" -> "field4",
"encoding.csv.fields.3.type" -> "ROW(test INT, row VARCHAR)",
"encoding.csv.line-delimiter" -> "^"
{code}
The properties are fully expressed as strings, which also allows saving them in
configuration files. This might be interesting for FLINK-7594.
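Just to illustrate the conversion (the class and method names here are made up
for this sketch, not an existing API), a descriptor could write itself into
such a flat property map roughly like this:
{code}
import scala.collection.mutable

// Hypothetical base trait: every descriptor emits flat string properties.
trait Descriptor {
  def addProperties(props: mutable.Map[String, String]): Unit
}

// Illustrative CSV encoding descriptor that produces the keys shown above.
class CSV extends Descriptor {
  private val fields = mutable.ArrayBuffer[(String, String)]()
  private var lineDelim: Option[String] = None

  def field(name: String, tpe: String): CSV = { fields += (name -> tpe); this }
  def lineDelimiter(delim: String): CSV = { lineDelim = Some(delim); this }

  override def addProperties(props: mutable.Map[String, String]): Unit = {
    // Indexed keys such as "encoding.csv.fields.0.name" keep the map flat.
    fields.zipWithIndex.foreach { case ((name, tpe), i) =>
      props += (s"encoding.csv.fields.${i}.name" -> name)
      props += (s"encoding.csv.fields.${i}.type" -> tpe)
    }
    lineDelim.foreach(d => props += ("encoding.csv.line-delimiter" -> d))
  }
}

// e.g. new CSV().field("field1", "STRING").lineDelimiter("^").addProperties(props)
{code}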
The question is how we want to translate the properties into actual table
sources. Or more precisely: how do we want to supply converters? Should they be
part of the {{TableSource}} interface? Or should table sources be annotated
with some factory class? Right now we have similar functionality for external
catalogs, but it is too specific and does not consider encodings or time
attributes. Furthermore, it would be better to use Java {{ServiceLoader}}s
instead of classpath scanning; this mechanism is also used for Flink's file
systems.
So my idea would be to have a class {{TableFactory}} that declares a connector
(e.g. "kafka_0.10") and supported encodings such as "csv" and "avro" (similar
to FLINK-7643). All built-in table sources would need to provide such a
factory.
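To make that more concrete, here is a rough sketch of what such a factory and
its discovery could look like (all names and signatures are hypothetical, just
to illustrate the proposal):
{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

trait TableSource // stand-in for org.apache.flink.table.sources.TableSource

// Hypothetical SPI: a factory declares which connector it handles and which
// encodings it supports, and builds a TableSource from the flat property map.
trait TableFactory {
  def connector: String                 // e.g. "kafka_0.10"
  def encodings: Seq[String]            // e.g. Seq("csv", "avro")
  def create(properties: Map[String, String]): TableSource
}

object TableFactoryService {
  // Implementations would be registered in
  // META-INF/services/org.apache.flink.table.TableFactory, i.e. the same
  // java.util.ServiceLoader mechanism that Flink uses for its file systems.
  def find(connector: String, encoding: String): TableFactory =
    ServiceLoader.load(classOf[TableFactory]).asScala
      .find(f => f.connector == connector && f.encodings.contains(encoding))
      .getOrElse(throw new IllegalArgumentException(
        s"No TableFactory for connector '$connector' and encoding '$encoding'"))
}
{code}
A factory found this way could then receive the full property map and
instantiate the matching table source.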
What do you think? [~fhueske] [~jark] [~wheat9] [~ykt836]
> Create unified interfaces to configure and instantiate TableSources
> -------------------------------------------------------------------
>
> Key: FLINK-8240
> URL: https://issues.apache.org/jira/browse/FLINK-8240
> Project: Flink
> Issue Type: New Feature
> Components: Table API & SQL
> Reporter: Timo Walther
> Assignee: Timo Walther
>
> At the moment every table source has different ways of configuration and
> instantiation. Some table sources are tailored to a specific encoding (e.g.,
> {{KafkaAvroTableSource}}, {{KafkaJsonTableSource}}) or only support one
> encoding for reading (e.g., {{CsvTableSource}}). Each of them might implement
> a builder or support table source converters for external catalogs.
> The table sources should have a unified interface for discovery, defining
> common properties, and instantiation. The {{TableSourceConverters}} provide
> similar functionality but use an external catalog. We might generalize this
> interface.
> In general a table source declaration depends on the following parts:
> {code}
> - Source
>   - Type (e.g. Kafka, Custom)
>   - Properties (e.g. topic, connection info)
> - Encoding
>   - Type (e.g. Avro, JSON, CSV)
>   - Schema (e.g. Avro class, JSON field names/types)
> - Rowtime descriptor/Proctime
>   - Watermark strategy and watermark properties
>   - Time attribute info
> - Bucketization
> {code}
> This issue needs a design document before implementation. Any discussion is
> very welcome.