[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Yi Pan (Data Infrastructure) (JIRA) Mon, 09 Feb 2015 16:20:39 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313238#comment-14313238
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------

We had a discussion on some remaining SQL language related issues last Friday, 
and here is my summary:
# Support for PARTITION
## Samza needs to know PARTITION key and count passed down by the SQL 
parser/planner
## PARTITION key can be added as an extension to SQL in Calcite. If missing, 
Samza will choose a random partition
## PARTITION count is a system property and should not be enforced in SQL 
grammar. There are three cases we need to handle:
### topic already exists in Kafka. Samza will only need to read it from Kafka 
metadata.
### topic does not exist and we allow auto-creation of topic. Samza will 
auto-create the topic w/ default partition count
### topic does not exist and auto-creation is not allowed. The user must first 
perform an admin op to create the topic. Then Samza can get the PARTITION 
count from Kafka
# Schema and Metadata support
## Schema definition and DDL
### We have decided that a metadata registry to store schema definitions from 
DDL is optional. The impact is whether we can do compile-time validation or 
only runtime validation: compile-time validation is possible when schema 
metadata is supplied.
### Two examples: with an Avro schema registry, we can implement a schema 
metadata interface s.t. the Calcite validation module can be applied to perform 
compile-time validation; with JSON, the validation would be skipped and 
we would get validation errors at runtime instead.
## Tuple schema
### If we define a tuple schema in a stream, should we support multiple schemas 
in a single stream? There seem to be possible non-SQL use cases for multiple 
schemas in a single stream, e.g. splitting a stream into multiple streams 
according to schema. It seems reasonable to ask the Samza physical operator 
to support multiple schemas in a single stream (i.e. the schema is associated 
w/ the tuple) while no SQL language support is needed. The feature can 
potentially be used by other DSLs that may implement multiple schemas in a 
single stream.
# Window syntax and semantics
## How much syntax support do we need from the SQL language? I opened a ticket 
to track that: SAMZA-551
## Tuple-based vs time-based windows. I opened a ticket to track that as well: 
SAMZA-552
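
To make the three PARTITION count cases in item 1 concrete, here is a rough 
Java sketch of the decision logic. It is only an illustration under 
assumptions: TopicAdmin, InMemoryTopicAdmin and resolvePartitionCount are 
hypothetical names made up for this example and are not Samza or Kafka APIs, 
and the in-memory map merely stands in for Kafka metadata.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

final class PartitionCountSketch {

    /** Hypothetical stand-in for topic metadata access; not a real Samza or Kafka API. */
    interface TopicAdmin {
        /** Returns the partition count, or empty if the topic does not exist. */
        OptionalInt partitionCount(String topic);

        void createTopic(String topic, int partitions);
    }

    /** In-memory implementation, used only to keep the sketch runnable. */
    static final class InMemoryTopicAdmin implements TopicAdmin {
        private final Map<String, Integer> topics = new HashMap<>();

        @Override
        public OptionalInt partitionCount(String topic) {
            Integer count = topics.get(topic);
            return count == null ? OptionalInt.empty() : OptionalInt.of(count);
        }

        @Override
        public void createTopic(String topic, int partitions) {
            topics.put(topic, partitions);
        }
    }

    /** Resolves the partition count following the three cases from the discussion. */
    static int resolvePartitionCount(TopicAdmin admin, String topic,
                                     boolean autoCreateAllowed, int defaultPartitions) {
        OptionalInt existing = admin.partitionCount(topic);
        if (existing.isPresent()) {
            // Case 1: topic already exists, so read the count from metadata.
            return existing.getAsInt();
        }
        if (autoCreateAllowed) {
            // Case 2: topic is missing and auto-creation is allowed,
            // so create it with the default partition count.
            admin.createTopic(topic, defaultPartitions);
            return defaultPartitions;
        }
        // Case 3: topic is missing and auto-creation is not allowed,
        // so the user has to create the topic first.
        throw new IllegalStateException(
            "Topic '" + topic + "' does not exist and auto-creation is disabled");
    }

    public static void main(String[] args) {
        InMemoryTopicAdmin admin = new InMemoryTopicAdmin();
        admin.createTopic("existing-topic", 8);
        System.out.println(resolvePartitionCount(admin, "existing-topic", false, 4));
        System.out.println(resolvePartitionCount(admin, "new-topic", true, 4));
    }
}
```

Note that the count never appears in the query itself here: it comes either 
from existing topic metadata or from a system-level default, which matches 
the point that PARTITION count should not be enforced in the SQL grammar.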

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>         Attachments: StreamSQLforSAMZA-v0.1.docx.docx
>
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
