[jira] [Commented] (BEAM-14) Add data integration DSL

JIRA Wed, 06 Apr 2016 00:02:02 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-14?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227859#comment-15227859
 ]


Jean-Baptiste Onofré commented on BEAM-14:
------------------------------------------

Yes, I also thought about an XML DSL on top of the model.

It would allow users to simply describe pipelines, staying close to the model.

For instance, something like:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<pipeline>
    <!-- Source and sink endpoints -->
    <input name="txtInput">
        <attribute name="uri" value="txt"/>
        <attribute name="fileLocation" value="/tmp/kinglear.txt"/>
    </input>

    <output name="txtOutput">
        <attribute name="uri" value="txt"/>
        <attribute name="fileLocation" value="/tmp/count.txt"/>
    </output>

    <!-- Processing steps, each backed by a handler class -->
    <step name="ExtractWords">
        <attribute name="type" value="ParDo"/>
        <handler>
            <className>org.apache.beam.dsl.WordExtracter</className>
        </handler>
    </step>
    <step name="CountWords">
        <attribute name="type" value="Count"/>
        <handler>
            <className>org.apache.beam.dsl.WordCounter</className>
        </handler>
    </step>

    <step name="MapWords">
        <attribute name="type" value="Map"/>
        <handler>
            <className>org.apache.beam.dsl.WordMapper</className>
        </handler>
    </step>

    <!-- Wiring: input, then each step in order, then output -->
    <pipe>
        <from uri="txtInput" />
        <to uri="ExtractWords" />
        <to uri="CountWords" />
        <to uri="MapWords" />
        <to uri="txtOutput" />
    </pipe>

</pipeline>
{code}
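
To give an idea of how such a descriptor could be consumed, here is a minimal sketch that reads the step handlers and the pipe order with the JDK's DOM parser. The class XmlPipelineDescriptor and its fields are hypothetical names used only for illustration; nothing like this exists in Beam, and the sketch only extracts names, it does not build or run a pipeline.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the XML descriptor; an illustration, not a Beam API. */
final class XmlPipelineDescriptor {
    /** Step name mapped to its handler class name, in declaration order. */
    final Map<String, String> handlers = new LinkedHashMap<>();
    /** Endpoint and step names in the order the pipe element chains them. */
    final List<String> route = new ArrayList<>();

    static XmlPipelineDescriptor parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XmlPipelineDescriptor descriptor = new XmlPipelineDescriptor();

            // Collect every <step> together with its <className>.
            // Assumes each step declares exactly one handler class.
            NodeList steps = doc.getElementsByTagName("step");
            for (int i = 0; i < steps.getLength(); i++) {
                Element step = (Element) steps.item(i);
                String className = step.getElementsByTagName("className")
                        .item(0).getTextContent().trim();
                descriptor.handlers.put(step.getAttribute("name"), className);
            }

            // Walk the children of the first <pipe>: one <from>, then <to> elements.
            Element pipe = (Element) doc.getElementsByTagName("pipe").item(0);
            NodeList children = pipe.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child instanceof Element) {
                    descriptor.route.add(((Element) child).getAttribute("uri"));
                }
            }
            return descriptor;
        } catch (Exception e) {
            // A descriptor that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Invalid pipeline descriptor", e);
        }
    }
}
```

A real implementation would then resolve each handler class and translate the route into apply() calls on the pipeline; that part is left out on purpose.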

> Add data integration DSL
> ------------------------
>
>                 Key: BEAM-14
>                 URL: https://issues.apache.org/jira/browse/BEAM-14
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-ideas
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
>
> Even if users would still be able to use the API directly, it would be great 
> to provide a DSL on top of the API covering batch and streaming data 
> processing as well as data integration.
> Instead of designing a pipeline as a chain of apply() calls wrapping functions 
> (DoFn), we can provide a fluent DSL allowing users to directly leverage 
> turnkey functions.
> For instance, a user would be able to design a pipeline like:
> {code}
> .from("kafka:localhost:9092?topic=foo").reduce(...).split(...).wiretap(...).map(...).to("jms:queue:foo….");
> {code}
> The DSL will also allow users to reuse existing pipelines, for instance:
> {code}
> .from("cxf:...").reduce().pipeline("other").map().to("kafka:localhost:9092?topic=foo&acks=all")
> {code}
> This means that we will have to create an IO sink that can trigger the 
> execution of a target pipeline: from("trigger:other") triggers the 
> pipeline execution when another pipeline design invokes 
> pipeline("other"). We can also imagine mixing runners: the pipeline() 
> can be on one runner and the from("trigger:other") on another. 
> It's not trivial, but it will give strong flexibility and key value for Beam.
> In a second step, we can provide DSLs in different languages (the first one 
> would be Java, but why not provide XML, Akka, and Scala DSLs?).
> We can note in the previous examples that the DSL would also provide data 
> integration support to Beam in addition to data processing. Data integration 
> is an extension of the Beam API to support some Enterprise Integration Patterns 
> (EIPs). As we would need metadata for data integration (even if metadata can 
> also be interesting in stream/batch data processing pipelines), we can provide 
> a DataxMessage built on top of PCollection. A DataxMessage would contain:
> * structured headers
> * binary payload
> For instance, the headers can contain an Avro schema to describe the payload.
> The headers can also contain useful information coming from the IO Source 
> (for instance the partition/path where the data comes from, …).
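
To make the fluent style from the description concrete, here is a minimal sketch in plain Java. FluentPipeline and all of its methods are hypothetical names for illustration only; the builder just records the chain as text and does not touch Beam or execute anything. The function names passed to reduce() and map() are placeholders as well.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical fluent builder; it records the chain and runs nothing. */
final class FluentPipeline {
    private final List<String> stages = new ArrayList<>();

    private FluentPipeline() {}

    /** Starts a chain at a source endpoint URI. */
    static FluentPipeline from(String uri) {
        return new FluentPipeline().add("from", uri);
    }

    FluentPipeline reduce(String functionName) {
        return add("reduce", functionName);
    }

    FluentPipeline map(String functionName) {
        return add("map", functionName);
    }

    /** References another pipeline design by name. */
    FluentPipeline pipeline(String name) {
        return add("pipeline", name);
    }

    /** Ends the chain at a sink endpoint URI. */
    FluentPipeline to(String uri) {
        return add("to", uri);
    }

    private FluentPipeline add(String kind, String argument) {
        stages.add(kind + "(" + argument + ")");
        return this; // returning this is what makes the calls chainable
    }

    /** Read-only view of the recorded stages, in call order. */
    List<String> stages() {
        return Collections.unmodifiableList(stages);
    }
}
```

A chain such as FluentPipeline.from("kafka:localhost:9092?topic=foo").reduce("sum").pipeline("other").to("kafka:localhost:9092?topic=bar") then yields four recorded stages. Translating them into actual transforms, and wiring pipeline("other") to a trigger source, is the non-trivial part and is not attempted here.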

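The DataxMessage from the description could look roughly like the sketch below. The description only says it carries structured headers and a binary payload; modelling the headers as a map of strings is an assumption, as are the accessor names and the header key in the usage note. How it would sit "on top of PCollection" (presumably as the element type) is left open.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical shape of a DataxMessage: headers plus an opaque payload. */
final class DataxMessage {
    private final Map<String, String> headers;
    private final byte[] payload;

    DataxMessage(Map<String, String> headers, byte[] payload) {
        // Defensive copies keep the message immutable once created.
        this.headers = Collections.unmodifiableMap(new LinkedHashMap<>(headers));
        this.payload = payload.clone();
    }

    /** Metadata, for example a schema describing the payload. */
    Map<String, String> getHeaders() {
        return headers;
    }

    /** Returns a copy so callers cannot modify the stored bytes. */
    byte[] getPayload() {
        return payload.clone();
    }
}
```

An IO source could, for example, put the originating partition under a header key such as "source.partition" (a made-up key) before handing the message to the next step.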


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
