[jira] [Commented] (SAMZA-482) Identify the set of operators for SQL on Samza

Yi Pan (Data Infrastructure) (JIRA) Wed, 07 Jan 2015 13:29:59 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268275#comment-14268275
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-482:
----------------------------------------------------

Re-typing what I lost yesterday. :(

[~criccomini], thanks for the great summary of the discussion on the first 
draft of the operator API! I will try to focus on 2) and 3) first.

{quote}
2. Is the intent of the API to make it usable for developers, or just as an 
implementation detail of SQL/DSLs?
Regarding (2), an example would be a developer that just wants to use the 
"Join" operator inside a StreamTask, and then put some custom logic before and 
after the join. Qualitatively, the API seems a bit cumbersome for a random 
developer to use. I think part of the complexity might come from using specs, 
rather than just directly passing parameters into methods. Another thought here 
is how the developer might get the messages back out. It seems like they'd have 
to write a custom operator that buffered the messages, so they could be 
retrieved.
{quote}

I agree that the overhead probably comes from the spec classes. Their sole 
purpose is to encapsulate the configuration/specification of each operator 
s.t. the factory class can have a unified interface for creating all the 
different types of operators. They also simplify the operator constructors, 
since a constructor can keep the same function signature while the 
specification of the operator evolves over time. For random developers who do 
not want the overhead of spec classes, maybe we can provide simpler 
constructors on top of the built-in operators, so that the user can pass the 
parameters directly to the constructor to instantiate a simple operator. I will 
add application examples illustrating how a random developer can use this 
simpler way of instantiating the operators s.t. we can see how it works.
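
A minimal sketch of what the two constructor styles could look like side by 
side. The names JoinSpec and JoinOperator and their fields are hypothetical, 
chosen only for illustration; they are not the classes from the draft API.

```java
// Hypothetical spec class: bundles the configuration of a join operator so
// that a factory can create operators through one uniform interface.
final class JoinSpec {
    final String leftStream;
    final String rightStream;
    final String joinKey;

    JoinSpec(String leftStream, String rightStream, String joinKey) {
        this.leftStream = leftStream;
        this.rightStream = rightStream;
        this.joinKey = joinKey;
    }
}

// Hypothetical built-in operator offering both constructor styles.
final class JoinOperator {
    private final JoinSpec spec;

    // Spec-based constructor: the signature stays the same even if JoinSpec
    // gains new fields over time.
    JoinOperator(JoinSpec spec) {
        this.spec = spec;
    }

    // Convenience constructor: parameters are passed directly and the spec
    // is built internally, so the caller never touches a spec class.
    JoinOperator(String leftStream, String rightStream, String joinKey) {
        this(new JoinSpec(leftStream, rightStream, joinKey));
    }

    JoinSpec getSpec() {
        return spec;
    }
}

final class SimpleConstructorSketch {
    public static void main(String[] args) {
        // Both forms produce an operator with the same configuration.
        JoinOperator viaSpec =
            new JoinOperator(new JoinSpec("orders", "shipments", "orderId"));
        JoinOperator direct = new JoinOperator("orders", "shipments", "orderId");
        System.out.println(
            viaSpec.getSpec().joinKey.equals(direct.getSpec().joinKey));
    }
}
```

Delegating the convenience constructor to the spec-based one keeps a single 
code path inside the operator, so the factory and the direct caller end up 
with identically configured instances.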

{quote}
3. The proposed API uses operators that are aware of each other. The routing 
happens within the operators (this.nextOp.process(tuple)). The alternative is 
to have the routing happen outside of the operators.
Question (3) seems related to the mutable setter methods as well. If routing is 
handled outside of the operators, it seems that the operators could be much 
more immutable, since they no longer need setter methods other than init and 
process. One trade off here would be that operators that accrue large outgoing 
message buffers within a single call might run out of memory, since the routing 
logic doesn't have a chance to run until the operator returns (this is 
essentially the equivalent of what we had with buffering messages in 
MessageCollector, before we switched to immediately sending messages when 
collector.send is called). Perhaps there are work arounds that would make this 
approach viable, though. I'll have to think about it.
{quote}

Indeed, the issue raised here is the root cause of the complexity in the 
different types of operators, since the current implementation requires each 
operator to be aware of the next op s.t. its own output matches the next op's 
input. I have gone down the path of stripping the routing part out of the 
operator class and creating an operator routing context to handle the 
connections between operators. I personally like this version better because it 
simplifies the operator classes and removes the mutable setters.

There are still two ways to control the execution of the operators, as 
[~criccomini] mentioned above: 1) call operator.process() / timeout() and let 
the operator invoke the next operator via the routing context; or 2) call 
operator.process() / timeout() directly from the outside routing context and 
return w/o invoking the next operator. I chose 1) in the implementation for the 
following reason: in most cases, whether the input values should trigger an 
output to the next operator is decided by the operator's internal state and 
logic. Moving this logic outside the operator class seems a bit odd to me. 
Hence, I split the routing into two parts: i) finding the next operator and 
invoking it is done by the routing context; ii) changing the internal state and 
deciding whether to send the output via the routing context is done by the 
operator. As a result, the operator's process() and timeout() functions take 
one additional parameter: the routing context object. For example, when the 
operator decides that it should send the output to the next relation operator, 
it simply calls 
OperatorRoutingContext.sendToNextRelationOperator(currentOp, deltaRelation). 
The routing context object implements the routing part: it finds the next 
operator and invokes its process() method.

Method 2) has the following advantages for a random developer: i) control over 
whether to invoke the next operators comes back to the programmer as soon as 
process() completes; ii) after process() returns, the output of the current 
invocation of the operator can be retrieved immediately. I think that we can 
achieve the same via 1) by overriding the OperatorRoutingContext.sendToXXX() 
methods to just record the output, w/o invoking the next operator. Doing that, 
we can: i) give the random developer full control on return from 
operator.process(), without invoking the next operator; ii) provide the output 
of an operator via the routing context after process() returns. I will 
experiment more on this and post another review board request.
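
A rough sketch of the idea, under stated assumptions: only the method name 
sendToNextRelationOperator(currentOp, deltaRelation) is taken from the text 
above. The interface shapes, the use of String as a stand-in for a delta 
relation, and the classes ChainingRoutingContext, RecordingRoutingContext, 
PairingOperator and CollectingOperator are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shape: process() takes the routing context as an extra parameter.
interface RelationOperator {
    void process(String deltaRelation, OperatorRoutingContext context);
}

interface OperatorRoutingContext {
    void sendToNextRelationOperator(RelationOperator currentOp, String deltaRelation);
}

// Approach 1): the context looks up the next operator and invokes it.
final class ChainingRoutingContext implements OperatorRoutingContext {
    private final Map<RelationOperator, RelationOperator> nextOps = new HashMap<>();

    void connect(RelationOperator from, RelationOperator to) {
        nextOps.put(from, to);
    }

    @Override
    public void sendToNextRelationOperator(RelationOperator currentOp, String deltaRelation) {
        RelationOperator next = nextOps.get(currentOp);
        if (next != null) {
            next.process(deltaRelation, this);
        }
    }
}

// Override for a developer using a single operator: record the output
// without invoking any next operator, so it can be read after process().
final class RecordingRoutingContext implements OperatorRoutingContext {
    private final List<String> outputs = new ArrayList<>();

    @Override
    public void sendToNextRelationOperator(RelationOperator currentOp, String deltaRelation) {
        outputs.add(deltaRelation);
    }

    List<String> getOutputs() {
        return outputs;
    }
}

// The operator decides from its own state whether to emit: it buffers one
// input and only sends an output when the second one arrives.
final class PairingOperator implements RelationOperator {
    private String pending;

    @Override
    public void process(String deltaRelation, OperatorRoutingContext context) {
        if (pending == null) {
            pending = deltaRelation;
            return;
        }
        String out = pending + "," + deltaRelation;
        pending = null;
        context.sendToNextRelationOperator(this, out);
    }
}

// Terminal operator that just keeps what it receives.
final class CollectingOperator implements RelationOperator {
    final List<String> received = new ArrayList<>();

    @Override
    public void process(String deltaRelation, OperatorRoutingContext context) {
        received.add(deltaRelation);
    }
}
```

With RecordingRoutingContext, calling process("a", ctx) and then 
process("b", ctx) on a PairingOperator leaves one recorded output that the 
caller can read from ctx.getOutputs() right after the second call returns, 
and no downstream operator is ever invoked. With ChainingRoutingContext the 
same operator code forwards that output to whatever operator was connected 
after it.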

Thanks!


> Identify the set of operators for SQL on Samza
> ----------------------------------------------
>
>                 Key: SAMZA-482
>                 URL: https://issues.apache.org/jira/browse/SAMZA-482
>             Project: Samza
>          Issue Type: Sub-task
>            Reporter: Yi Pan (Data Infrastructure)
>            Priority: Minor
>              Labels: project
>
> This came out of a discussion between [~milinda], [~criccomini], and 
> [~nickpan47]. We think that it will be a good idea to separate the operators 
> layer from the high-level language layer, s.t. we can allow different 
> languages to be built on-top-of the same set of fundamental functions (i.e. 
> SQL-like or DSL).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
