[ 
https://issues.apache.org/jira/browse/SAMZA-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115682#comment-14115682
 ] 

Chinmay Soman commented on SAMZA-353:
-------------------------------------

Essentially, we are building a distributed read-only key-value store on top of 
Kafka? Seems very useful.

I do have a couple of questions, though:
1) Priority of the bootstrap stream?
In the case of ip-domain: marking it as a 'bootstrap=True' stream works when 
the container is starting up. In this phase, the MessageChooser will simply 
prioritize 'ip-domain' messages over those in 'page-views'. Neat! 
However, what happens when a few hours pass and new data is written to 
ip-domain? Do we again give ip-domain higher priority? Or do we continue to 
multiplex messages from the two streams?

Pros of always giving the bootstrap stream higher priority: we are always 
guaranteed to have the latest data in the global state store.
Cons: this essentially brings the container to a halt until the bootstrap 
is done.

My opinion: we only give higher priority to the bootstrap streams on startup. 
After that, we treat all the streams as equally important and live with the 
resulting staleness.
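
To make the "priority only on startup" option concrete, here is a rough sketch 
in Java. It is not Samza's MessageChooser interface: the class 
StartupBootstrapChooser, its methods, and the use of plain strings for streams 
and messages are all hypothetical, and how the caller decides that a bootstrap 
stream has caught up is left as an assumption.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch, NOT Samza's MessageChooser API.
// Bootstrap streams win only until they are marked caught up;
// afterwards every stream is served round-robin.
class StartupBootstrapChooser {
    private final Map<String, Queue<String>> buffers = new LinkedHashMap<>();
    private final Set<String> pendingBootstrap = new HashSet<>();
    private final ArrayDeque<String> rotation = new ArrayDeque<>();

    StartupBootstrapChooser(Set<String> streams, Set<String> bootstrapStreams) {
        if (!streams.containsAll(bootstrapStreams)) {
            throw new IllegalArgumentException("bootstrap streams must be a subset of streams");
        }
        for (String s : streams) {
            buffers.put(s, new ArrayDeque<>());
            rotation.add(s);
        }
        pendingBootstrap.addAll(bootstrapStreams);
    }

    // Buffer an incoming message for a known stream.
    void offer(String stream, String message) {
        buffers.get(stream).add(message);
    }

    // Assumption: the caller knows when a bootstrap stream has reached the
    // offset that was its head at container startup, and reports it here.
    void markCaughtUp(String stream) {
        pendingBootstrap.remove(stream);
    }

    // Returns the next message to process, or null if none may be served yet.
    String choose() {
        if (!pendingBootstrap.isEmpty()) {
            // Startup phase: serve only bootstrap streams, hold back the rest.
            for (String s : pendingBootstrap) {
                String m = buffers.get(s).poll();
                if (m != null) {
                    return m;
                }
            }
            return null;
        }
        // Steady state: plain round-robin over all streams, bootstrap included.
        for (int i = 0; i < rotation.size(); i++) {
            String s = rotation.poll();
            rotation.add(s);
            String m = buffers.get(s).poll();
            if (m != null) {
                return m;
            }
        }
        return null;
    }
}
```

With 'ip-domain' registered as a bootstrap stream, choose() returns only 
ip-domain messages (or null) until markCaughtUp("ip-domain") is called. After 
that, later ip-domain updates are interleaved with page-views, which is where 
the staleness mentioned above comes from.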

2) Reading the bootstrap stream (ip-domain) during startup?
For a given container, do we still read from all the partitions, or do we 
only read from the partition(s) assigned to that container? It seems to me 
that, under this design, you should only read from the assigned partitions. 
Can you confirm?

If we do indeed read from different partitions for ip-domain (and then use 
Kafka to make sure all the containers get all the data), what guarantees 
that all the containers have fully bootstrapped the global state store? 
This new technique is more asynchronous, since the writes and reads are 
separated by Kafka.

Happy to talk in person if I'm not making any sense :)

> Support assigning the same SSP to multiple tasknames
> ----------------------------------------------------
>
>                 Key: SAMZA-353
>                 URL: https://issues.apache.org/jira/browse/SAMZA-353
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Jakob Homan
>         Attachments: DESIGN-SAMZA-353-0.md, DESIGN-SAMZA-353-0.pdf
>
>
> Post SAMZA-123, it is possible to add the same SSP to multiple tasknames, 
> although currently we check for this and error out if this is done.  We 
> should think through the implications of having the same SSP appear in 
> multiple tasknames and support this if it makes sense.  
> This could be used as a broadcast stream that's either added by Samza itself 
> to each taskname, or individual groupers could do this as makes sense.  Right 
> now the container maintains a map of SSP to TaskInstance and delivers the ssp 
> to that task instance.  With this change, we'd need to change the map to SSP 
> to Set[TaskInstance] and deliver the message to each TI in the set.
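
The map change described in the issue text above (SSP to Set[TaskInstance] 
instead of SSP to TaskInstance) could look roughly like the Java sketch below. 
BroadcastDispatcher and its nested Task interface are hypothetical stand-ins, 
not the container's real classes, and SSPs and messages are plain strings here 
for brevity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the SSP -> set-of-tasks routing; the names here
// are illustrative and do not come from the Samza code base.
class BroadcastDispatcher {
    // Stand-in for a task instance that can process a message.
    interface Task {
        void process(String ssp, String message);
    }

    // One SSP may now map to several tasks instead of exactly one.
    private final Map<String, Set<Task>> tasksBySsp = new HashMap<>();

    // Registering the same SSP for a second task is allowed, not an error.
    void register(String ssp, Task task) {
        tasksBySsp.computeIfAbsent(ssp, k -> new LinkedHashSet<>()).add(task);
    }

    // Deliver the message to every task registered for the SSP and
    // return how many tasks received it.
    int deliver(String ssp, String message) {
        Set<Task> tasks = tasksBySsp.getOrDefault(ssp, Collections.emptySet());
        for (Task t : tasks) {
            t.process(ssp, message);
        }
        return tasks.size();
    }
}
```

Registering two tasks for the same SSP and calling deliver once hands the 
message to both, which is the broadcast behaviour the issue asks about.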



--
This message was sent by Atlassian JIRA
(v6.2#6252)
