[ 
https://issues.apache.org/jira/browse/SAMZA-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115710#comment-14115710
 ] 

Chris Riccomini commented on SAMZA-353:
---------------------------------------

bq. However, what happens when a few hours pass by and new data is written to 
ip-domain ? Do we again give ip-domain more priority ? Or do we continue to 
multiplex messages from these two streams ?

When new data is written, it is consumed at a higher priority than the other 
streams. The caveat is that the bootstrap fully blocks all other streams until 
the bootstrap stream has been processed up to "head". Once the bootstrap 
completes, this blocking stops, and the bootstrap stream becomes just a stream 
with a very high priority: it is favored whenever it has more data available, 
but interleaving with other streams is still possible.
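The two phases above (block everything until head, then plain high priority) can be sketched as a toy model. This is not Samza's actual MessageChooser API; the class and method names are illustrative, and "head" detection is modeled by an explicit callback.

```java
import java.util.*;

// Illustrative model of the described behavior: a bootstrap stream blocks
// all other streams until it reaches "head"; afterwards it is just a
// high-priority stream, so others can interleave when it has no data.
public class BootstrapChooserSketch {
    private final String bootstrapStream;
    private boolean caughtUp = false;
    // buffered envelopes per stream, in arrival order
    private final Map<String, Deque<String>> buffered = new LinkedHashMap<>();

    public BootstrapChooserSketch(String bootstrapStream) {
        this.bootstrapStream = bootstrapStream;
    }

    public void update(String stream, String envelope) {
        buffered.computeIfAbsent(stream, s -> new ArrayDeque<>()).add(envelope);
    }

    // Called when the bootstrap stream's consumer reports it reached head.
    public void markCaughtUp() {
        caughtUp = true;
    }

    public String choose() {
        Deque<String> boot = buffered.get(bootstrapStream);
        if (boot != null && !boot.isEmpty()) {
            return boot.poll();                // bootstrap data always wins
        }
        if (!caughtUp) {
            return null;                       // still bootstrapping: block everyone else
        }
        for (Deque<String> q : buffered.values()) {
            if (!q.isEmpty()) {
                return q.poll();               // after catch-up, others interleave
            }
        }
        return null;
    }
}
```

Before `markCaughtUp()`, `choose()` returns nothing unless the bootstrap stream has data; afterwards, other streams drain normally whenever the bootstrap stream is idle.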

bq. My opinion: We only give higher priority to the bootstrap streams on startup 
- after that we treat all the streams as equally important and live with the 
resulting staleness.

This feature doesn't exist yet, but we could probably add it in the 
DefaultChooser. Right now, it's just always high priority.

bq. For a given container - do we still read from all the partitions ? Or do we 
only read from the partition(s) assigned to that container ? It seems to me 
that from this design -> you should only read from the assigned partitions. Can 
you confirm ?

Yes, you're correct. Each StreamTask would only read the assigned partitions 
for the bootstrap stream. This works because the StreamTask then writes the 
message into a shared store, which gets replicated (via changelog) to all 
containers.
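That flow can be modeled as follows. This is a toy sketch, not the Samza StreamTask API: the changelog-replicated store that every container sees is stood in for by a single shared map, and all names are illustrative.

```java
import java.util.*;

// Toy model: each task consumes only its assigned partitions and writes
// what it reads into a store; changelog replication to every container
// is modeled by all tasks sharing one map.
public class SharedStoreSketch {
    // stands in for the changelog-replicated store visible everywhere
    static final Map<String, String> sharedStore = new HashMap<>();

    private final Set<Integer> assignedPartitions;

    SharedStoreSketch(Set<Integer> assignedPartitions) {
        this.assignedPartitions = assignedPartitions;
    }

    // process() only ever sees envelopes from this task's own partitions
    void process(int partition, String key, String value) {
        if (!assignedPartitions.contains(partition)) {
            throw new IllegalStateException("partition not assigned to this task");
        }
        sharedStore.put(key, value); // "replicated" to all containers
    }
}
```

Two tasks with disjoint partition assignments both end up contributing to the same store, which is exactly why no single task can tell when the store is fully bootstrapped (the point raised in the next question).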

bq. If we do indeed read from different partitions for ip-domain (and then use 
Kafka for making sure all the containers get all the data) what is the 
guarantee that all the containers have fully bootstrapped the global state 
store ? This new technique is more asynchronous since the writes and reads are 
separated by Kafka.

You raise a good point. Off the top of my head, I don't think we can provide 
such a guarantee. A StreamTask won't know when its shared state store has 
been fully bootstrapped. Perhaps we are muddling two different use cases here:

# Bootstrapping a global shared read-only cache.
# Having a shared state store that StreamTasks within a job can all use to do 
things like global counts.

These two seem different since (1) is read-only, and might require atomic data 
loads, while (2) is incremental, and read-write.

> Support assigning the same SSP to multiple tasknames
> ----------------------------------------------------
>
>                 Key: SAMZA-353
>                 URL: https://issues.apache.org/jira/browse/SAMZA-353
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Jakob Homan
>         Attachments: DESIGN-SAMZA-353-0.md, DESIGN-SAMZA-353-0.pdf
>
>
> Post SAMZA-123, it is possible to add the same SSP to multiple tasknames, 
> although currently we check for this and error out if this is done.  We 
> should think through the implications of having the same SSP appear in 
> multiple tasknames and support this if it makes sense.  
> This could be used as a broadcast stream that's either added by Samza itself 
> to each taskname, or individual groupers could do this as makes sense.  Right 
> now the container maintains a map of SSP to TaskInstance and delivers the ssp 
> to that task instance.  With this change, we'd need to change the map to SSP 
> to Set[TaskInstance] and deliver the message to each TI in the set.
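The routing change described in the issue (SSP to Set[TaskInstance] instead of SSP to TaskInstance) can be sketched like this. The types are simplified stand-ins, not Samza's container internals.

```java
import java.util.*;

// Sketch of the proposed change: route an SSP to a *set* of task
// instances rather than exactly one, so a stream can be broadcast.
public class BroadcastRouter {
    private final Map<String, Set<String>> sspToTasks = new HashMap<>();
    // records what each task instance received, for inspection
    final Map<String, List<String>> delivered = new HashMap<>();

    void assign(String ssp, String taskInstance) {
        sspToTasks.computeIfAbsent(ssp, s -> new HashSet<>()).add(taskInstance);
    }

    void deliver(String ssp, String message) {
        for (String task : sspToTasks.getOrDefault(ssp, Collections.emptySet())) {
            delivered.computeIfAbsent(task, t -> new ArrayList<>()).add(message);
        }
    }
}
```

With two task instances assigned the same SSP, one delivered message reaches both, which is the broadcast behavior the issue asks for.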



--
This message was sent by Atlassian JIRA
(v6.2#6252)
