[
https://issues.apache.org/jira/browse/SAMZA-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aditya reassigned SAMZA-1613:
-----------------------------
Assignee: Ahmed Abdul Hamid
> Stream-table join: Intermediate streams do not inherit bootstrap stream
> semantic
> --------------------------------------------------------------------------------
>
> Key: SAMZA-1613
> URL: https://issues.apache.org/jira/browse/SAMZA-1613
> Project: Samza
> Issue Type: Bug
> Reporter: Aditya
> Assignee: Ahmed Abdul Hamid
> Priority: Major
>
> There are two issues with repartitioning a bootstrap stream:
> * Samza does not propagate the bootstrap semantics to the intermediate
> stream.
> * For a bootstrap stream, Samza gets the newest offset for that stream
> before consumption and considers the bootstrap to be complete once the
> message with that offset is consumed. This is clearly a problem for
> intermediate streams as there might not be any messages at the beginning of
> the job.
> Even though the first stage abides by the bootstrap semantics and does not
> consume from non-bootstrap streams until all the bootstrap stream partitions
> owned by that container are re-partitioned, there will be scenarios where one
> container finishes bootstrap faster than others and also the rate of
> consumption from the intermediate stream in the second stage might be slower.
> All these would result in the job breaking the bootstrap semantics across the
> two stages. Please note that this issue is not just specific to
> repartitioning but could happen with any intermediate streams.
> To support this, we will have to solve both the afore-mentioned issues:
> * Propagate bootstrap semantics to the intermediate streams if the
> repartitioned stream is bootstrap stream.
> * Either add end of bootstrap control message (note that all the containers
> have to co-ordinate here) or wait for event time support in Samza.
> This is clearly a problem for Stream-Table join. Consider the scenario where
> the load on the stream is much higher than the load on the stream marked as
> table. Consequently, the number of partitions in the stream will be higher
> than that of the table. To join these two, we will have to increase the
> number of partitions in the table by repartitioning, which is where we end up
> in the premature join of stream with the table, before seeding the table.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)