Aditya created SAMZA-1613:
Summary: Stream-table join: Intermediate streams do not inherit
bootstrap stream semantic
Issue Type: Bug
There are two issues with repartitioning a bootstrap stream:
* Samza does not propagate the bootstrap semantics to the intermediate stream.
* For a bootstrap stream, Samza gets the newest offset for that stream before
consumption and considers the bootstrap to be complete once the message with
that offset is consumed. This is clearly a problem for intermediate streams as
there might not be any messages at the beginning of the job.
Even though the first stage abides by the bootstrap semantics and does not
consume from non-bootstrap streams until all the bootstrap stream partitions
owned by that container are re-partitioned, there will be scenarios where one
container finishes bootstrap faster than others and also the rate of
consumption from the intermediate stream in the second stage might be slower.
All these would result in the job breaking the bootstrap semantics across the
two stages. Please note that this issue is not just specific to repartitioning
but could happen with any intermediate streams.
To support this, we will have to solve both the afore-mentioned issues:
* Propagate bootstrap semantics to the intermediate streams if the
repartitioned stream is bootstrap stream.
* Either add end of bootstrap control message (note that all the containers
have to co-ordinate here) or wait for event time support in Samza.
This is clearly a problem for Stream-Table join. Consider the scenario where
the load on the stream is much higher than the load on the stream marked as
table. Consequently, the number of partitions in the stream will be higher than
that of the table. To join these two, we will have to increase the number of
partitions in the table by repartitioning, which is where we end up in the
premature join of stream with the table, before seeding the table.
This message was sent by Atlassian JIRA