Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Samza Wiki" for change 
notification.

The "Pluggable MessageChooser" page has been changed by ChrisRiccomini:
https://wiki.apache.org/samza/Pluggable%20MessageChooser?action=diff&rev1=9&rev2=10

   * Choose the message with the smallest offset.
   * Choose the message from the stream with the most unprocessed messages.
   * Choose the message from the stream that is the farthest % behind head.
+  * Choose the message that keeps streams most aligned with their alignment at 
start time.
  
  ==== Offsets ====
  
@@ -176, +177 @@

  
  ==== Messages behind head ====
  
- I need to think about this a bit more.
+ This approach would require changing the API to allow the !MessageChooser to 
know how far behind a message was from head. 
+ 
+ The behavior of this strategy is not desirable in cases where we have one 
very high throughput stream, and one very low throughput stream. In such a 
case, the low throughput stream might be 100 messages behind, which might be 
equivalent to two days in wall-clock time. The high throughput stream might be 
1000 messages behind, which might be equivalent to 10 seconds behind, in 
wall-clock time. Using this strategy, the high throughput stream's message 
would be picked. This is counter intuitive, since we're trying to find a proxy 
for time alignment, and this approach is actually not aligning by time in this 
scenario.
  
  ==== Percent behind head ====
  
- I need to think about this a bit more.
+ This approach would require changing the API to allow the !MessageChooser to 
know (or derive) what percentage behind head a message is.
+ 
+ In cases where you have two input created at dramatically different points in 
time (for example a year ago, and a day ago), the percentage behind head is a 
misleading measurement. 50% behind a stream that was created a year ago means 
you have half a year's worth of messages to process. 90% behind a stream that 
was created a day ago means you have 21 hours of messages to process. In this 
scenario, this strategy would pick messages from the stream created yesterday, 
even though it's actually much closer to "now" in wall-clock time. This, again, 
is counter intuitive, since our goal is to find a proxy for time-alginment.
+ 
+ ==== Maintain starting alignment ====
+ 
+ This approach would require changing the API to allow the !MessageChooser to 
know how far a message was from head.
+ 
+ It appears that we can't come up with a good general-purpose proxy for time 
alignment. In the absence of such a strategy, the next best thing to do seems 
to be just to guarantee maintaining the alignment of the streams that existed 
before the Samza job started. 
+ 
+ For example, take the case where there are two input streams at job-start 
time: one that's 100 messages behind, and the other that's 1000 messages 
behind. There are three possible states for the streams to be in with this 
example: both streams are behind their starting alignments (> 100 messages and 
> 1000 messages behind, respectively), one stream is behind and the other is 
ahead of its starting alignment (> 100 messages behind and < 1000 messages 
behind), or both streams are ahead of their starting alignment. The strategy 
for the !MessageChooser then becomes:
+ 
+  1. All behind: pick a message from the stream that is farthest behind its 
original alignment (in terms of number of messages).
+  2. Partially behind: pick a message from the stream that is farthest behind 
its original alignment (in terms of number of messages).
+  3. All ahead: pick a message from the stream that is closest to its original 
alignment (in terms of number of messages).
+ 
+ Choosing messages when all streams are ahead of their original alignments (3) 
is an interesting case. Taking our original example, if the 
100-message-behind-stream is now 90 messages behind head, then it is 10 
messages from its original alignment. If the 1000-messages-behind-stream is now 
900 messages behind head, then it is 100 messages from its original alignment. 
The !MessageChooser would pick the stream that is 10 messages ahead of its 
original alignment, because it is deemed to be "closest" to its original 
alignment.
+ 
+ I'm going to declare this strategy as out of scope, but open up a JIRA to 
track it.
  
  === DefaultChooser ===
  

Reply via email to