I’m looking at the samza-core source code… in particular RunLoop, TaskInstance, 
TaskInstanceCollector, and SystemProducers, and I’m having a hard time seeing 
where this batching of output message sends happens.  It seems from RunLoop 
that:

- one input message envelope is read from the consumer multiplexer
- it is sent to the task for processing
- if the task writes a new output message envelope to the collector, the 
TaskInstanceCollector immediately looks up the SystemProducer via 
SystemProducers and sends it onwards.

I don’t see any consultation of either producer.type or batch size.
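
To check my reading, here is a toy model (hypothetical names, not the real 
classes) of the path I think I’m tracing; each send goes straight through 
with no batching decision anywhere:

  // Toy Scala sketch of the send path as I read it (names are mine):
  object SendPathSketch extends App {
    trait SystemProducer { def send(source: String, msg: String): Unit }

    class PassThroughProducer extends SystemProducer {
      def send(source: String, msg: String): Unit =
        println(s"sent immediately: $msg")   // no producer.type or batch-size check
    }

    // Stands in for TaskInstanceCollector: forwards each envelope at once.
    class Collector(producer: SystemProducer) {
      def send(msg: String): Unit = producer.send("task-0", msg)
    }

    new Collector(new PassThroughProducer).send("output-envelope")
  }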

However, I am also new to Scala, so I’m likely missing something 
fairly obvious.

Richard


> On Feb 19, 2015, at 8:39 AM, Chris Riccomini <criccom...@apache.org> wrote:
>
> Hey Tom,
>
> It seems that most of your questions are concerned with durability and
> messaging guarantees. Samza is designed to not lose data, but duplicates
> can occur. Samza reads messages, and feeds them to your process() method.
> When you send messages, either via a changelog, or via collector.send,
> Samza will batch those messages up, and send them at some point BEFORE your
> input offsets are committed. This looks like:
>
> <start>, ... process and send a lot ..., <commit>
>
> Samza only guarantees that everything will be flushed to Kafka (or whatever
> output system you're sending to) *before* committing offsets. Once offsets
> are committed, you'll never see any prior messages again. If a failure
> occurs somewhere *before* the offsets are committed, you'll simply fall
> back to the last checkpointed offsets (<start>) and restart the processing
> again.
>
> In between, for performance reasons, Samza batches output, delays sends,
> etc. This is safe because we always flush before committing.
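>
> To make that concrete, here's a minimal sketch (hypothetical names, not
> the actual RunLoop code) of the at-least-once contract: everything pending
> is flushed before the input offset is ever committed:
>
>   // Scala sketch: flush-before-commit gives at-least-once delivery.
>   object AtLeastOnceSketch extends App {
>     var committedOffset = 0L
>     val pendingSends = scala.collection.mutable.Buffer[String]()
>
>     def process(offset: Long): Unit = pendingSends += s"output-for-$offset"
>     def flush(): Unit = pendingSends.clear()         // pretend: now durable in Kafka
>     def commit(offset: Long): Unit = { flush(); committedOffset = offset }
>
>     (1L to 10L).foreach { offset =>
>       process(offset)                       // <start> ... process and send ...
>       if (offset % 5 == 0) commit(offset)   // <commit>: flush always happens first
>     }
>     // After a crash we resume from committedOffset: duplicates possible, no loss.
>     println(s"restart would resume after offset $committedOffset")
>   }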
>
>> a) If using the RocksDB kv implementation, is there a way to guarantee that a
> put is committed (at least on that instance's disk)? I notice that the RocksDB
> implementation does nothing for kv.flush().
>
> The RocksDB store in Samza is basically used as a durable cache. The only
> guarantee that Samza really cares about is whether it can get the data
> after it's been put (whether the data is still in memory, or on disk). The
> guarantee you, as a user, probably care about is whether your write has
> been sent to your changelog.
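>
> For example, a store config along these lines (the store and stream names
> here are hypothetical) gives you that changelog-backed durability:
>
>   # Back the RocksDB store with a Kafka changelog stream:
>   stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
>   stores.my-store.changelog=kafka.my-store-changelog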
>
>> b) When is it guaranteed that the kv put is in the changelog (I am using
> the Kafka implementation)?
>
> It will be guaranteed to be written to the changelog when commit() is
> called, before your offsets are committed. The exact order of commit is:
> flush storage changelogs, flush producers, commit offsets. You can see this
> in RunLoop.scala. This guarantees that your changelogs will be fully
> flushed to Kafka before you commit your offsets. If a failure occurs before
> the offset commit, you'd see duplicate messages, but you'd never lose
> messages.
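>
> In sketch form (hypothetical names; see RunLoop.scala and TaskInstance for
> the real code), the ordering is:
>
>   // Scala sketch of the commit ordering RunLoop enforces:
>   object CommitOrderSketch extends App {
>     def flushChangelogs(): Unit = println("1. flush storage changelogs to Kafka")
>     def flushProducers(): Unit  = println("2. flush output producers")
>     def commitOffsets(): Unit   = println("3. commit input offsets")
>
>     // A crash before step 3 replays from the old offsets: duplicates, no loss.
>     flushChangelogs(); flushProducers(); commitOffsets()
>   }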
>
>> When using messageCollector.send and 
>> systems.kafka.producer.producer.type=sync
> does that guarantee that the message is in the Kafka log when the send returns?
>
> Not quite. Samza batches messages to increase throughput. 'sync' tells
> Samza to block when a *batch* of messages is being sent. If you wanted to
> synchronously write each message, and block, you'd have to set the batch
> size to 1.
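>
> For example, something along these lines (these are the 0.8-era Kafka
> producer configs, passed through Samza's systems.<system>.producer.*
> prefix; expect a throughput hit):
>
>   # One message per batch, so each send blocks until Kafka has it:
>   systems.kafka.producer.producer.type=sync
>   systems.kafka.producer.batch.num.messages=1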
>
>> If my Samza job fails while processing a message, I fix it and deploy
> again, will the message offset still point to a value <= the message I
> failed on.
>
> Yes. It should never be higher until commit() is called (after
> process()). The guarantee Samza provides is that you might see duplicates,
> but you'll not lose data.
>
> Cheers,
> Chris
>
> On Thu, Feb 19, 2015 at 8:23 AM, Tom Dearman <tom.dear...@gmail.com> wrote:
>
>> Hi,
>> Can someone help with the following questions please:
>>
>> a) If using the RocksDB kv implementation, is there a way to guarantee that a
>> put is committed (at least on that instance's disk)? I notice that the RocksDB
>> implementation does nothing for kv.flush().
>>
>> b) When is it guaranteed that the kv put is in the changelog (I am using
>> the Kafka implementation)?
>>
>> c) When using messageCollector.send and
>> systems.kafka.producer.producer.type=sync does that guarantee that the
>> message is in the Kafka log when the send returns?  I am new to Kafka, but it
>> seems to me that if you have type=sync set, you still need to wait for the
>> futures' get() to return.  Is this what Samza does?
>>
>> d) If my Samza job fails while processing a message, I fix it and deploy
>> again, will the message offset still point to a value <= the message I
>> failed on?  I.e. I understand it can be earlier, but is it possible the
>> offset will now point one higher?

