Hi Klaus,

Thanks for your interest. I'm happy to answer any questions you've got.
1) In case of failure, will Samza restore the state automatically on another node?

Yes. If you are using YARN (job.factory.class=org.apache.samza.job.yarn.YarnJobFactory), then YARN will see that a Samza container has failed, and will restart the Samza container on a machine in the YARN grid. This might be the same node, or it might be a new one; YARN makes that decision based on the available resources in the grid. When the Samza container starts up, Samza will restore the container's state to where it was before the failure occurred. Once the state has been fully restored, the Samza container will start feeding your StreamTask new messages from the input streams.

2) If I want to scale out and increase the number of stream partitions, how is the local storage handled? Is it distributed by the framework as well, according to the partitions?

Samza's partitioning model is currently determined by the number of partitions that a Samza job's input streams have. A Samza job will always have the max of its input streams' partition counts. For example, if a Samza job has two input streams, A and B, and stream A has 4 partitions while stream B has 8, then the Samza job will have 8 partitions. The first four partitions (0-3) of the Samza job will receive messages from both stream A and stream B, and the last four partitions (4-7) will receive messages ONLY from stream B.

These partitions are physically run inside of Samza "containers". Samza containers are assigned the partitions that they are responsible for processing. For example, if you had two Samza containers, the first container would process four partitions, and the second container would process the other four. With YARN, the number of containers you have is defined by the config yarn.container.count. Right now, you can never have more containers than input stream partitions.

The state for each one of a Samza job's partitions is managed entirely independently. That is, each Samza partition has its own state store (LevelDB, if you're using samza-kv). So in the example above, if the StreamTask were using a single key-value store, there would be 8 LevelDB stores, one for each Samza partition. This allows us to move partitions between containers. If you were to decide you wanted 3 containers instead of 2, Samza would simply cease processing the partitions on the two existing containers, restore state on the third container, and then begin processing the partitions across all three containers. This model means that the maximum parallelism you can get is one partition per container, with up to as many containers as your input streams have partitions.

Samza does not support resizing an input stream's partition count right now. Once you start a Samza job, the partition counts for the input streams are assumed to be static. If you decide you need more parallelism, you need to start a new Samza job with a different job.name, and re-process all of the input data again.
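To make the state API concrete, here is a rough sketch of a stateful StreamTask. This is a minimal, illustrative example (the store name "my-store", the changelog stream, and the counting logic are placeholders, not from your use case). Every write to the store is also logged to a changelog stream, and that changelog is what Samza replays to restore state after a failure or when partitions move between containers:

    // Assumes samza-kv, with the store declared in the job config roughly
    // like this (names are illustrative):
    //
    //   stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
    //   stores.my-store.changelog=kafka.my-store-changelog
    //
    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class CounterTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // Each partition runs its own task instance with its own local store,
        // so there is one store (LevelDB) per partition, as described above.
        store = (KeyValueStore<String, Integer>) context.getStore("my-store");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Count messages per key. The cast assumes a string serde is
        // configured for the input stream's keys.
        String key = (String) envelope.getKey();
        Integer count = store.get(key);
        store.put(key, count == null ? 1 : count + 1);
      }
    }

Note that because each partition's store is independent, the counts above are per-partition; if you need a global count per key, the input stream has to be partitioned by that key.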
Cheers,
Chris

On 12/9/13 8:02 AM, "Klaus Schaefers" <[email protected]> wrote:

>Hi,
>
>I have been reading about Samza and I like the concept behind it a lot.
>In particular, the local key-value store is a good idea. However, I have
>some short questions regarding the local state that I couldn't answer
>while reading the web page. I would be very happy if someone could
>answer them shortly. Here they are:
>
>1) In case of failure, will Samza restore the state automatically on
>another node?
>
>2) If I want to scale out and increase the number of stream partitions,
>how is the local storage handled? Is it distributed by the framework as
>well, according to the partitions?
>
>Cheers,
>
>Klaus
>
>--
>
>Klaus Schaefers
>Senior Optimization Manager
>
>Ligatus GmbH
>Hohenstaufenring 30-32
>D-50674 Köln
>
>Tel.: +49 (0) 221 / 56939 -784
>Fax: +49 (0) 221 / 56 939 - 599
>E-Mail: [email protected]
>Web: www.ligatus.de
>
>HRB Köln 56003
>Geschäftsführung:
>Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
>Dipl.-Wirtschaftsingenieur Arne Wolter
