Hi Klaus,

Thanks for your interest. I'm happy to answer any questions you've got.
1) In case of failure, will Samza restore the state automatically on another node?

Yes. If you are using YARN (job.factory.class=org.apache.samza.job.yarn.YarnJobFactory), then YARN will see that a Samza container has failed, and will restart the Samza container on a machine in the YARN grid. This might be the same node, or it might be a new one; YARN makes that decision based on the available resources in the grid. When the Samza container starts up, Samza will restore the container's state to where it was before the failure occurred. Once the state has been fully restored, the Samza container will start feeding your StreamTask new messages from the input streams.

2) If I want to scale out and increase the number of stream partitions, how is the local storage handled? Is it distributed by the framework as well, according to the partitions?

Samza's partitioning model is currently determined by the number of partitions that a Samza job's input streams have. A Samza job will always have the max of its input streams' partition counts. For example, if a Samza job has two input streams, A and B, and stream A has 4 partitions while stream B has 8, then the Samza job will have 8 partitions. The first four partitions (0-3) of the Samza job will receive messages from both stream A and stream B, and the last four partitions (4-7) will receive messages ONLY from stream B.

These partitions are physically run inside of Samza "containers". Samza containers are assigned the partitions that they are responsible for processing. For example, if you had two Samza containers, the first container would process four partitions, and the second container would process the other four. With YARN, the number of containers you have is defined by the config yarn.container.count. Right now, you can never have more containers than input stream partitions.

The state for each one of a Samza job's partitions is managed entirely independently. That is, each Samza partition has its own state store (LevelDB, if you're using samza-kv). So in the example above, if the StreamTask were using a single key-value store, there would be 8 LevelDB stores, one for each Samza partition. This allows us to move partitions between containers. If you were to decide you wanted 3 containers instead of 2, Samza would simply cease processing the partitions on the two existing containers, restore state on the third container, and then begin processing the partitions across all three containers. This model means that the maximum parallelism you can get is one partition per container, with up to as many containers as your input streams have partitions.

Samza does not support resizing an input stream's partition count right now. Once you start a Samza job, the partition counts for the input streams are assumed to be static. If you decide you need more parallelism, you need to start a new Samza job with a different job.name, and re-process all of the input data again.
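To make the state API concrete, here is a rough sketch of a stateful StreamTask. This is a minimal, illustrative example (the store name "my-store", the changelog stream, and the counting logic are placeholders, not from your use case). Every write to the store is also logged to a changelog stream, and that changelog is what Samza replays to restore state after a failure or when partitions move between containers:

    // Assumes samza-kv, with the store declared in the job config roughly
    // like this (names are illustrative):
    //
    //   stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
    //   stores.my-store.changelog=kafka.my-store-changelog
    //
    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class CounterTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // Each partition runs its own task instance with its own local store,
        // so there is one store (LevelDB) per partition, as described above.
        store = (KeyValueStore<String, Integer>) context.getStore("my-store");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Count messages per key. The cast assumes a string serde is
        // configured for the input stream's keys.
        String key = (String) envelope.getKey();
        Integer count = store.get(key);
        store.put(key, count == null ? 1 : count + 1);
      }
    }

Note that because each partition's store is independent, the counts above are per-partition; if you need a global count per key, the input stream has to be partitioned by that key.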
Cheers,
Chris

On 12/9/13 8:02 AM, "Klaus Schaefers" <[email protected]> wrote:

>Hi,
>
>I have been reading about Samza and I like the concept behind it a lot.
>In particular, the local key-value store is a good idea. However, I have
>some short questions regarding the local state that I couldn't answer
>while reading the web page. I would be very happy if someone could
>answer them shortly. Here they are:
>
>1) In case of failure, will Samza restore the state automatically on
>another node?
>
>2) If I want to scale out and increase the number of stream partitions,
>how is the local storage handled? Is it distributed by the framework as
>well, according to the partitions?
>
>Cheers,
>
>Klaus
>
>--
>
>Klaus Schaefers
>Senior Optimization Manager
>
>Ligatus GmbH
>Hohenstaufenring 30-32
>D-50674 Köln
>
>Tel.: +49 (0) 221 / 56939 -784
>Fax: +49 (0) 221 / 56 939 - 599
>E-Mail: [email protected]
>Web: www.ligatus.de
>
>HRB Köln 56003
>Geschäftsführung:
>Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
>Dipl.-Wirtschaftsingenieur Arne Wolter
