dnishimura commented on a change in pull request #1179: SAMZA-2334: SSPGrouperProxy should check if job is stateful, not if host-affinity is enabled
URL: https://github.com/apache/samza/pull/1179#discussion_r333105398
 
 

 ##########
 File path: docs/learn/documentation/versioned/jobs/samza-configurations.md
 ##########
 @@ -77,9 +77,10 @@ These are the basic properties for setting up a Samza 
application.
 |job.coordinator.replication.<br>factor|300000|The frequency at which the 
input streams' partition count change should be detected. When the input 
partition count change is detected, Samza will automatically restart a 
stateless job or fail a stateful job. A longer time interval is recommended for 
jobs w/ large number of input system stream partitions, since gathering 
partition count may incur measurable overhead to the job. You can completely 
disable partition count monitoring by setting this value to 0 or a negative 
integer, which will also disable auto-restart/failing behavior of a Samza job 
on partition count changes.|
 
|job.systemstreampartition.<br>grouper.factory|`org.apache.samza.`<br>`container.grouper.stream.`<br>`GroupByPartitionFactory`|A
 factory class that is used to determine how input SystemStreamPartitions are 
grouped together for processing in individual StreamTask instances. The factory 
must implement the SystemStreamPartitionGrouperFactory interface. Once this 
configuration is set, it can't be changed, since doing so could violate state 
semantics, and lead to a loss of 
data.<br><br>`org.apache.samza.container.grouper.stream.`<br>`GroupByPartitionFactory`<br>Groups
 input stream partitions according to their partition number. This grouping 
leads to a single StreamTask processing all messages for a single partition 
(e.g. partition 0) across all input streams that have a partition 0. Therefore, 
the default is that you get one StreamTask for all input partitions with the 
same partition number. Using this strategy, if two input streams have a 
partition 0, then messages from both partitions will be routed to a single 
StreamTask. This partitioning strategy is useful for joining and aggregating 
streams.<br><br>`org.apache.samza.container.grouper.stream.`<br>`GroupBySystemStreamPartitionFactory`<br>Assigns
 each SystemStreamPartition to its own unique StreamTask. The 
GroupBySystemStreamPartitionFactory is useful in cases where you want increased 
parallelism (more containers), and don't care about co-locating partitions for 
grouping or joins, since it allows for a greater number of StreamTasks to be 
divided up amongst Samza containers.|
 |job.systemstreampartition.<br>matcher.class| |If you want to enable static 
partition assignment, then this is a required configuration. The value of this 
property is a fully-qualified Java class name that implements the interface 
org.apache.samza.system.SystemStreamPartitionMatcher. Samza ships with two 
matcher 
classes:<br><br>`org.apache.samza.system.RangeSystemStreamPartitionMatcher`<br>This
 classes uses a comma separated list of range(s) to determine which partition 
matches, and thus statically assigned to the Job. For example "2,3,1-2", 
statically assigns partition 1, 2, and 3 for all the specified system and 
streams (topics in case of Kafka) to the job. For config validation each 
element in the comma separated list much conform to one of the following 
regex:<br>`(\\d+)`" or"`(\\d+-\\d+)`"<br>`JobConfig.SSP_MATCHER_CLASS_RANGE` 
constant has the canonical name of this 
class.<br><br>`org.apache.samza.system.RegexSystemStreamPartitionMatcher`<br>This
 classes uses a standard Java supported regex to determine which partition 
matches, and thus statically assigned to the Job. For example "[1-2]", 
statically assigns partition 1 and 2 for all the specified system and streams 
(topics in case of Kafka) to the job. JobConfig.SSP_MATCHER_CLASS_REGEX 
constant has the canonical name of this class.|
-|job.systemstreampartition.<br>matcher.config.<br>range| |If 
`job.systemstreampartition.matcher.class` is specified, and the value of this 
property is `org.apache.samza.system.RangeSystemStreamPartitionMatcher`, then 
this property is a required configuration. Specify a comma separated list of 
range(s) to determine which partition matches, and thus statically assigned to 
the Job. For example "2,3,11-20", statically assigns partition 2, 3, and 11 to 
20 for all the specified system and streams (topics in case of Kafka) to the 
job. A singel configuration value like "19" is valid as well. This statically 
assigns partition 19. For config validation each element in the comma separated 
list much conform to one of the following regex:<br>"`(\\d+)`" or 
"`(\\d+-\\d+)`"|
+|job.systemstreampartition.<br>matcher.config.<br>range| |If 
`job.systemstreampartition.matcher.class` is specified, and the value of this 
property is `org.apache.samza.system.RangeSystemStreamPartitionMatcher`, then 
this property is a required configuration. Specify a comma separated list of 
range(s) to determine which partition matches, and thus statically assigned to 
the Job. For example "2,3,11-20", statically assigns partition 2, 3, and 11 to 
20 for all the specified system and streams (topics in case of Kafka) to the 
job. A single configuration value like "19" is valid as well. This statically 
assigns partition 19. For config validation each element in the comma separated 
list much conform to one of the following regex:<br>"`(\\d+)`" or 
"`(\\d+-\\d+)`"|
 |job.systemstreampartition.<br>matcher.config.<br>regex| |If 
`job.systemstreampartition.matcher.class` is specified, and the value of this 
property is `org.apache.samza.system.RegexSystemStreamPartitionMatcher`, then 
this property is a required configuration. The value should be a valid Java 
supported regex. For example "[1-2]", statically assigns partition 1 and 2 for 
all the specified system and streams (topics in case of Kakfa) to the job.|
 |job.systemstreampartition.<br>matcher.config.<br>job.factory.regex| |This 
configuration can be used to specify the Java supported regex to match the 
StreamJobFactory for which the static partition assignment should be enabled. 
This configuration enables the partition assignment feature to be used for 
custom StreamJobFactory(ies) as well.<br>This config defaults to the following 
value: "_org\\\\.apache\\\\.samza\\\\.job\\\\.local(.*ProcessJobFactory &#124; 
.*ThreadJobFactory)_", which enables static partition assignment when 
job.factory.class is set to `org.apache.samza.job.local.ProcessJobFactory` or 
`org.apache.samza.job.local.ThreadJobFactory`.|
+|job.systemstreampartition.<br>grouper.proxy.enabled|true|When enabled, this 
provides a proxy layer on top of the grouper configured by 
`job.systemstreampartition.grouper.factory` that allows stateful jobs to expand 
or contract their partition count by a multiple of the previous count so that 
events from an input stream partition are not stored in the local state of a 
different task. This will prevent erroneous results. If this configuration is 
disabled or if the job is stateless, the proxy will call the grouper directly 
and skip the expansion or contraction logic. See 
[SEP-5](https://cwiki.apache.org/confluence/display/SAMZA/SEP-5%3A+Enable+partition+expansion+of+input+streams)
 for more details.|
 
 Review comment:
   Sure, will change.
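
   For readers skimming the hunk above, here is a minimal sketch of how the static partition assignment rows might be combined in a job's properties file. The keys and class names are copied from the quoted table rows. The specific values (the `2,3,11-20` range and the choice of `ThreadJobFactory`) are only illustrative assumptions, and the `grouper.proxy.enabled` key is shown as it appears in this revision of the diff, which may change before merge.

   ```properties
   # Illustrative fragment only; keys are taken from the table rows quoted in the diff above.
   # The default job.factory.regex described above covers the local Process/Thread job factories.
   job.factory.class=org.apache.samza.job.local.ThreadJobFactory

   # Enable static partition assignment with the range matcher.
   job.systemstreampartition.matcher.class=org.apache.samza.system.RangeSystemStreamPartitionMatcher

   # Each comma-separated element must match (\d+) or (\d+-\d+).
   # This value statically assigns partitions 2, 3, and 11 through 20.
   job.systemstreampartition.matcher.config.range=2,3,11-20

   # Row added in this revision of the PR (documented default: true); name may still change.
   job.systemstreampartition.grouper.proxy.enabled=true
   ```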

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
