[ 
https://issues.apache.org/jira/browse/SAMZA-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963169#comment-13963169
 ] 

Chris Riccomini commented on SAMZA-226:
---------------------------------------

bq. AFAIK we're not creating any Kafka topics explicitly in hello-samza. Doing 
so would actually be a bit tricky, because of an issue mentioned in this 
comment on SAMZA-152: ZK and Kafka broker need to be up before topics can be 
created, but a script can't tell whether that is the case. See also this email.

So it sounds like we're just getting lucky: both the state job in hello-samza 
and its changelog topic end up with the default partition count. I'm guessing 
we're not turning on log compaction, either. This means that the changelog 
topic in hello-samza has its oldest log segments deleted under time-based 
retention, which could lead to data loss: a key that was written long ago 
would be dropped along with its segment and never re-added. That said, for 
the demo, it's OK for now. It will get fixed as part of this ticket.
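
For reference, the topic-level settings that the ticket description asks for 
could be sketched roughly as below. This is only an illustration, not code 
from Samza or hello-samza: the class and method names are made up, and "100mb" 
is read as 100 * 1024 * 1024 bytes, which is an assumption. The only Kafka 
names used are the topic-level config keys cleanup.policy and segment.bytes.

```java
import java.util.Properties;

// Hypothetical helper; not part of Samza or Kafka.
class ChangelogTopicConfigSketch {
    // Assumption: "100mb" in the ticket means 100 * 1024 * 1024 bytes.
    static final long SEGMENT_BYTES = 100L * 1024 * 1024;

    // Builds the per-topic overrides a compacted changelog topic would need.
    static Properties changelogTopicConfig() {
        Properties props = new Properties();
        // Compact by key instead of deleting old segments by age.
        props.setProperty("cleanup.policy", "compact");
        // Roll segments sooner so the uncompacted tail stays small.
        props.setProperty("segment.bytes", Long.toString(SEGMENT_BYTES));
        return props;
    }

    public static void main(String[] args) {
        Properties props = changelogTopicConfig();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

How these overrides get passed to the broker at topic-creation time is 
deliberately left out, since that is the part this ticket still has to decide.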

bq. I think it would be good to add stream creation to the SystemAdmin 
interface. I prefer SystemAdmin.createChangelogStream over leaking 
Kafka-specific configuration into samza-core. The state changelog is a 
fundamental concept in Samza (whereas Kafka is supposed to be completely 
pluggable), so I don't think it's a problem to have a method like 
SystemAdmin.createChangelogStream in the Samza API.

I'm leaning this way, as well. Let the individual system decide what the 
appropriate configurations are for a changelog stream.
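
To make the idea concrete, here is a minimal sketch of what such a method 
might look like. Everything in it is hypothetical: the interface name, the 
method signature, and the Kafka-flavored implementation are assumptions made 
for illustration, not the actual SystemAdmin API. The implementation only 
records what it would create and does not talk to a broker.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed SystemAdmin method; not the real Samza API.
interface ChangelogStreamAdmin {
    void createChangelogStream(String streamName, int numPartitions);
}

// What a system decided to create for one changelog stream (sketch only).
class TopicSpec {
    final int partitions;
    final Map<String, String> config;

    TopicSpec(int partitions, Map<String, String> config) {
        this.partitions = partitions;
        this.config = config;
    }
}

// Hypothetical Kafka-flavored admin: the Kafka-specific settings live here,
// so the caller never has to know about them.
class KafkaLikeChangelogAdmin implements ChangelogStreamAdmin {
    final Map<String, TopicSpec> created = new LinkedHashMap<>();

    @Override
    public void createChangelogStream(String streamName, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        Map<String, String> config = new HashMap<>();
        config.put("cleanup.policy", "compact");
        // Assumption: 100 MB read as 100 * 1024 * 1024 bytes.
        config.put("segment.bytes", Long.toString(100L * 1024 * 1024));
        // A real implementation would create the topic here; this one only records it.
        created.put(streamName, new TopicSpec(numPartitions, config));
    }
}

class ChangelogAdminSketch {
    public static void main(String[] args) {
        KafkaLikeChangelogAdmin kafkaAdmin = new KafkaLikeChangelogAdmin();
        // Generic code only sees the system-neutral interface.
        ChangelogStreamAdmin admin = kafkaAdmin;
        admin.createChangelogStream("store-changelog", 4); // hypothetical stream name
        TopicSpec spec = kafkaAdmin.created.get("store-changelog");
        System.out.println(spec.partitions + " partitions, config " + spec.config);
    }
}
```

The point of the shape is that the caller passes only what is generic (a 
stream name and a partition count), and each system fills in its own defaults.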

> Auto-create changelog streams for kv
> ------------------------------------
>
>                 Key: SAMZA-226
>                 URL: https://issues.apache.org/jira/browse/SAMZA-226
>             Project: Samza
>          Issue Type: Bug
>          Components: container, kv
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> Currently, changelog topics are not auto-created. This is a frustrating user 
> experience, and there are a few useful defaults that should be set that are 
> not obvious when creating Kafka topics with log compaction enabled.
> We should have Samza auto-create changelog streams for the kv stores that 
> have changelogs enabled.
> In Kafka's case, the changelog topics should be created with compaction 
> enabled. They should also be created with a smaller (100 MB) default 
> [segment.bytes|http://kafka.apache.org/documentation.html#configuration] 
> setting. The smaller segment.bytes setting is useful for low-volume 
> changelogs. The problem we've seen in the past is that the default 
> log.segment.bytes is 1 GB. Kafka's compaction implementation NEVER touches 
> the most recent log segment. This means that, if you have a very small state 
> store, but execute a lot of deletes/updates (e.g. you've only got maybe 25 
> MB of active state, but are deleting and updating it frequently), you can 
> end up with as much as 1 GB of state to restore (since the most recent 
> segment always contains non-compacted writes, and can grow to the full 
> segment size before it rolls). This is silly since your active (compacted) 
> state is really only ~25 MB. Shrinking the segment bytes means that you'll 
> have a smaller maximum data size to restore. The trade-off here is that 
> we'll have more segment files for changelogs, which will increase the number 
> of open file handles.
> The trick is doing this in a generic way, since we are supporting changelogs 
> for more than just Kafka systems. I think the interface to do the stream 
> creation belongs in the SystemAdmin interface. It would be nice to have a 
> generic SystemAdmin.createStream() interface, but this would require giving 
> it Kafka-specific configuration. Another option is to have 
> SystemAdmin.createChangelogStream, but this seems a bit hacky at first 
> glance. We need to think this part through.
> [~martinkl], in hello-samza, how are we creating log-compacted state stores 
> with the appropriate number of partitions? Is this handled as part of 
> bin/grid?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
