Ben, This is fantastic! I took a brief look at the project. It's very cool. A few notes:
# Do you plan to automate the Samza config generation portion, so that you can run the full flow from a single command? # What can we do to help? Regarding (2), clearly supporting exactly-once messaging at the Samza level would solve the problem for you. Unfortunately, time/resource constraints are preventing us from adding the feature at the moment. Not sure when Kafka is going to roll it in. Anyway, let me try and answer your questions: > I'm also particularly interested in supporting exactly-once semantics >where possible. This work is coming along, but there are a few cases >where I'm fighting against Samza's regular primitives, and it's made >things more complex or less performant than I'd like. Of course, this is >a problem of my own devising; but I'd be quite grateful if anyone has >comments or suggestions. Samza, itself, will handle exactly-once messaging at some point, but it depends on yet-to-be-implemented Kafka features. Are you saying that you're trying to implement exactly-once messaging on top of Samza? If so, yes, this will be pretty tricky, and likely impact performance. I believe Trident's exactly-once performance suffers for this reason, as well. We looked at doing things this way, but opted for the Kafka-based approach. > Should I create the system myself from config during task init, or is >there a better way? Yea, for now, you'll have to create the system yourself. One thing that I'd been batting around was the idea of exposing all of the objects via the SamzaContainerContext, but that's far off. > If I make both the output and changelog streams synchronous, I think >this should already work -- as long as I make the calls in the correct >order. (Right?) I believe so, yes. > Am I missing anything? If not, would it be worth adding support for this? You're not missing anything. This seems like a worth-while feature, and is similar to the TaskCoordinator.shutdown() work that Martin did to add a request scope. Can you open a JIRA for this? Cheers, Chris On 11/4/14 11:06 AM, "Ben Kirwin" <[email protected]> wrote: >Hello, > >Previously on this list, I mentioned I was working on a high-level >streaming framework, and trying to piggyback it on top of Samza. This >work has been going fairly well; the flow graph is compiling down to >Samza configs, and they run quite nicely on the hello-samza test >cluster. It's still very incomplete, but if you're feeling >adventurous: the project is called 'coast', and the code's up on >GitHub. > >https://github.com/bkirwi/coast/tree/samza >https://github.com/bkirwi/incubator-samza-hello-samza/tree/hello-coast > >More links, examples, and bootstrap instructions are in the READMEs, >and I'm happy to answer any questions. > >But all this is to say: I've been digging fairly deeply into Samza's >structure and implementation, and it's given me some questions of my >own. > >By default, the posted version of coast does not preserve any offsets >or state. Enabling Samza's checkpoint / changelog is straightforward, >but I'm also particularly interested in supporting exactly-once >semantics where possible. This work is coming along, but there are a >few cases where I'm fighting against Samza's regular primitives, and >it's made things more complex or less performant than I'd like. Of >course, this is a problem of my own devising; but I'd be quite >grateful if anyone has comments or suggestions. > >1) A few weeks ago, the idea of a 'replayable message chooser' was >being thrown around: > >http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201409.mbox/% >3CD02B568A.318B1%25criccomini%40linkedin.com%3E > >https://issues.apache.org/jira/browse/SAMZA-405 > >I've been implementing something akin to this, but it ends up being >fairly invasive -- it interacts in a tricky way with the checkpointing >mechanism, so it doesn't fit nicely behind the MessageChooser >interface. I suspect it would be cleaner with framework support, but >I'm not sure if there's broad enough interest to justify it vs. >waiting for Kafka to have transactional messaging. If there is, I'd be >happy to start an implementation discussion on the ticket. > >2) To avoid sending certain messages twice, the StreamTask needs to >check its output stream on startup to see what the 'latest' offset is. >It looks like I can get this information from a SystemAdmin instance, >but this doesn't seem to be accessible to user code. Should I create >the system myself from config during task init, or is there a better >way? > >3) To be sure the state is recoverable after a crash, I need to make >sure data is persisted in a particular order. The messages need to be >sent before the state is changelogged, and both need to be complete >before checkpointing the offsets. (Happy to go into more detail on >this if you like.) If I make both the output and changelog streams >synchronous, I think this should already work -- as long as I make the >calls in the correct order. (Right?) > >But this is slow, since it requires multiple blocking network >roundtrips for every message. It should be more performant to buffer >all the changes, and then trigger a flush: first for the messages, >then the state changelog, then the offset checkpoints. Right now, >though, I don't think this is possible; the commit logic does send the >messages before triggering a checkpoint, but I can't find a way to >ensure certain output streams are flushed before others, or to flush a >single output stream explicitly. Am I missing anything? If not, would >it be worth adding support for this? > >That's all for now, I think. Thanks again; and let me know if you'd >like me to elaborate on anything. > >-- Ben
