For this particular use case, you can potentially include the message offset from broker A in the message sent to broker B. If the transaction fails, you read the last message from broker B and use the included offset to resume the consumer from broker A. This assumes that there is only a single client writing to broker B (for a particular partition).
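A minimal sketch of that recovery flow, assuming simplified consumer and
producer interfaces (the names below are illustrative, not the actual
Kafka client API):

    // Illustrative Java sketch only; these interfaces are assumed,
    // not the real Kafka client API.
    class SourceMessage {
        long offset;      // position of this message on broker A
        byte[] payload;
    }

    interface ConsumerA {
        SourceMessage next();
        void seek(long offset);  // resume consumption from this offset
    }

    interface ProducerB {
        // Each message written to B carries the A-offset it came from.
        void send(long sourceOffset, byte[] payload);
        Long lastSourceOffset(); // embedded offset of B's last message, or null
    }

    class OffsetChainedCopier {
        // After a crash or aborted transaction, the last message that
        // actually made it into B pins the resume point on A.
        static void resume(ConsumerA a, ProducerB b) {
            Long last = b.lastSourceOffset();
            a.seek(last == null ? 0L : last + 1);
        }

        static void copyOne(ConsumerA a, ProducerB b) {
            SourceMessage m = a.next();
            b.send(m.offset, m.payload); // the offset travels with the payload
        }
    }

The single-writer assumption matters because resume() trusts that the
last message on B's partition was written by this process.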
Thanks,

Jun

On Fri, Oct 26, 2012 at 11:31 AM, Guozhang Wang <g...@cs.cornell.edu> wrote:
> I am also quite interested in this thread, and I have another question
> to ask about committing consumed messages. For example, suppose I need
> a program which acts both as a consumer and a producer, and the actions
> are wrapped in a "transaction":
>
> Transaction start:
>   Get next message from broker A;
>   Do something;
>   Send a message to broker B;
> Commit.
>
> If the transaction aborts after reading the message from broker A, is
> it possible to logically "put the message back" to the brokers? I
> remember that Amazon's Queue Service uses some sort of lease mechanism,
> which might work for this case, but I am afraid that would affect the
> throughput a lot.
>
> --Guozhang
>
> -----Original Message-----
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Friday, October 26, 2012 2:08 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: Transactional writing
>
> This is an important feature and I am interested in helping out in the
> design and implementation, though I am working on 0.8 features for the
> next month so I may not be of too much use. I have thought a little bit
> about this, but I am not yet sure of the best approach.
>
> Here is a specific use case I think is important to address: consider
> a case where you are doing processing of one or more streams and
> producing an output stream. This processing may involve some kind of
> local state (say counters or other local intermediate aggregation
> state). This is a common scenario. The problem is to give reasonable
> semantics to this computation in the presence of failures. The
> processor effectively has a position/offset in each of its input
> streams as well as whatever local state. The problem is that if this
> process fails, it needs to restore to a state that matches the last
> produced messages. There are several solutions to this problem. One is
> to make the output somehow idempotent; this will solve some cases, but
> it is not a general solution, as many things cannot easily be made
> idempotent.
>
> I think the two proposals you give outline a couple of basic
> approaches:
> 1. Store the messages on the server somewhere, but don't add them to
>    the log until the commit call
> 2. Store the messages in the log, but don't make them available to the
>    consumer until the commit call
> Another option you didn't mention:
>
> I can give several subtleties to these approaches.
>
> One advantage of the second approach is that messages are in the log
> and can be made available for reading or not. This makes it possible
> to support a kind of "dirty read" that lets the consumer choose whether
> to immediately see all messages with low latency, at the risk of seeing
> uncommitted messages, or to see only committed messages.
>
> The problem with the second approach, at least in the way you describe
> it, is that you have to lock the log until the commit occurs; otherwise
> you can't roll back (because someone else may have appended their own
> messages and you can't truncate the log). This would have all the
> problems of remote locks. I think this might be a deal-breaker.
>
> Another variation on the second approach would be the following: have
> each producer maintain an id and generation number. Keep a schedule of
> valid offset/id/generation numbers on the broker and only hand these
> out. This solution would support non-blocking multi-writer appends but
> requires more participation from the producer (i.e. getting a
> generation number and id).
>
> Cheers,
>
> -Jay
>
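Jay's id/generation variation can be made concrete with a small fencing
check on the broker side. A hedged sketch, with all names invented (this
is not how any Kafka broker actually works):

    // Hypothetical broker-side bookkeeping for fencing stale producers.
    import java.util.HashMap;
    import java.util.Map;

    class ProducerLease {
        final long producerId;
        final long generation;
        ProducerLease(long producerId, long generation) {
            this.producerId = producerId;
            this.generation = generation;
        }
    }

    class PartitionFence {
        private final Map<Long, Long> currentGeneration = new HashMap<>();

        // Handing out a new lease bumps the generation, fencing any
        // older writer with the same id.
        synchronized ProducerLease register(long producerId) {
            long gen = currentGeneration.merge(producerId, 1L, Long::sum);
            return new ProducerLease(producerId, gen);
        }

        // An append is accepted only under the latest generation.
        synchronized boolean mayAppend(ProducerLease lease) {
            Long gen = currentGeneration.get(lease.producerId);
            return gen != null && gen == lease.generation;
        }
    }

A restarted writer re-registers, receives a higher generation, and the
broker can then reject or discard any orphaned appends from the old
generation without ever locking the log, which is what makes the appends
non-blocking for multiple writers.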
> On Thu, Oct 25, 2012 at 7:04 PM, Tom Brown <tombrow...@gmail.com> wrote:
> > I have come up with two different possibilities, both with different
> > trade-offs.
> >
> > The first would be to support "true" transactions by writing
> > transactional data into a temporary file and then copying it directly
> > to the end of the partition when the commit command is received. The
> > upside to this approach is that individual transactions can be larger
> > than a single batch, and more than one producer could conduct
> > transactions at once. The downside is the extra IO involved in
> > writing the data to, and reading it from, disk an extra time.
> >
> > The second would be to allow any number of messages to be appended
> > to a topic, but not move the "end of topic" offset until the commit
> > was received. If a rollback was received, or the producer timed out,
> > the partition could be truncated at the most recently recognized "end
> > of topic" offset. The upside is that there is very little extra IO
> > (only to store the official "end of topic" metadata), and it seems
> > like it should be easy to implement. The downside is that this
> > "transaction" feature is incompatible with anything but a single
> > producer per partition.
> >
> > I am interested in your thoughts on these.
> >
> > --Tom
> >
> > On Thu, Oct 25, 2012 at 9:31 PM, Philip O'Toole <phi...@loggly.com> wrote:
> > > On Thu, Oct 25, 2012 at 06:19:04PM -0700, Neha Narkhede wrote:
> > >> The closest concept of a transaction on the publisher side that I
> > >> can think of is using a batch of messages in a single call to the
> > >> synchronous producer.
> > >>
> > >> Precisely, you can configure a Kafka producer to use the "sync"
> > >> mode and batch the messages that require transactional guarantees
> > >> in a single send() call. That will ensure that either all the
> > >> messages in the batch are sent or none.
> > >
> > > This is an interesting feature -- something I wasn't aware of.
> > > Still, it doesn't solve the problem *completely*. As many people
> > > realise, it's still possible for the batch of messages to get into
> > > Kafka fine, but for the ack from Kafka to be lost on its way back
> > > to the producer. In that case the producer erroneously believes the
> > > messages didn't get in, and might re-send them.
> > >
> > > You guys *haven't* solved that issue, right? I believe you write
> > > about it on the Kafka site.
> > >
> > >> Thanks,
> > >> Neha
> > >>
> > >> On Thu, Oct 25, 2012 at 2:44 PM, Tom Brown <tombrow...@gmail.com> wrote:
> > >> > Is there an accepted, or recommended, way to make writes to a
> > >> > Kafka queue idempotent, or within a transaction?
> > >> >
> > >> > I can configure my system such that each queue has exactly one
> > >> > producer.
> > >> >
> > >> > (If there are no accepted/recommended ways, I have a few ideas I
> > >> > would like to propose. I would also be willing to implement them
> > >> > if needed.)
> > >> >
> > >> > Thanks in advance!
> > >> >
> > >> > --Tom
> > >
> > > --
> > > Philip O'Toole
> > >
> > > Senior Developer
> > > Loggly, Inc.
> > > San Francisco, Calif.
> > > www.loggly.com
> > >
> > > Come join us!
> > > http://loggly.com/company/careers/
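For reference, Tom's second proposal maps to a small state machine per
partition. A toy, single-producer, in-memory model (all names invented;
a real broker would persist the committed offset as durable metadata):

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the "end of topic" offset proposal; illustrative only.
    class TransactionalPartition {
        private final List<byte[]> log = new ArrayList<>();
        private int committedEnd = 0; // the official "end of topic" offset

        // Appends land in the log immediately, but beyond the committed end.
        synchronized int append(byte[] message) {
            log.add(message);
            return log.size() - 1;
        }

        // Consumers only ever see messages below the committed end offset.
        synchronized byte[] read(int offset) {
            return offset < committedEnd ? log.get(offset) : null;
        }

        // Commit just moves the visible end forward: a cheap metadata write.
        synchronized void commit() {
            committedEnd = log.size();
        }

        // Rollback (or a producer timeout) truncates the uncommitted tail.
        // Truncation is what restricts the scheme to a single producer per
        // partition: a second producer's messages could be in the tail.
        synchronized void rollback() {
            while (log.size() > committedEnd) {
                log.remove(log.size() - 1);
            }
        }
    }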