Thanks, Sijie, for supplying more detailed information about Kafka.

So 1) and 2) are not opposing solutions. They could be done together with 
the same changes.
-------------
Yes, you are right, the changes on the server side are quite similar. I said 2) 
requires large changes because I thought we should include the work of 
recording the consumed sequence id on the client side, but it seems that is the 
application's responsibility now.
I'll create a JIRA for it.

Many thanks to everyone who joined this discussion.

Regards,
Jiannan


From: Sijie Guo <[email protected]>
Date: Sunday, February 24, 2013 2:33 AM
To: "[email protected]" <[email protected]>,
 "[email protected]" <[email protected]>,
 "Yahoo! Inc." <[email protected]>
Cc: Hang Qi <[email protected]>, Hongjian Chen <[email protected]>,
 Bizhu Qiu <[email protected]>, Fangmin Lv <[email protected]>,
 Lin Shen <[email protected]>
Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig




On Sat, Feb 23, 2013 at 2:15 AM, Jiannan Wang <[email protected]> wrote:
Hi Sijie,
   Thanks for explaining the difference between the pub/sub model and the queue 
model so well. I did confuse them when there is only one subscriber on a topic; 
I just wanted to invoke queue semantics to get around the problem :)

--------------------
Two ideas could be pursued to resolve it (similar to what Kafka did):
1) have a subscription option to indicate whether to subscribe starting from 
the latest sequence id or the oldest sequence id.
2) let the subscriber manage its consumed ptr and pass the consumed ptr back 
when subscribing to tell the hub server where to start delivery. This 
subscriber could be a special subscriber distinguished by a subscription option.

Several benefits come with 2):
a) it eliminates the storage and access of subscription metadata.
b) it provides a mechanism to rewind the subscription and replay already 
consumed messages.
--------------------
I see the ConsumerConfig class in Kafka's API but cannot find a related option.

Sorry that I didn't describe it clearly. Kafka lets the consumer maintain the 
consume ptr, rather than the server side.
You could check 1) the 'Simple Consumer' section here: 
http://kafka.apache.org/quickstart.html , 2) the 'Consume State' section here: 
http://kafka.apache.org/design.html


For idea 1), we also need to change the current message garbage collection 
behavior in Hedwig: for a topic with no subscriber, just keep the messages 
within the messageBound limit. I am in favor of this solution.
Idea 2) is cool, though it requires large changes compared to 1).

Neither 1) nor 2) requires big changes.

For 1), we could simply have an option 'whence' in SubscriptionOption, 
indicating where to start the subscription, with two values: OLDEST and LATEST. 
So for a first-time subscription, we pick the oldest or the latest message as 
the consume ptr for this subscription.

For 2), we could have an optional option 'consumedseqid' in SubscriptionOption. 
If the subscriber provides this option, we use the provided 'consumedseqid' as 
the consume ptr. If the 'consumedseqid' is smaller than the oldest message, we 
should move the pointer to the oldest message, and if the 'consumedseqid' is 
larger than the latest message, we should move the pointer to the latest one. 
If the subscriber doesn't provide this option, we fall back to the normal 
case and apply 1).
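
To make the two options concrete, here is a minimal sketch in Java of how a hub 
could pick the starting consume ptr. It is only an illustration under 
assumptions: the class, method, and parameter names are hypothetical and not 
taken from the Hedwig code base, and it assumes the oldest and latest seq ids 
still kept for the topic are already known (with oldest <= latest).

```java
import java.util.OptionalLong;

// Hypothetical sketch: none of these names come from the Hedwig code base.
final class StartPointerSketch {

    /** Where a first-time subscription starts when no 'consumedseqid' is given. */
    enum Whence { OLDEST, LATEST }

    /**
     * Picks the consume ptr for a subscription.
     *
     * @param consumedSeqId optional ptr passed back by the subscriber (idea 2)
     * @param whence        fallback choice for a first-time subscription (idea 1)
     * @param oldestSeqId   seq id of the oldest message still kept for the topic
     * @param latestSeqId   seq id of the latest message of the topic
     */
    static long resolveConsumePtr(OptionalLong consumedSeqId, Whence whence,
                                  long oldestSeqId, long latestSeqId) {
        if (consumedSeqId.isPresent()) {
            long requested = consumedSeqId.getAsLong();
            // Clamp the requested ptr into [oldestSeqId, latestSeqId].
            return Math.max(oldestSeqId, Math.min(requested, latestSeqId));
        }
        // No ptr provided: fall back to the 'whence' option.
        return whence == Whence.OLDEST ? oldestSeqId : latestSeqId;
    }
}
```

With messages 10..20 kept for a topic, a provided 'consumedseqid' of 3 would be 
moved up to 10, a value of 25 would be moved down to 20, and a subscriber that 
provides nothing would start at 10 or 20 depending on 'whence'.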

For completeness: the benefit I described before, eliminating the storage for 
metadata, comes from having a special kind of subscriber (one with a 
subscription option, 'inmemsubscription', indicating it is just an in-memory 
subscription, so the hub server only keeps it in memory during its lifetime). 
Leveraging the above two options, we could have the subscriber maintain the 
subscription state and pass it back when subscribing.

For both 1) and 2), we need to do the following things:

a) change the garbage collection policy to keep messages in line with the 
messageBound limit.
b) read the oldest message seq id from the persistence manager. This is the 
core part we need to improve to achieve the 'subscribe from the oldest' 
semantic. One thing to take care of when reading the oldest message seq id: we 
cannot simply use the first seq id in LedgerRanges, since the first ledger 
might already be deleted but not yet removed from the ledger ranges metadata 
(this happens because there is no transaction between ledger metadata and 
Hedwig metadata).
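
The caveat in b) could be handled along these lines. This is a rough sketch 
only: the LedgerRange class below is a simplified, hypothetical stand-in rather 
than Hedwig's actual metadata type, the ranges are assumed to be ordered oldest 
first, and the 'ledgerExists' check is an assumed callback standing in for 
whatever lookup tells us a ledger has not been deleted.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.function.LongPredicate;

// Hypothetical sketch: simplified stand-ins, not the real Hedwig types.
final class OldestSeqIdSketch {

    /** One entry of a topic's ledger ranges metadata (assumed shape). */
    static final class LedgerRange {
        final long ledgerId;
        final long startSeqId; // seq id of the first message stored in this ledger

        LedgerRange(long ledgerId, long startSeqId) {
            this.ledgerId = ledgerId;
            this.startSeqId = startSeqId;
        }
    }

    /**
     * Returns the first seq id of the oldest ledger that still exists,
     * skipping ranges whose ledger was deleted but is still listed in metadata.
     */
    static OptionalLong oldestSeqId(List<LedgerRange> rangesOldestFirst,
                                    LongPredicate ledgerExists) {
        for (LedgerRange range : rangesOldestFirst) {
            if (ledgerExists.test(range.ledgerId)) {
                return OptionalLong.of(range.startSeqId);
            }
        }
        // Every listed ledger is gone: nothing can be read from these ranges.
        return OptionalLong.empty();
    }
}
```

If the metadata lists ledgers 1, 2 and 3 starting at seq ids 1, 101 and 201, 
and ledger 1 has already been deleted, the oldest readable seq id is 101 rather 
than 1.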

So 1) and 2) are not opposing solutions. They could be done together with 
the same changes.



I see Flavio's reply to Yannick, which suggests using ZooKeeper to coordinate 
the actions of publisher and subscriber. But it's a client-side solution; I 
would prefer solution 1) in Sijie's proposal, which requires no special work on 
the client side.

Thanks,
Jiannan


From: Sijie Guo <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, February 21, 2013 4:50 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>,
 Hang Qi <[email protected]>, Hongjian Chen <[email protected]>,
 Bizhu Qiu <[email protected]>, Fangmin Lv <[email protected]>,
 Lin Shen <[email protected]>

Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Thanks Jiannan for raising the discussion of queue semantics. Some other people 
on the mailing list have asked for queue semantics before.

Basically, a topic (pub/sub) is quite different from a queue in messaging 
concepts. In the pub/sub model, when a publisher publishes a message, it goes 
to all the consumers (subscribers) who are interested, while a queue model 
implements load-balancer semantics. A single message would be consumed by 
almost exactly one consumer. It means that a queue has many consumers, with 
messages load balanced across the available consumers.

If the application requires all consumers to see the same view of published 
messages, a topic is better for it. If the application doesn't care who 
receives and consumes the published messages, a queue is better. But these two 
concepts become similar when there is only one consumer, which can make it 
confusing whether to use a queue or a topic.
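
As a toy illustration of the difference, the sketch below delivers the same 
messages under both models. It is not Hedwig code: the names are made up, the 
consumers are just in-memory lists, the number of consumers is assumed to be at 
least one, and round-robin is only one possible load-balancing policy for a 
queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: in-memory inboxes standing in for real consumers.
final class DeliverySketch {

    /** Topic (pub/sub): every consumer receives every message. */
    static List<List<String>> topic(List<String> messages, int consumers) {
        List<List<String>> inboxes = newInboxes(consumers);
        for (String message : messages) {
            for (List<String> inbox : inboxes) {
                inbox.add(message);
            }
        }
        return inboxes;
    }

    /** Queue: each message goes to exactly one consumer, here chosen round-robin. */
    static List<List<String>> queue(List<String> messages, int consumers) {
        List<List<String>> inboxes = newInboxes(consumers);
        int next = 0;
        for (String message : messages) {
            inboxes.get(next).add(message);
            next = (next + 1) % consumers;
        }
        return inboxes;
    }

    private static List<List<String>> newInboxes(int consumers) {
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            inboxes.add(new ArrayList<>());
        }
        return inboxes;
    }
}
```

With two consumers the two methods give different inboxes, but with a single 
consumer both return the same result, which is the case where the two concepts 
look alike.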

For your case, it is still a pub/sub application. So your first question is how 
to handle this case gracefully in a pub/sub model. Two ideas could be pursued 
to resolve it (similar to what Kafka did):

1) have a subscription option to indicate whether to subscribe starting from 
the latest sequence id or the oldest sequence id.

2) let the subscriber manage its consumed ptr and pass the consumed ptr back 
when subscribing to tell the hub server where to start delivery. This 
subscriber could be a special subscriber distinguished by a subscription option.

Several benefits come with 2):

a) it eliminates the storage and access of subscription metadata.
b) it provides a mechanism to rewind the subscription and replay already 
consumed messages.

As for the garbage collection concern you mentioned, about how long to keep the 
messages, we already have messageBound to limit the length of a topic. We don't 
need to worry about it.

For your second question, it might be nice to have queue semantics in 
Hedwig, since the JMS implementation needs it. But implementing queue semantics 
is a totally different story from pub/sub.

-Sijie


On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <[email protected]> wrote:
Hi guys,
   Under the current Hedwig semantics, a subscriber is not aware of messages 
published before it subscribes to the topic. So in the following example, 
subscriber A can only receive messages after seqId 2.
---------------------------------
Topic T: msg1 msg2 msg3 msg4 ...
                     | <- subscriber A subscribes to the topic
---------------------------------

   This semantic is very reasonable, but the Hedwig client needs to handle one 
corner case: when a new topic is about to be created. Since a topic is lazily 
created by the first request (generally a PUB or SUB), the client side must 
coordinate between publisher and subscriber to make sure the first SUB is 
handled before the first PUB at this very early stage (consider that the 
subscriber may have a very bad network connection which causes the SUB to fail, 
and the user does not want to miss any messages). In summary, it requires 
special work if a subscriber would like to receive all the messages since the 
topic was created, and I think this requirement is very common.

   Handling this problem on the client side is one choice, but I think maybe we 
can simply resolve it on the server side if Hedwig can support queue semantics 
(so that we can also extend the Hedwig JMS provider to support JMS queues in 
BOOKKEEPER-312). As far as I know, the major concern about queue semantics is 
how long to keep the messages, however:
   1. It is the user's responsibility to know about the features and impact of 
queue semantics.
   2. On the other hand, we can add a parameter to limit the queue length.

   In short, here are the two problems I would like to discuss:
   1. How to gracefully resolve the above issue on the server side under the 
current semantics.
   2. Whether or not to introduce queue semantics into Hedwig.

Thanks,
Jiannan

