Hi Sohaib, Sorry for the late reply, we could move this project forward now ~
``` I would at some point like to post design ideas to this problem privately to get it reviewed by the development community but not make it publicly available so that it cannot be plagiarised. ``` You can send your design ideas to me directly or to our PMC list( priv...@rocketmq.apache.org) if you want to make your ideas privately. But please don't break away from the community. I hope you have already understood the goal of this project. Now, RocketMQ support At-least-once delivery, it's an obvious solution that achieves Exactly-Once by removing duplicated messages. Return to your original questions: 1. What defines a redundant message? A message id will be generated when new a message, so this id can be used to identify a message. Also, the user could specify a unique business-related property to identify a message. The redundant messages will occur when the network is broken or reconnected, rebalance[1] is triggered, etc. 2. Is their a timeline on the redundant messages? Yes, keep all messages nonredundant is expensive, let's consider this question within a certain time window ~ Looking forward to your design. [1]. https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java Regards, yukon On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote: > @Zhanhui Thanks for the response. This is not a campaign its just part of > GSoC (https://summerofcode.withgoogle.com/). And community help is gladly > welcomed. In fact, it is recommended :) > > @KaiYuan Thanks for your suggestions. I will come up with a flow chart for > the proposed solution this weekend. > > Thanks, > Sohaib > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhan...@gmail.com> wrote: > > > Hi Sohaib, > > > > I have been sort of busy this these days. Sorry to reply you so late! > > > > So sure what “deadline” you are referring to. If this is part of a > > campaign, I have to admit I am not aware of the regulations and what kind > > of help I should offer to maintain fairness considering other arising > > similar issues. > > > > Regards! > > > > Zhanhui Li > > > > > > > 在 2018年3月1日,上午3:43,Sohaib Iftikhar <sohaib1...@gmail.com> 写道: > > > > > > Hi guys, > > > > > > Would be nice to have some feedback on this as the deadline is not too > > far :) > > > > > > Thanks, > > > Sohaib > > > > > > Regards, > > > Sohaib Iftikhar > > > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar < > sohaib1...@gmail.com > > <mailto:sohaib1...@gmail.com>> wrote: > > > Thank you for the pointers to the code. This was super helpful. The > > multiple keys can probably be serialized better than separating them > with a > > space but that is already legacy I suppose. > > > > > > Firstly filters like bloom or cuckoo are heuristic. They can help make > > things faster but definitely cannot be used as the only solution. Hence, > in > > the end, we will still need a persistent keystore/distributed set. My > plan > > was to have this keystore as distributed (raft guarantee etc.). The > > keystore can also hold a persistent filter on its end. If a broker > > collapses it can renew/refresh its filter from the keystore. Hence > > eliminating the problems about crashes that you mention. The problem here > > could be in maintaining performance for filters in case of removals from > > the keystore (for eg: sliding windows as mentioned in my previous mail). > > Periodic refreshal of filters can help solve this but I am open to > > suggestions on how to make this better. > > > > > > I think implementing a distributed set on the client cluster has its > > caveats. The way I understand RocketMQ is that we do not have control > over > > the diskspace/memory on the client end. So we probably only have a > constant > > amount. A distributed set on the client would also need to be persistent. > > For eg: if a client restarts/recovers etc. This basically means we need a > > keystore on the client instead of the broker cluster. This probably puts > > too much responsibility on the client cluster. A different approach would > > be to ensure that the offsets are always in sync with the broker. Since > the > > broker only serves unique messages (based on the proposed solution on the > > producer/broker end) all we need to ensure is that a client does not > > consume messages with the same offset twice. > > > > > > Please suggest improvements if this does not look like the correct > > approach. Also would be great if someone can come up with a completely > > different approach so that we can weigh up pros and cons. > > > > > > Thanks for reading this through and looking forward to your opinions. > > > > > > Regards, > > > Sohaib > > > > > > Regards, > > > Sohaib Iftikhar > > > > > > -- Man is still the most extraordinary computer of all.-- > > > > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com > > <mailto:lizhan...@gmail.com>> wrote: > > > Hi Sohaib, > > > > > > About multiple key support, the following code snippet should clarify > > your doubt: > > > org.apache.rocketmq.common.message.Message class has overloaded > setKeys > > methods, allowing your to set multiple keys via string(separated by > > space…sorry, we have not yet unified all separators, hoping this does not > > confuse you) or collection. > > > > > > > > > When broker tries to build index for the message with multiple keys, > > multiple index entries are inserted into the indexing file. > > > See org.apache.rocketmq.store.index.IndexService#buildIndex > > > > > > > > > In terms of eliminating message duplication, personally, I wish we have > > a feature of exactly-once semantic covering the whole cluster and the > > complete send-store-consume processes. A rough idea is route the message > > according to its unique key to a broker according to a rule; The serving > > broker ensures uniqueness of the message according to the key( as you > said, > > bloom-filter/cuckoo-filter, etc); Things might looks simple, but issues > > resides in scenarios where cluster is experiencing membership changes: > for > > example, what if a broker crashed down? We might need propagate > > bloom-filter bitset synchronously to other brokers having the same > topics; > > What if a new broker joins in the cluster and starts to serve? I do not > > mean this is too complex to implement. Instead, this is a pretty > > interesting topic and fancy feature to have. Alternatively, we might > defer > > eliminating duplicates to the consumption phase using kind of distributed > > set. For sure, my proposing idea suffers the same challenges including > > membership changes. > > > > > > Guys of dev board, any insights on this issue? > > > > > > Zhanhui Li > > > > > > > > >> 在 2018年2月26日,上午2:47,Sohaib Iftikhar <sohaib1...@gmail.com <mailto: > > sohaib1...@gmail.com>> 写道: > > >> > > >> Hi Zhanhui, > > >> > > >> I have a doubt about these multiple keys. If I am wrong in any of the > > >> assumptions I make please point it out. > > >> > > >> If there is support for multiple keys I cannot see this in the code. > The > > >> class Message only stores a single key in the property map against the > > >> property name "KEYS". Is this also done in the same ways as tags? That > > is > > >> different keys are separated with ' || '? So basically as a user of > the > > >> producer API it is the user's responsibility to ensure that he > separates > > >> the different keys with the correct separator. I can see an obvious > > problem > > >> here. What if the key contains this special character ' || '? But > maybe > > >> this event is rare and hence this is not important. Could you point me > > to > > >> some source/doc that explains this part? I was looking at the index > > section > > >> rocketmq-store but I have not been able to understand the indexing > > process > > >> completely for now. I will keep reading the source to get a better > idea. > > >> > > >> Moving on to the implementational details. Here is a broad idea of one > > >> possible way to approach it. > > >> > > >> The attempt is to remove duplicate messages. In this issue, I would > > like to > > >> aim at eliminating duplicate messages at the producer/broker end. For > > now, > > >> we do not concern ourselves with the duplicate messages happening due > to > > >> unwritten consumer offsets as these two issues have different > solutions. > > >> One way to solve this problem at the producer/broker end could be to > > have a > > >> distributed key store that stores the messages. We can make it > > configurable > > >> such that this distributed store stores all messages or works as a > > sliding > > >> window keeping only the messages from the last X seconds specified by > > the > > >> user. We can have a layer on top to check set membership such as a > bloom > > >> filter or a cuckoo filter ( > > >> https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf < > > https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>) to help > > >> performance. Every message being pushed in by a producer are checked > in > > >> first with the filter and in case of a positive result with this key > > store. > > >> If the message is found then it is discarded. This helps remove > > duplicates > > >> completely from a producer perspective. The core of this idea is the > > >> distributed key store which would be completely separate from the > > current > > >> message storage. Since the concept of a distributed key store or a > > >> key/value store is not novel there are two ways to this. > > >> 1. Implement it ourselves. This would be high effort but no external > > >> dependencies. > > >> 2. Use a key-value store such as Redis (which already has timeouts and > > >> persistence but a large memory footprint) or some other disk-based > > storage > > >> for set membership. This would include an external dependency but > > >> development time will reduce significantly for such a solution. > > >> I am inclined towards implementing it by myself as this would avoid > > >> dependencies on other products especially since RocketMQ is currently > a > > >> self-reliant system. In addition, my past experience with building > such > > a > > >> store should also come in handy. > > >> > > >> I would like to know the opinions of the development community on this > > >> approach and to suggest improvements on it. Looking forward to your > > >> responses to this. > > >> > > >> ====<question unrelated to issue>===== > > >> To increase my familiarity with the code base and to help prove that I > > am > > >> familiar with the tools and technologies in place it would be great > if I > > >> could be pointed to some low effort issues that I could help out with. > > In > > >> case there are no 'newbie' issues available I could help improve the > > >> comments inside the codebase. I noticed some source files with no > > >> explanations which can be documented via comments to help onboard a > new > > >> contributor faster. > > >> ====</question unrelated to issue>===== > > >> > > >> Thanks a lot for reading this through and looking forward to your > > opinions. > > >> > > >> Regards, > > >> Sohaib > > >> > > >> > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhan...@gmail.com > > <mailto:lizhan...@gmail.com>> wrote: > > >> > > >>> Hi Sohaib, > > >>> > > >>> Happy to know you are interested in RocketMQ. > > >>> > > >>> First, let me answer questions you raised. > > >>> > > >>> — can there be multiple tags? > > >>> No. At present, the storage engine allows single tag only. > > Subscriptions > > >>> are allowed to use combination of tags. The current model should meet > > your > > >>> business development. If not, please let us know. > > >>> > > >>> > > >>> — key (Similar question to above.) > > >>> RocketMQ builds index using message keys. A single message may have > > >>> multiple keys. > > >>> > > >>> — About redundant message > > >>> From my understanding, you are trying to eliminate duplicate > messages. > > >>> True there are various reasons which may cause message duplication, > > ranging > > >>> from message delivery and consumption. Discussion on this topic is > > warmly > > >>> welcome. Had you had any idea to contribute on this issue, the > > developer > > >>> board is happy to discuss. > > >>> > > >>> Zhanhui Li > > >>> > > >>> > > >>> > > >>> > > >>>> 在 2018年2月24日,上午11:17,Sohaib Iftikhar <sohaib1...@gmail.com <mailto: > > sohaib1...@gmail.com>> 写道: > > >>>> > > >>>> My earlier email message seems to have gotten lost. So I will try > > again. > > >>>> Please see the original message for the discussion. > > >>>> > > >>>> Regards, > > >>>> Sohaib Iftikhar > > >>>> > > >>>> -- Man is still the most extraordinary computer of all.-- > > >>>> > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar < > > sohaib1...@gmail.com <mailto:sohaib1...@gmail.com>> > > >>>> wrote: > > >>>> > > >>>>> Hi, > > >>>>> > > >>>>> I am interested in working on this issue ( > https://issues.apache.org/ > > <https://issues.apache.org/> > > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a few > questions > > for > > >>>>> the same. I am not sure if this discussion needs to be on the JIRA > > >>> issue or > > >>>>> here. Feel free to correct me if this is the wrong platform. Also > > while > > >>> I > > >>>>> have worked with distributed pub-sub systems I am still fairly new > to > > >>>>> Rocket-MQ so maybe my understanding of it is incorrect. I apologise > > if > > >>> that > > >>>>> is the case and would be happy to stand corrected. > > >>>>> > > >>>>> Following are my questions: > > >>>>> 1. What defines a redundant message? > > >>>>> The constructor that I see for a message is as follows: > > >>>>> Message(String topic, String tags, String keys, int flag, byte[] > > >>> body, > > >>>>> boolean waitStoreMsgOK) > > >>>>> Possible candidates to me are topic, tags (can there be multiple > > >>> tags? > > >>>>> I could not find an example for this. If yes how are they > > separated?), > > >>> keys > > >>>>> (Similar question to above.) and of course the body. Is there > > something > > >>>>> that I have missed in this? Is there something that we do not need > to > > >>>>> consider? > > >>>>> 2. Is their a timeline on the redundant messages? What I mean by > > this is > > >>>>> that is there a time limit after which a message with similar > > content is > > >>>>> allowed. From what I gather there was no such thing mentioned. This > > >>> would > > >>>>> mean storing all the messages. Depending on the requirements this > > may or > > >>>>> may not be the best solution. It might be desirable that no > > duplicates > > >>> are > > >>>>> needed within a certain time window (sliding). This allows ignoring > > of > > >>>>> duplicate messages that were generated very close to each other (or > > in > > >>> the > > >>>>> window indicated). Depending on this requirement implementation may > > >>> become > > >>>>> a little bit more involved. > > >>>>> > > >>>>> For now, these are the only questions. I have ideas that need > review > > >>> about > > >>>>> possible implementations but I will mention them once the > > specifications > > >>>>> are clear to me. As an end question, I would at some point like to > > post > > >>>>> design ideas to this problem privately to get it reviewed by the > > >>>>> development community but not make it publicly available so that it > > >>> cannot > > >>>>> be plagiarised. What platform/method can I use to do that? Or is > > >>> submitting > > >>>>> a draft to the Google platform the only possible way to accomplish > > this? > > >>>>> > > >>>>> Thanks a lot for reading this through and looking forward to your > > >>> inputs. > > >>>>> > > >>>>> Regards, > > >>>>> Sohaib Iftikhar > > >>>>> > > >>> > > >>> > > > > > > > > > > > > > >