@Zhanhui Thanks for the response. This is not a campaign; it's just part of
GSoC (https://summerofcode.withgoogle.com/), and community help is gladly
welcomed. In fact, it is recommended :)

@KaiYuan Thanks for your suggestions. I will come up with a flow chart for
the proposed solution this weekend.

Thanks,
Sohaib


On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhan...@gmail.com> wrote:

> Hi Sohaib,
>
> I have been sort of busy these days. Sorry to reply to you so late!
>
> Not sure what “deadline” you are referring to. If this is part of a
> campaign, I have to admit I am not aware of its regulations or what kind
> of help I should offer to maintain fairness, considering other similar
> issues that may arise.
>
> Regards!
>
> Zhanhui Li
>
>
> > On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > Would be nice to have some feedback on this as the deadline is not too
> far :)
> >
> > Thanks,
> > Sohaib
> >
> > Regards,
> > Sohaib Iftikhar
> >
> > -- Man is still the most extraordinary computer of all.--
> >
> >
> > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> > Thank you for the pointers to the code. This was super helpful. The
> > multiple keys can probably be serialized better than separating them with
> > a space, but that is already legacy, I suppose.
> >
> > Firstly, filters like Bloom or cuckoo filters are heuristic. They can
> > help make things faster but definitely cannot be used as the only
> > solution. Hence, in the end, we will still need a persistent
> > keystore/distributed set. My plan was to make this keystore distributed
> > (with Raft guarantees, etc.). The keystore can also hold a persistent
> > filter on its end. If a broker crashes, it can renew/refresh its filter
> > from the keystore, thus eliminating the crash problems you mention. The
> > remaining problem is maintaining filter performance when entries are
> > removed from the keystore (e.g., with the sliding windows mentioned in my
> > previous mail). Periodic rebuilding of the filters can help, but I am
> > open to suggestions on how to make this better.
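The filter-plus-keystore check and the crash-recovery refresh described above could be sketched roughly like this. It is only a sketch in plain Java: the `HashSet` stands in for the proposed distributed, Raft-replicated keystore, and all class and method names are hypothetical, not RocketMQ API.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Broker-side dedup sketch: a fast in-memory filter in front of a persistent keystore. */
public class DedupSketch {

    /** Minimal Bloom filter over message keys (k hash functions via double hashing). */
    static class BloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        BloomFilter(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        private int hash(String key, int i) {
            byte[] b = key.getBytes(StandardCharsets.UTF_8);
            int h1 = 0, h2 = 0;
            for (byte x : b) { h1 = 31 * h1 + x; h2 = 17 * h2 + (x ^ 0x5f); }
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < hashes; i++) bits.set(hash(key, i));
        }

        boolean mightContain(String key) {
            for (int i = 0; i < hashes; i++) if (!bits.get(hash(key, i))) return false;
            return true;
        }
    }

    final Set<String> keystore = new HashSet<>();   // stand-in for the distributed keystore
    BloomFilter filter = new BloomFilter(1 << 16, 3);

    /** Returns true if the message key is new and was accepted. */
    boolean accept(String messageKey) {
        // A negative filter answer is definitive: the key was never seen,
        // so the (expensive) keystore lookup is skipped entirely.
        if (filter.mightContain(messageKey) && keystore.contains(messageKey)) {
            return false;                            // duplicate, discard
        }
        keystore.add(messageKey);
        filter.add(messageKey);
        return true;
    }

    /** After a broker crash, rebuild the lost in-memory filter from the persistent keystore. */
    void rebuildFilterFromKeystore() {
        filter = new BloomFilter(1 << 16, 3);
        for (String k : keystore) filter.add(k);
    }
}
```

The design point is that the filter only saves keystore lookups; correctness always rests on the keystore, which is why a crashed broker can safely rebuild its filter from it.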
> >
> > I think implementing a distributed set on the client cluster has its
> > caveats. The way I understand RocketMQ, we do not have control over the
> > disk space/memory on the client end, so we can probably assume only a
> > constant amount. A distributed set on the client would also need to be
> > persistent, for example to survive a client restarting or recovering.
> > This basically means we would need a keystore on the client instead of
> > the broker cluster, which probably puts too much responsibility on the
> > client cluster. A different approach would be to ensure that the offsets
> > are always in sync with the broker. Since the broker only serves unique
> > messages (based on the proposed solution on the producer/broker end), all
> > we need to ensure is that a client does not consume messages with the
> > same offset twice.
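The consumer-side guarantee described above (never process the same queue offset twice) could be as small as this sketch. The names are illustrative, not RocketMQ API, and a real implementation would persist the offsets rather than keep them in a map.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: reject redeliveries by tracking the highest consumed offset per queue. */
public class OffsetGuard {
    // queueId -> highest offset already consumed
    private final Map<Integer, Long> consumed = new HashMap<>();

    /** Returns true if this offset was not consumed yet, and marks it consumed. */
    public synchronized boolean tryConsume(int queueId, long offset) {
        long last = consumed.getOrDefault(queueId, -1L);
        if (offset <= last) {
            return false;          // redelivery of an already-consumed offset
        }
        consumed.put(queueId, offset);
        return true;
    }
}
```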
> >
> > Please suggest improvements if this does not look like the correct
> > approach. It would also be great if someone could come up with a
> > completely different approach so that we can weigh the pros and cons.
> >
> > Thanks for reading this through and looking forward to your opinions.
> >
> > Regards,
> > Sohaib
> >
> > Regards,
> > Sohaib Iftikhar
> >
> > -- Man is still the most extraordinary computer of all.--
> >
> >
> > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
> > Hi Sohaib,
> >
> > About multiple-key support, the following should clarify your doubt:
> > the org.apache.rocketmq.common.message.Message class has overloaded
> > setKeys methods, allowing you to set multiple keys either via a string
> > (separated by spaces… sorry, we have not yet unified all separators;
> > hoping this does not confuse you) or via a collection.
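If it helps, my understanding is that the collection overload effectively joins the keys with the space separator, so both forms leave the same value in the "KEYS" property. The snippet below is a simplified stand-in, not the real Message class:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Simplified stand-in for Message.setKeys(Collection): join keys with a space. */
public class KeysSketch {
    static final String KEY_SEPARATOR = " ";

    static String join(Collection<String> keys) {
        return String.join(KEY_SEPARATOR, keys);
    }
}
```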
> >
> >
> > When the broker builds the index for a message with multiple keys,
> > multiple index entries are inserted into the index file.
> > See org.apache.rocketmq.store.index.IndexService#buildIndex.
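In spirit, that behaviour is something like the toy sketch below: one index entry is inserted per space-separated key. The real index file is a hashed, memory-mapped structure, so this map-based version is only an illustration, not the actual IndexService logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch: a message with several space-separated keys gets one index entry per key. */
public class KeyIndexSketch {
    // key -> commit-log offsets of messages carrying that key
    final Map<String, List<Long>> index = new HashMap<>();

    void buildIndex(String keysProperty, long commitLogOffset) {
        for (String key : keysProperty.split(" ")) {
            if (key.isEmpty()) continue;
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(commitLogOffset);
        }
    }
}
```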
> >
> >
> > In terms of eliminating message duplication: personally, I wish we had
> > exactly-once semantics covering the whole cluster and the complete
> > send-store-consume process. A rough idea is to route each message to a
> > broker according to its unique key, following some rule; the serving
> > broker then ensures the uniqueness of the message by its key (as you
> > said, with a Bloom filter/cuckoo filter, etc.). Things might look simple,
> > but issues arise when the cluster experiences membership changes: for
> > example, what if a broker crashes? We might need to propagate the
> > Bloom-filter bitset synchronously to other brokers hosting the same
> > topics. What if a new broker joins the cluster and starts to serve? I do
> > not mean this is too complex to implement; on the contrary, this is a
> > pretty interesting topic and a fancy feature to have. Alternatively, we
> > might defer eliminating duplicates to the consumption phase, using some
> > kind of distributed set. For sure, the idea I am proposing suffers from
> > the same challenges, including membership changes.
> >
> > Guys of dev board, any insights on this issue?
> >
> > Zhanhui Li
> >
> >
> >> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> >>
> >> Hi Zhanhui,
> >>
> >> I have a doubt about these multiple keys. If any of the assumptions I
> >> make below are wrong, please point it out.
> >>
> >> If there is support for multiple keys, I cannot see it in the code. The
> >> Message class only stores a single key in the property map against the
> >> property name "KEYS". Is this done in the same way as tags, i.e., are
> >> different keys separated with ' || '? So basically, as a user of the
> >> producer API, it is the user's responsibility to ensure that he
> >> separates the different keys with the correct separator. I can see an
> >> obvious problem here: what if a key contains the special sequence
> >> ' || '? But maybe this case is rare and hence not important. Could you
> >> point me to some source/doc that explains this part? I was looking at
> >> the index section of rocketmq-store, but I have not been able to
> >> understand the indexing process completely yet. I will keep reading the
> >> source to get a better idea.
> >>
> >> Moving on to the implementation details, here is a broad idea of one
> >> possible approach.
> >>
> >> The attempt is to remove duplicate messages. In this issue, I would
> >> like to aim at eliminating duplicate messages at the producer/broker
> >> end. For now, we do not concern ourselves with duplicate messages
> >> arising from unwritten consumer offsets, as these two issues have
> >> different solutions. One way to solve this problem at the
> >> producer/broker end could be a distributed key store that stores the
> >> messages. We can make it configurable such that this distributed store
> >> stores all messages, or works as a sliding window keeping only the
> >> messages from the last X seconds specified by the user. We can have a
> >> layer on top to check set membership, such as a Bloom filter or a
> >> cuckoo filter (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf),
> >> to help performance. Every message pushed in by a producer is first
> >> checked against the filter and, in case of a positive result, against
> >> this key store. If the message is found, it is discarded. This removes
> >> duplicates completely from a producer's perspective. The core of this
> >> idea is the distributed key store, which would be completely separate
> >> from the current message storage. Since the concept of a distributed
> >> key store or key/value store is not novel, there are two ways to do
> >> this:
> >> 1. Implement it ourselves. This would be high effort but would add no
> >> external dependencies.
> >> 2. Use a key-value store such as Redis (which already has timeouts and
> >> persistence but a large memory footprint) or some other disk-based
> >> storage for set membership. This would introduce an external
> >> dependency, but development time would be reduced significantly.
> >> I am inclined towards implementing it myself, as this would avoid
> >> dependencies on other products, especially since RocketMQ is currently
> >> a self-reliant system. In addition, my past experience with building
> >> such a store should come in handy.
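The configurable sliding-window variant mentioned above could be sketched as follows. A single-node HashMap stands in for the proposed distributed store, the clock is passed in to keep the sketch testable, and all names are hypothetical rather than RocketMQ API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch: reject a duplicate key only if it was seen within the last X milliseconds. */
public class SlidingWindowDedup {
    private final long windowMillis;
    private final Map<String, Long> seenAt = new HashMap<>(); // key -> last-seen time

    public SlidingWindowDedup(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the key was not seen within the window, and records it. */
    public boolean accept(String key, long nowMillis) {
        evict(nowMillis);
        Long last = seenAt.get(key);
        if (last != null && nowMillis - last < windowMillis) {
            return false; // duplicate inside the window
        }
        seenAt.put(key, nowMillis);
        return true;
    }

    /** Drop expired entries; a real store would do this lazily or in a background task. */
    private void evict(long nowMillis) {
        for (Iterator<Map.Entry<String, Long>> it = seenAt.entrySet().iterator(); it.hasNext(); ) {
            if (nowMillis - it.next().getValue() >= windowMillis) it.remove();
        }
    }
}
```

The eviction step is where the filter-maintenance problem mentioned earlier shows up: once keys expire from the store, any membership filter layered on top has to be rebuilt periodically, since Bloom filters do not support removal.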
> >>
> >> I would like to know the opinions of the development community on this
> >> approach and to hear suggested improvements. Looking forward to your
> >> responses.
> >>
> >> ====<question unrelated to issue>=====
> >> To increase my familiarity with the code base, and to help demonstrate
> >> that I am familiar with the tools and technologies in place, it would
> >> be great if I could be pointed to some low-effort issues that I could
> >> help out with. In case there are no 'newbie' issues available, I could
> >> help improve the comments inside the codebase. I noticed some source
> >> files with no explanations, which could be documented via comments to
> >> help onboard new contributors faster.
> >> ====</question unrelated to issue>=====
> >>
> >> Thanks a lot for reading this through and looking forward to your
> opinions.
> >>
> >> Regards,
> >> Sohaib
> >>
> >>
> >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhan...@gmail.com> wrote:
> >>
> >>> Hi Sohaib,
> >>>
> >>> Happy to know you are interested in RocketMQ.
> >>>
> >>> First, let me answer the questions you raised.
> >>>
> >>> — Can there be multiple tags?
> >>> No. At present, the storage engine allows a single tag only.
> >>> Subscriptions are allowed to use a combination of tags. The current
> >>> model should meet your business needs. If not, please let us know.
> >>>
> >>>
> >>> — key (similar question to above)
> >>> RocketMQ builds an index using message keys. A single message may have
> >>> multiple keys.
> >>>
> >>> — About redundant messages
> >>> From my understanding, you are trying to eliminate duplicate messages.
> >>> True, there are various causes of message duplication, ranging from
> >>> message delivery to consumption. Discussion on this topic is warmly
> >>> welcome. If you have any ideas to contribute on this issue, the
> >>> developer board is happy to discuss them.
> >>>
> >>> Zhanhui Li
> >>>
> >>>
> >>>
> >>>
> >>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1...@gmail.com> wrote:
> >>>>
> >>>> My earlier email seems to have gotten lost, so I will try again.
> >>>> Please see the original message below for the discussion.
> >>>>
> >>>> Regards,
> >>>> Sohaib Iftikhar
> >>>>
> >>>> -- Man is still the most extraordinary computer of all.--
> >>>>
> >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am interested in working on this issue
> >>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of
> >>>>> GSoC 2018, and I have a few questions about it. I am not sure
> >>>>> whether this discussion should happen on the JIRA issue or here;
> >>>>> feel free to correct me if this is the wrong platform. Also, while I
> >>>>> have worked with distributed pub-sub systems, I am still fairly new
> >>>>> to RocketMQ, so my understanding of it may be incorrect. I apologise
> >>>>> if that is the case and would be happy to stand corrected.
> >>>>>
> >>>>> Following are my questions:
> >>>>> 1. What defines a redundant message?
> >>>>>    The constructor that I see for a message is as follows:
> >>>>>    Message(String topic, String tags, String keys, int flag, byte[]
> >>>>> body, boolean waitStoreMsgOK)
> >>>>>    Possible candidates to me are the topic, the tags (can there be
> >>>>> multiple tags? I could not find an example of this. If yes, how are
> >>>>> they separated?), the keys (similar question to the above) and, of
> >>>>> course, the body. Is there something that I have missed? Is there
> >>>>> something that we do not need to consider?
> >>>>> 2. Is there a time limit on redundant messages? What I mean is: is
> >>>>> there a time window after which a message with the same content is
> >>>>> allowed again? From what I gather, no such thing was mentioned,
> >>>>> which would mean storing all the messages. Depending on the
> >>>>> requirements, this may or may not be the best solution. It might be
> >>>>> desirable that no duplicates appear within a certain (sliding) time
> >>>>> window. This allows ignoring duplicate messages that were generated
> >>>>> very close to each other (or within the indicated window). Depending
> >>>>> on this requirement, the implementation may become a little more
> >>>>> involved.
> >>>>>
> >>>>> For now, these are the only questions. I have ideas that need
> >>>>> review about possible implementations, but I will mention them once
> >>>>> the specifications are clear to me. As a final question: at some
> >>>>> point I would like to post design ideas for this problem privately,
> >>>>> to get them reviewed by the development community without making
> >>>>> them publicly available, so that they cannot be plagiarised. What
> >>>>> platform/method can I use to do that? Or is submitting a draft to
> >>>>> the Google platform the only possible way to accomplish this?
> >>>>>
> >>>>> Thanks a lot for reading this through and looking forward to your
> >>>>> inputs.
> >>>>>
> >>>>> Regards,
> >>>>> Sohaib Iftikhar
> >>>>>
> >>>
> >>>
> >
> >
> >
>
>
