The two main problems that I see are that you're mixing two separate things:

1. Tags which are stored in one place, and events which are stored in a
different place
2. Again tags, and document locations. They may seem similar, but they
are completely different as they are implemented. That's why saying "not
in this wiki, except for this tag" cannot ever be implemented in a sane
way since the exception is on a different plane.

The logical approach would be to also store the tags in the events, but
this is not a good thing to do. Where do we stop? For future-proofing
filters, we would have to store the entire document in the event, in
case someone wants to add a filter on "documents with a specific type of
object attached", or "documents with a specific value for a specific
property of a specific object", or "documents with attachments ending in
.extension".

On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> For the tags filter, I can also:
> 
> * perform a query to fetch all documents that correspond to the given tags
> (it doesn't need to be permanent, but it would be better to cache it)
> * add a HQL filter on these pages (OR document IN :list_of_pages).
> 
> It's a little variation of solution A. It's ugly but it could do the job.

It's the sanest proposal so far, but still impossible.

Since "not in LOCATION except WITH TAG" doesn't work because it mixes
two types of information in the same query fragment, how about making
this a one-dimensional query as "only WITH TAG". And since this
information is in a different place, pre-querying it is needed. However,
if there are thousands or tens/hundreds of thousands of documents with a
tag, won't the query crash anyway?

> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
> guillaume.delhum...@xwiki.com>:
> 
>> Hi developers.
>>
>> I am trying to add a new filter to the notifications to be able to follow
>> pages
>> that are marked with a given tag. And it leads me to some questions about
>> the
>> technical implementation of the notifications.
>>
>> To remind the context: notifications are computed on top of the events
>> recorded
>> by the event stream (a.k.a activity stream). We take events from the event
>> stream SQL table, we apply some transformations on them, and we display
>> them to
>> the user.
>>
>> Then we have implemented the ability to filter on these events: for
>> example
>> "don't show events concerning the document A nor the wiki B". Filters are
>> implemented with 2 distinct ways:
>>
>>   1/ SQL injections: each filter can add SQL elements in the query we make
>> to
>>      fetch the events from the event stream table. We made this mechanism
>> so we
>>      can let the database do a lot of the filtering process. After all,
>> it's its
>>      job and it's supposed to perform well. To be precise, Clement has
>> even
>>      created an Abstract Syntax Tree (AST) so it's easier to inject some
>> content
>>      in the query and it creates an abstraction over the SQL language so
>> we can
>>      even consider to change the storage of the event stream someday.
>>
>>      The bad thing is that some complex filtering are difficult to write
>> with
>>      the SQL language (event with the AST) or even impossible.
>>
>>   2/ Post-filtering: after the events have been fetched from the database,
>> each
>>      filter can still decide to keep or filter them. This is useful for
>>      complex filtering that cannot be expressed with the SQL language. It
>> is
>>      also needed by the real-time notification email sender, because it
>> takes
>>      the events immediately when they occurs without fetching them in the
>>      database (so SQL filters are bypassed).
>>
>>      The bad thing is that some events are loaded in the memory to finally
>> be
>>      rejected, and these filters can perform costly operations such as
>> loading
>>      documents.
>>
>> Until now, this double mechanism was working quite well, with each
>> mechanism
>> filling the lacks of the other.
>>
>> However, we still have technical limitations in our design:
>>
>>   1/ Users who have a lot of filter preferences can end up with a giant
>> SQL
>>      query that is almost impossible to perform by the database. Actually
>> we had
>>      a user complaining about an OutOfMemory problem in the HQL to SQL
>>      translator !
>>
>>   2/ I cannot implement the tag filter !
>>
>> The tag filter is supposed to show events that concern pages that hold a
>> given
>> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
>> don't
>> want to receive notifications about wiki A except for pages marked with
>> the tag
>> T".
>>
>> And it is not working. First because it is difficult to write a SQL query
>> for
>> that. It requires to make a join clause with the document and the object
>> tables,
>> which our SQL injection mechanism does not support. Even if it were
>> possible,
>> creating a SQL join with the document table will de-facto filter events
>> that do
>> not concern any pages or pages that do not have any objects: so many other
>> filters will be broken. I also don't consider creating a SQL subquery, I
>> think
>> the whole query would became too big. So I decided to just not inject any
>> SQL
>> code for this filter and only implement the post-filtering mechanism.
>>
>> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
>> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
>> from the
>> database. Consequence: the tag filter never see the events that it is
>> supposed
>> to keep. It would be actually possible to by-pass the first SQL injections
>> by
>> injecting something like "OR 1=1". But doing something like this is like
>> dropping the all SQL injections mechanism.
>>
>> I see some solutions for this problem:
>>
>>   A/ For each tag, create a permanent list of pages that hold it. So I can
>>      inject "OR document IN (that_list)". I think this is heavy.

Definitely not. You'll have to maintain this list in parallel to the
canonical storage for tags, the xwikidoc table itself, with all the
headaches that data duplication brings.

>>   B/ Drop the SQL injection mechanism and only rely on the post-filtering
>>      mechanism. It would require to load from the database A LOT of events,
>>      but maybe we could cache this.

Nope, filtering millions of events by inspecting them one by one is a
performance nightmare.

>>   C/ Don't drop the SQL injection mechanism completely but use it as
>> little as
>>      possible (for example, do not use it for LOCATION filtering). Seems
>> hard to
>>      determine when a filter should use this feature or not.

I think location filtering should still happen in the query itself,
since it is a lot faster.

>>   D/ Don't implement the "tags" filter, since it is the root of the issue.
>> But
>>      it is like sweeping dirt under the carpet!

Well, not implementing something is always the fastest, least intrusive
solution to any problem...

Humor aside, I think that in this case it is the best approach.

>> Since we have the OutOfMemory problem with the SQL injections becoming too
>> huge,
>> I am more in favor of solution B or C. But I'm not sure for now, since I
>> do not
>> know how much it would impact the performances and the scalability of the
>> whole
>> notifications feature.
>>
>> This is a complex topic, but I hope this message will inspire you some
>> suggestions or things I have not seen with my own eyes.
>>
>> Thanks for your help,
>>
>> Guillaume
>>
> 
> 
> 
Other options:

E/ Drop SQL altogether, move the event stream into a nosql database, and
do stream filtering; kind of like the B proposal

F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
a subscriber is created for every user and his filters, and matching
events are placed in a queue until they are all sent (consumed) in a
notification. Obviously, this only works for email notifications, not
for browsing past events in a webpage.

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/

Reply via email to