On Oct 18, 2013, at 1:06 PM, Sanne Grinovero <[email protected]> wrote:

> On 18 October 2013 12:12, Mircea Markus <[email protected]> wrote:
>> 
>> On Oct 17, 2013, at 11:29 PM, Sanne Grinovero <[email protected]> wrote:
>> 
>>> On 17 October 2013 20:19, Mircea Markus <[email protected]> wrote:
>>>> let's keep this on -dev.
>>> 
>>> +1
>>> 
>>>> On Oct 17, 2013, at 6:24 PM, Sanne Grinovero <[email protected]> wrote:
>>>>> ----- Original Message -----
>>>>>> 
>>>>>> On Oct 17, 2013, at 2:28 PM, Sanne Grinovero <[email protected]> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>> On Oct 17, 2013, at 1:31 PM, Sanne Grinovero <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> With some custom coding it's certainly possible to define an event
>>>>>>>>> listener
>>>>>>>>> which triggers when an entry is inserted/removed which matches a 
>>>>>>>>> certain
>>>>>>>>> Query.
>>>>>>>> 
>>>>>>>> where would the query result be held? a cache perhaps?
>>>>>>> 
>>>>>>> Why do you need to hold on to the query result?
>>>>>>> I was thinking to just send an event "newly stored X matches query Q1".
>>>>>> 
>>>>>> You don't have a single process receiving all the notifications then, but
>>>>>> multiple processes in the cluster. It's up to the user to aggregate these
>>>>>> results (that's why I mentioned a cache), but without aggregation this
>>>>>> feature is pretty limiting.
>>>>> 
>>>>> I have no idea if it's limiting. For the use case I understood, that's 
>>>>> pretty decent.
>>>> 
>>>> Here's my understanding of CQ [1]: a user queries a cache 10,000,000 (add
>>>> as many zeros as you like) times per second.
>>>> Instead of executing the query every time (very resource-consuming), the
>>>> system caches the query result, updates it when the underlying data gets
>>>> modified, and returns it to the user on every invocation. Optionally you can
>>>> register a listener on the query result, but that's just API sugar.
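The mechanism described above (cache the query result, update it incrementally on writes, return it on every invocation) can be sketched in plain Java. This is an illustrative single-VM toy under my own naming, not Infinispan's API; `ContinuousQueryCache` and its methods are made up for the sketch:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch (not Infinispan API): keep a query's result set
// up to date incrementally instead of re-running the query per call.
public class ContinuousQueryCache<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final Predicate<V> query;                       // the continuous query's filter
    private final Map<K, V> cachedResult = new HashMap<>(); // maintained incrementally

    public ContinuousQueryCache(Predicate<V> query) {
        this.query = query;
    }

    public void put(K key, V value) {
        data.put(key, value);
        if (query.test(value)) {
            cachedResult.put(key, value);   // entry now matches: add to the result
        } else {
            cachedResult.remove(key);       // entry no longer matches: drop it
        }
    }

    public void remove(K key) {
        data.remove(key);
        cachedResult.remove(key);
    }

    // Every invocation returns the maintained result; no query re-execution.
    public Collection<V> results() {
        return Collections.unmodifiableCollection(cachedResult.values());
    }
}
```

The point of the sketch is only the cost model: writes pay one predicate test, reads pay nothing.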
>>> 
>>> That's an implementation detail, I need a use case.
>>> 
>>> Assuming you store a good amount of entries (maybe so many that you
>>> actually need a data grid instead of a simple HashMap or a USB stick),
>>> as a Query user I don't think I would always want to fetch all the data
>>> locally, when all I need is maybe to sound an alarm bell.
>>> 
>>> A use case could be that I'm interested in some stocks; specifically, I
>>> want to be notified ASAP of price changes for the stock traded on
>>> market "Neverland", so I register a continuous query "from stock where
>>> stock.market = 'Neverland' ".
>>> Let's also assume that Neverland trades approximately 5,000 titles.
>>> 
>>> My application starts and fetches all current values with a one-off
>>> full query (using that same query), so I fetch all 5,000 locally. Next
>>> step, I want to be notified ASAP when one of these changes value, so
>>> that I can react to it.
>>> Then I get my first notification! Cool, my nice List API provides me
>>> with the new values for all 5,000 titles... but which one changed? Let
>>> me scan my previous results to find out...
>>> (Note that I'm not even getting into the detail of how we got all
>>> those titles locally: using deltas or not is irrelevant).
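The event-per-change alternative for this stock use case could look roughly like the following plain-Java sketch. `Stock`, `StockCache` and `onMatch` are hypothetical names invented here; a real Infinispan-based version would register listeners per node rather than on one local map:

```java
import java.util.*;
import java.util.function.*;

// Illustrative only: event-per-change delivery for the stock example,
// instead of re-delivering the whole 5,000-title result List.
public class StockCache {
    public record Stock(String symbol, String market, double price) {}

    private final Map<String, Stock> store = new HashMap<>();
    private final List<Map.Entry<Predicate<Stock>, Consumer<Stock>>> listeners =
            new ArrayList<>();

    // Register "from stock where stock.market = 'Neverland'" as a
    // predicate plus a callback invoked once per matching change.
    public void onMatch(Predicate<Stock> query, Consumer<Stock> callback) {
        listeners.add(Map.entry(query, callback));
    }

    public void put(Stock s) {
        store.put(s.symbol(), s);
        // Push exactly the changed entry -- nothing to re-scan client-side.
        for (var l : listeners) {
            if (l.getKey().test(s)) l.getValue().accept(s);
        }
    }
}
```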
>>> 
>>> That's certainly doable, but what if you have more than 5,000 titles?
>>> It degenerates. Of course you could wrap this "resultset" in some
>>> more syntactic sugar, but essentially what you need to implement the
>>> client-side API is to receive the single events.
>>> 
>>> I'm not focusing on the client side sugar because of Divya's original 
>>> question:
>>> "a feasible path to achieve this functionality via some custom
>>> coding, even though it is not the most efficient path (because
>>> Continuous Queries are not available out of the box)."
>>> 
>>> From a very different perspective, look at it in terms of a scalable
>>> architecture: when dealing with large amounts of data, the List
>>> interface is conceptually not cutting it; I would expect you to ban
>>> it, not to encourage it.
>>> Assuming the client is also designed as a properly scalable system,
>>> if you were to provide it with a List it would likely need to
>>> iterate on it to forward each single element as a task to some
>>> parallel executor. It's much simpler if you push them one by one: it
>>> could still wrap each in a task, but you cut the latency you would
>>> otherwise introduce by collecting all the single items, and you can
>>> allow users to insert a load balancer between your crazy scalable
>>> event generator and the target of these notifications.
>>> 
>>> (Because really, if you set up such a feature on a large grid, it will
>>> become a crazy scalable event generator.)
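The consumer side described here (wrapping each pushed event in a task for a parallel executor, instead of iterating over a collected List) might be sketched as below; `EventFanout` is a made-up name and the pool size is arbitrary:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the consumer side: each pushed event becomes an independent
// task, so processing starts immediately rather than after a full List
// has been assembled.
public class EventFanout {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger processed = new AtomicInteger();

    public void onEvent(String event) {
        pool.submit(() -> {
            // react to the single event here (e.g. sound the alarm bell)
            processed.incrementAndGet();
        });
    }

    // Wait for in-flight tasks and report how many events were handled.
    public int drain() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

A load balancer, as suggested above, would simply sit between the event producer and many such consumers.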
>>> 
>>>>>>> You could register multiple such listeners, getting the effect of "newly
>>>>>>> stored entry X matches Query set {Q1, Q3, Q7}"
>>>>>> 
>>>>>> The listeners would not be collocated.
>>>>> 
>>>>> I'm not going to implement distributed listeners, I indeed expect you to 
>>>>> register such a listener on each node.
>>>> 
>>>> If I run a query, continuous or not, I'd expect to be able to get the
>>>> whole result set of that query on the process from which I invoke it.
>>>> Call me old fashioned :-)
>>>> 
>>>>> 
>>>>> I can show how to make Continuous Queries on the Query API to accomplish
>>>>> this.
>>>> 
>>>> I wouldn't call the problem your solution solves "Continuous Query" :-)
>>>> 
>>>>> Anything else is out of scope for me :-) Technically I think it's out of 
>>>>> scope for Infinispan too, it should delegate to a message bus.
>>>> 
>>>> -1, for the reasons mentioned above.
>>>> 
>>>> [1] http://coherence.oracle.com/display/COH31UG/Continuous+Query
>>> 
>>> Do you realize this page confirms that a List is fundamentally wrong? :-)
>>> It lists a bunch of fallacies to explain common errors, which all
>>> boil down to an attempt to iterate over the entries, and then states:
>>> 
>>> "The solution is to provide the listener during construction, and it
>>> will receive one event for each item that is in the Continuous Query
>>> Cache, whether it was there to begin with (because it was in the
>>> query) or if it got added during or after the construction of the
>>> cache"
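The pattern the quoted passage describes (listener supplied at construction, one event per matching item whether it was already present or gets added later) can be sketched in plain Java. This is not Coherence's actual API; `QueryView` and its methods are names invented for the illustration:

```java
import java.util.*;
import java.util.function.*;

// Illustrative sketch of "provide the listener during construction":
// the listener receives one event per matching item, whether the item
// was already in the view or is added afterwards.
public class QueryView<K, V> {
    private final Predicate<V> query;
    private final BiConsumer<K, V> listener;

    public QueryView(Map<K, V> existing, Predicate<V> query, BiConsumer<K, V> listener) {
        this.query = query;
        this.listener = listener;
        // one event for each item already matching at construction time
        existing.forEach((k, v) -> { if (query.test(v)) listener.accept(k, v); });
    }

    // called by the cache for entries added after construction
    public void onAdd(K key, V value) {
        if (query.test(value)) listener.accept(key, value);
    }
}
```

Note there is no window between "initial query" and "listener registration" in this shape, which is exactly the consistency gap discussed below.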
>>> 
>>> Finally, a consistency consideration on how to create such a list: if
>>> you get multiple events in a short time, you'll never know which one is
>>> correct because of the interleaving of the notifications. There is no way
>>> to iterate over (search) a list of results in Infinispan in a consistent
>>> transactional view, unless you want me to lock all entries and repeat
>>> the query to confirm.
>> 
>> For many, many users getting a snapshot result is good enough. After all,
>> this is how relational databases are queried.
>> 
>>> By NOT providing a List access, you avoid the
>>> problem of consistency and don't introduce contentions points like
>>> "aggregating it all in one placeholder".
>> 
>> Well, Coherence supports both a List (the CQ Cache itself) and events,
>> events being the preferred way when you don't want to miss any update to
>> the result set.
>> Also, very importantly, the mechanism you described doesn't offer this
>> consistency guarantee (e.g. between the time the user runs the query and
>> registers the listeners, things might change).
> 
> That's what I said: you can't make a List in that time, but the event
> happened so it's fair to notify about it.
> 
>> Another (fundamental, IMO) limitation of the approach we can offer is the
>> locality of the notifications: the initial query executes on node A,
>> while future notifications of other elements matching the query
>> criteria arrive on nodes B, C, etc.
>> 
>>> Also interesting from Coherence's wiki: they have their results
>>> implement InvocableMap, essentially a representation of a conceptual
>>> data partition on which you can then invoke operations, moving the
>>> execution to the data. I think that's brilliant, and it makes it quite
>>> clear that no such list is sent to the client.
>> 
>> Not really, the cache itself is the list :-)
> 
> That sounds very confusing to me, the cache is definitely not a list.
> If you mean to point out that it "represents" a local view of all
> data,

yes :-)

> that's fishy as it either contains a copy of all data (not nice
> when it's large)

Not if you only keep the set of keys locally and fetch the values (you might 
not even need them) on demand.
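A keys-only local view with on-demand value fetch, as suggested here, might look like the following toy sketch; `KeysOnlyView` is a made-up name and the `remoteLookup` function stands in for the RPC to the node owning the entry:

```java
import java.util.*;
import java.util.function.Function;

// Sketch: keep only the matching keys locally; values are fetched on
// demand through a loader that stands in for the remote cache lookup.
public class KeysOnlyView<K, V> {
    private final Set<K> keys = new LinkedHashSet<>();
    private final Function<K, V> remoteLookup;

    public KeysOnlyView(Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    public void keyAdded(K key)   { keys.add(key); }
    public void keyRemoved(K key) { keys.remove(key); }

    public Set<K> keys() { return Collections.unmodifiableSet(keys); }

    // Only here do we pay the cost of the remote fetch, and only for one key.
    public V value(K key) {
        return keys.contains(key) ? remoteLookup.apply(key) : null;
    }
}
```

The local footprint is one key per match; whether the lazy fetch is "extremely slow" or "just a cache lookup" is exactly the trade-off debated above.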

> or it's a proxy which will be extremely slow by
> "lazy-loading" each entry.

Indeed you might need to get the value based on the key with an RPC. I wouldn't 
call that extremely slow; after all, it's just a cache lookup.

> The InvocableMap approach sounds far more
> interesting in terms of locality.

It's still something that will go remote on every invocation. If you need to 
do that very often (a few thousand times a second), it's better to cache the 
results locally.

> 
>> 
>> I don't think that with what we currently have we're that close to CQ 
>> caches as the industry "defines" them. If this listener-based distributed 
>> notification can be useful, then very good. I would refrain from marketing 
>> this as CQ support, as it would create false expectations.
> 
> Happy to not do it!

I don't think the query API extension you mention is critical here, as the 
filtering logic can be expressed directly in Java (which might actually be more 
convenient/flexible).
Looking around, the CQ functionality that's missing in ISPN is:
- offer a way to receive all the notifications in the same VM
- offer a way to cache the result (might be keys only) in order to avoid 
executing the same query very often
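The two missing pieces just listed could be prototyped with custom coding along these lines; `Aggregator` and `notifyFromNode` are hypothetical names, and in a real deployment each node's listener would forward its matches to this single VM-local instance:

```java
import java.util.concurrent.*;
import java.util.function.Predicate;

// Sketch of the two missing pieces: a filter written directly in Java,
// and a single in-VM aggregation point for notifications that would
// otherwise fire on whichever node owns the modified entry.
public class Aggregator<V> {
    private final Predicate<V> filter;
    private final BlockingQueue<V> inbox = new LinkedBlockingQueue<>();

    public Aggregator(Predicate<V> filter) {
        this.filter = filter;
    }

    // Each node's listener calls this; matches funnel into one local queue.
    public void notifyFromNode(V value) {
        if (filter.test(value)) inbox.add(value);
    }

    // Drain one aggregated notification, or null if none is pending.
    public V poll() {
        return inbox.poll();
    }
}
```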

Let's continue our chat on this next week ;)
 
Cheers,
-- 
Mircea Markus
Infinispan lead (www.infinispan.org)

_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
