Hi Mike. Thanks for the reply.
A few items:

*Testing that registered queries affect ingestion:* We restarted each forest and re-did ingestion free of any registered queries. We then primed the system (as the client has already been doing) by registering the queries and then doing ingestion again with no one hitting the queries.

*Re-registering to protect against XDMP-UNREGISTERED:* The code in place already takes care of this per user per query type. I believe they actually do a try/catch and re-register on failure, but I will dive into that tomorrow.

*Lazy registering:* Registering the queries for just two users takes about 10 minutes, so we definitely need to dive into what is under the hood and optimize the queries. Until then, lazy registering would still take too long. Plus, once ingestion starts, the queries again do not perform as needed.

*Smart registering:* Yes, it is interesting to think of registering the queries when a user logs in, etc. However, the client would likely (rightly) point out that the clients that take the longest to register still need to be faster.

*Query optimizing:* That's the next step. Initial analysis already shows that the queries can be reorganized and reduced. The original challenge seems to be that the original entitlements file(s) generate quite a few extra queries because they are parsed sequentially when, in fact, there are many rules that can be combined. We are already looking at shotgun OR statements. In one case, this will reduce about 1,500 individual queries of the form cts:and(this = that, cts:and(this between a and b)) to about 200 by using a sequence for 'that'. In addition, it is likely that the 200 will be further reduced by merging overlapping date ranges. The next level would then be taking this and applying a similar approach across all users if possible.
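To make the shotgun OR collapse concrete, here is a rough sketch of the before/after shape. The element names, values, and dates are all hypothetical stand-ins (the real entitlement terms differ), but the pattern — passing a sequence of values to a single cts:query constructor instead of OR-ing many single-value queries — is the one being described:

```
(: Before: ~1,500 separate queries, one per 'that' value, each ANDed
   with its own date-range test — hypothetical element names :)
cts:or-query((
  cts:and-query((
    cts:element-value-query(xs:QName("that"), "value-1"),
    cts:element-range-query(xs:QName("date"), ">=", xs:date("2014-01-01")),
    cts:element-range-query(xs:QName("date"), "<=", xs:date("2014-03-31"))
  ))
  (: ... ~1,499 more like this ... :)
))

(: After: values that share the same date range are merged into one
   query by passing a sequence of values (the "shotgun OR") :)
cts:and-query((
  cts:element-value-query(xs:QName("that"),
    ("value-1", "value-2", "value-3")),
  cts:element-range-query(xs:QName("date"), ">=", xs:date("2014-01-01")),
  cts:element-range-query(xs:QName("date"), "<=", xs:date("2014-03-31"))
))
```

Each remaining cts:and-query then covers one merged date range; merging overlapping ranges first shrinks the count further, which is the follow-on reduction mentioned above.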
It is also possible to negate the query for those users who have access to almost everything (list what to exclude rather than what to include). The end result would likely be that the majority of users would run just fine even without any registered queries, and we could then register the optimized version only for those that need it. Even this could be automated: every 'x' query calls, monitor the number of terms and/or the query response time and add/remove that user from the list of users that would benefit from registered queries.

*Make our own 'list caches':* This is the last option, which we know should work: pre-process each user and generate lists of URIs that match the search terms (replacing registered queries). We would then update these lists in a smart way depending on what changes when. The benefit is that we should be able to get results directly from the intersection of 2-3 indexes. But it would be large to the point of silliness (over 100 million entries).

In the end, I'm sure that query optimizing (the obvious next step) combined with only registering the users with the most complex result-sets will be sufficient.

Thanks for the tips,
David

On 30 March 2014 20:15, Michael Blakeley <[email protected]> wrote:

> That sounds like "entitlements". A registered query is a pretty good way to represent entitlements. But I wouldn't register all the queries up front. That seems like wasted effort if some of the users never log in. If it impacts update performance then it's wasted effort for every update between startup and the user's first query. Finally the query code still has to be prepared to re-register the query if a search throws XDMP-UNREGISTERED, and that can happen any time. So I think it's better to be lazy about registered queries.
>
> How have you tested the idea that registered queries are causing the update problems? It would be annoying to put a lot of work into entitlements code without fixing that problem.
> For example you could unregister all those queries. Or restarting all the database's forests would have the same effect. Then you could test ingestion again, without any registered queries in play. Given the parallel thread about directory fragments I'd also set file-log-level=Debug and look out for debug-level XDMP-DEADLOCK messages.
>
> Getting back to use of registered queries, my usual approach is to register each query as it's used, something like:
>
>     cts:registered-query(
>       cts:register(query-entitlements($userid)),
>       'unfiltered')
>
> This re-registers the query every time, which guards against XDMP-UNREGISTERED. It is cheap for queries that are already registered.
>
> By registering queries lazily, queries belonging to idle users will tend to fall out of the system. If there's a front-end server that can tell your MarkLogic app-server when users log in or out, you could even preregister queries and deregister queries. However you'd still have to guard against XDMP-UNREGISTERED, as with the above code.
>
> And as you outlined I'd try to simplify the queries if possible. The complexity may be more of a problem than the quantity. One important tool for this is the "shotgun OR": most cts:query constructors support a sequence of values. It might also be worth looking into composable groups of registered queries, so that N users can share a smaller number of registered queries.
>
> -- Mike
>
> On 30 Mar 2014, at 02:13 , David Ennis <[email protected]> wrote:
>
> > Hi Mike,
> >
> > Thanks for the reply. We've just started a consulting job at this client and are unravelling the various levels of the programming. We suspected that registering all of those queries and then hitting the system with 100,000 inserts was bound to bog down the system.
> >
> > Their system had ~50 million items with ~2,000 users having access to certain groups of those 50 million based on a subscription file per user.
> > These subscription files sometimes have 20,000 entries. It appears that early on, they got stuck on how to approach this (we know that they are generating some 'subscription queries' that have thousands of nested cts:and queries, for instance). The solution at the time was to simply register the monstrous queries. This just appears to have compounded the issue by introducing another item causing a bottleneck (the internal maintenance of the queries). So when tuning the original queries, any gain in performance was likely masked by the newer delay (registered queries).
> >
> > Our approach now is likely to abandon their registered queries in favour of a combination of (1) optimizing the original queries (it looks like terms can be boiled down to hundreds instead of thousands) and possibly also generating our own 'smart caches' per user that could be updated in various less-intensive manners.
> >
> > Regards,
> > David Ennis
> >
> > On 29 March 2014 14:58, Michael Blakeley <[email protected]> wrote:
> >
> > Registered queries are smart list-cache entries.
> >
> > You've already deduced that that implies extra work when updates happen, either immediately or when each registered query is next used. With a lot of registered queries it's probably more efficient to do that work with each update, but I haven't noticed that behavior myself.
> >
> > Why pre-register so many queries? As a rule of thumb it isn't worth registering a query unless it will be used 2-3 times. Maybe that should be 2-3 times before the next update, too.
> >
> > -- Mike
> >
> > On 28 Mar 2014, at 22:48 , David Ennis <[email protected]> wrote:
> >
> > > Hi.
> > >
> > > We have a client that has about 4,000 registered queries. These are rather 'large' (taking about 30 minutes to register all of them).
> > >
> > > One of the tests yesterday seems to confirm that ingestion of new content is half as fast when the queries are registered.
> > > Unregistering the queries again increases throughput of the ingestion.
> > >
> > > It should be noted that no queries are being run - they are just sitting registered.
> > >
> > > Can someone explain the inner workings of registered queries? It seems to me that there is some level of maintenance of caches related to these registered queries as new documents are ingested - regardless of the query being used.
> > >
> > > Intuition says that this is likely the case, but I would like to be sure and cannot find enough information to truly support this theory.
> > >
> > > So, do registered queries do something that could be causing quite some overhead to internally maintain them while ingestion is happening?
> > >
> > > Kind Regards,
> > > David
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
