Re: [MarkLogic Dev General] Cts query for element attribute value matching another attribute value?

Geert Josten Mon, 13 Feb 2012 15:13:35 -0800

Danny, Mike,

Thanks to both of you for your suggestions. Let me elaborate on my case a
bit more. Now that the XMLPrague demojam is in the past, I can reveal more
details. ;)

I am processing tweets with links, plain links to internet. Most of them
are shortened. I put tags around the links, and put the shortened link in
an attribute. In my original code I directly continued to resolve the
shortened link to the full one, and put that in a second attribute. But
resolving links can be extremely slow. So I decided to postpone the
resolving, and do that in background processing. I must mention here that
I use a persisted lookup 'table' (a collection of simple key/value kind of
docs), which prevents the need to access internet to resolve the same url
twice. Links that were resolved before, don't need to be postponed. So,
some are resolved, others are not. Those that are not have a short and a
full attribute (both range indexed for facets and such) that are equal. Both
attributes are currently always present. (Perhaps that should change, by
the way.)

Even though this was just something for the demojam, the case got me
intrigued. I initially wrote a bunch of not so well thought-through code to
process incoming tweets, e.g. simply doing all in one request. But pushing
things to the background caused more trouble than I expected. Hence my
mail about the stack button etc. I guess the trouble is caused by the fact
that the urls, shortened ones just as well, can occur multiple times. For
the sake of scalability and performance, I created batches with something
like 100 to 500 tweets for background processing, but guess that same
shortened links could have occurred in different batches, causing the
dead-lock I seemed to have faced. I should mention, that I use the
shortened url as uri for the lookup key/value doc.

Those non-finishing threads were difficult to debug. I had multiple
background processes running (some injecting and enriching tweets, others
resolving urls). Shutting injection down seemed to help, but not solve the
problem entirely. It would be interesting to reproduce the error again, in
an isolated test case, but that won't be easy to find and replicate.
Instead, I decided to push the resolving of urls into a fully separate
background process. Get all available unique values first, make batches of
those, resolve these urls batch-wise (in parallel without trouble), and
update the lookup table with each batch. Subsequent background processes
will pick up tweets with unresolved urls the same way, and insert the
resolved urls from the lookup. Note that urls that have been resolved before
are looked up entirely through value lexicons.

I could also have spawned every individual lookup update. But that sounded
really inefficient, clogging the entire task server. Might make sense
though. Only a relatively low percentage of all tweets that are ingested
contain links that are not yet in the lookup.

Anyhow, the problem with the batch approach was finding the unique values
of unresolved shortened urls. And doing so in a scalable manner. I guess
adding a simple resolved=yes/no attribute makes most sense. Way easiest to
isolate the non-resolved ones. Almost identical to Mike's first
suggestion. Requires just one straight attr-values call with a bit of
pagination algorithm to create batches. The same might be achievable by
dropping the full attributes that are identical to the short attribute,
though the not-query is usually a bit tricky. And I would have to delete
those full attributes. Should be doable from CQ though. (Glad this is just
experimental and not production work!)
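
A minimal sketch of that attr-values call with paging, assuming a
hypothetical element name link, the existing range index on its short
attribute, and a new resolved attribute holding yes or no. The function
name local:next-batch is made up for illustration. The start value of
cts:element-attribute-values is inclusive, so the sketch asks for one
extra value and drops the previous batch's last value.

```xquery
xquery version "1.0-ml";

(: Hypothetical names: element 'link', attributes 'short' and 'resolved'.
   $last is the final value of the previous batch, or () for the first. :)
declare function local:next-batch(
  $last as xs:string?,
  $size as xs:unsignedInt
) as xs:string*
{
  let $values := cts:element-attribute-values(
    xs:QName("link"), xs:QName("short"),
    $last,
    ("ascending", fn:concat("limit=", $size + 1)),
    cts:element-attribute-value-query(
      xs:QName("link"), xs:QName("resolved"), "no"))
  return
    if (fn:exists($last))
    then $values[. ne $last][1 to $size]
    else $values[1 to $size]
};

(: First batch of at most 500 unresolved short urls. :)
local:next-batch((), 500)
```

One caveat: the query constrains fragments, not elements, so a tweet
holding both a resolved and an unresolved link would contribute both short
values to the batch.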

Second best might be the elem-attr-value-query with the nested
elem-attr-values of the other attribute. But that doesn't sound very
scalable. With the current database (300k tweets, 50k containing links,
14k unresolved links), that would mean passing in 14k values to the
value-query. The number of unresolved links would depend on the backlog, but
it could grow at some point. What would happen if it reached 100k, or
beyond?
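
One way to find out would be to measure it. A rough sketch, again assuming
a hypothetical element name link with range-indexed short and full
attributes: it builds the nested query, then reports how many values were
passed in, the estimated number of matching fragments, and the elapsed
time. In this form the inner call returns every distinct short value; it
could be narrowed with a query argument.

```xquery
xquery version "1.0-ml";

(: Hypothetical names: element 'link', attributes 'short' and 'full'. :)
let $shorts := cts:element-attribute-values(
  xs:QName("link"), xs:QName("short"))
let $query := cts:element-attribute-value-query(
  xs:QName("link"), xs:QName("full"), $shorts)
return (
  (: number of values handed to the value-query :)
  fn:count($shorts),
  (: unfiltered, index-only count of matching fragments :)
  xdmp:estimate(cts:search(fn:doc(), $query)),
  (: time spent in this request so far :)
  xdmp:elapsed-time()
)
```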

Not sure about the performance of Danny's approach. I think it would
result in an or-query containing 14k and-queries, currently. Might work,
but quick? ;-) The co-occurrence thing got me thinking though. Could the
co-occurrence functions be of use?
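
A minimal sketch of the co-occurrence idea, assuming a hypothetical element
name link and string range indexes on its short and full attributes.
cts:element-attribute-value-co-occurrences returns pairs of values; keeping
only the pairs whose two values are equal should leave the unresolved short
urls. This has not been checked against real data.

```xquery
xquery version "1.0-ml";

(: 'link' is a hypothetical element name; 'short' and 'full' are the
   two range-indexed attributes described above. :)
for $pair in cts:element-attribute-value-co-occurrences(
  xs:QName("link"), xs:QName("short"),
  xs:QName("link"), xs:QName("full"))
let $short := fn:string($pair/cts:value[1])
let $full := fn:string($pair/cts:value[2])
(: Unresolved links still carry the short url in both attributes. :)
where $short eq $full
return $short
```

By default the pairs are formed from values that occur in the same
fragment, not necessarily on the same element, and every pair is returned
before the filter runs, so the cost would grow with the number of links
per tweet.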

A lot to think about. Thanks so far.

Kind regards,
Geert

> -----Original message-----
> From: [email protected] [mailto:general-
> [email protected]] On behalf of Michael Blakeley
> Sent: Sunday 12 February 2012 19:04
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Cts query for element attribute value
> matching another attribute value?
>
> No easy suggestions, but I have a couple.
>
> If this is a common query and needs to be very efficient, then I would
> create a node that represents the fact that elem/@attr1 eq elem/@attr2.
> Then you could use something like:
>
>     cts:element-attribute-value-query(
>       xs:QName('elem'), xs:QName('attrs-equal'), '1')
>
> If you prefer fewer XML changes and more database round-trips, you could
> put a range index on elem/@attr1 (or elem/@attr2, but let's use attr1 for
> now). Then you could write:
>
>     cts:element-attribute-value-query(
>       xs:QName('elem'), xs:QName('attr2'),
>       cts:element-attribute-values(
>         xs:QName('elem'), xs:QName('attr1')))
>
> You could speed that query up even further by configuring a range index
> on both element-attribute pairs, and then using
> cts:element-attribute-range-query with '=' instead of
> cts:element-attribute-value-query. But the 'attrs-equal' approach will
> still outperform it most of the time, and won't use as much memory or
> disk space.
>
> Why is any of this necessary? Because the cts:query constructors are
> designed to be searchable - that is, to allow efficient index lookups.
> But there is no index that would answer your question directly. There is
> an index entry for every value of //elem/@attr1[. eq VALUE] and for
> //elem/@attr2[. eq VALUE] but no index entry that says
> //elem/@attr1[. eq attr2]. So the most straightforward way to resolve the
> XPath is to retrieve every match for //elem[@attr1][@attr2] and then
> filter in memory to check the @attr1 eq @attr2 portion.
>
> When xdmp:plan says "Path is fully searchable", I think it's only talking
> about any node-name and node-type step(s), not any predicates. When I
> call it on your XPath expression, it shows only one constraint, and I get
> exactly the same plan if I simply drop the predicate. So I think it's
> really only searching on //elem, and ignoring both attributes until the
> filtering phase (that's with 5.0-2).
>
> Compare these plans:
>
> xdmp:plan(//elem[@att1 = @att2]),
> xdmp:plan(//elem),
> xdmp:plan(//elem[@attr1])
> =>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 =
> @att2]),&amp;#13;&amp;#10;xdmp:plan(//elem),&amp;#1...", (), &lt;options
> xmlns="xdmp:eval"&gt;&lt;database&gt;18400529833056734238&lt;/database
> &gt;&lt;root&gt;/Users/mblakele/S...&lt;/options&gt;)</qry:info-trace>
>   <qry:info-trace>Analyzing path: fn:collection()/descendant::elem[@att1 =
> @att2]</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem[@att1 =
> @att2]</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:term-query weight="0">
>       <qry:key>7128167059298760147</qry:key>
>       </qry:term-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 =
> @att2]),&amp;#13;&amp;#10;xdmp:plan(//elem),&amp;#1...", (), &lt;options
> xmlns="xdmp:eval"&gt;&lt;database&gt;18400529833056734238&lt;/database
> &gt;&lt;root&gt;/Users/mblakele/S...&lt;/options&gt;)</qry:info-trace>
>   <qry:info-trace>Analyzing path: fn:collection()/descendant::elem</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:term-query weight="0">
>       <qry:key>7128167059298760147</qry:key>
>       </qry:term-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 =
> @att2]),&amp;#13;&amp;#10;xdmp:plan(//elem),&amp;#1...", (), &lt;options
> xmlns="xdmp:eval"&gt;&lt;database&gt;18400529833056734238&lt;/database
> &gt;&lt;root&gt;/Users/mblakele/S...&lt;/options&gt;)</qry:info-trace>
>   <qry:info-trace>Analyzing path:
> fn:collection()/descendant::elem[@attr1]</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem[@attr1]</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Step 2 predicate 1 contributed 1 constraint: @attr1</qry:info-trace>
>   <qry:partial-plan>
>     <qry:term-query weight="0">
>       <qry:key>11100480210632785569</qry:key>
>     </qry:term-query>
>   </qry:partial-plan>
>   <qry:info-trace>Step 2 predicate 1 contributed 1 constraint: @attr1</qry:info-trace>
>   <qry:partial-plan>
>     <qry:term-query weight="0">
>       <qry:key>11100480210632785569</qry:key>
>     </qry:term-query>
>   </qry:partial-plan>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:or-query>
>       <qry:element-query>
>         <qry:key>7128167059298760147</qry:key>
>         <qry:and-query>
>           <qry:term-query weight="0">
>             <qry:key>11100480210632785569</qry:key>
>           </qry:term-query>
>         </qry:and-query>
>       </qry:element-query>
>       <qry:and-query>
>         <qry:term-query weight="0">
>           <qry:key>11397336598217694489</qry:key>
>         </qry:term-query>
>         <qry:term-query weight="0">
>           <qry:key>7128167059298760147</qry:key>
>         </qry:term-query>
>         <qry:term-query weight="0">
>           <qry:key>11100480210632785569</qry:key>
>         </qry:term-query>
>       </qry:and-query>
>       </qry:or-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
>
> If attr1 or attr2 aren't ubiquitous, then I think the efficient way to
> write the XPath might be this odd-looking expression:
>
>     //elem[@attr1][@attr2][@attr1 = @attr2]
>
> -- Mike
>
> On 12 Feb 2012, at 01:16 , Geert Josten wrote:
>
> > I am trying to isolate some specific element with two attributes whose
> > values are equal. I know I can use an expression like
> > doc()//elem[@att1 = @att2], which is even fully searchable according to
> > xdmp:plan, but I'd prefer a cts:query, which I could pass into
> > cts:element-attribute-values. I need unique values, and I am trying to
> > avoid using distinct-values on the above XPath expression.
> >
> > Any suggestions?
> >
> > Kind regards,
> > Geert
> >
> > drs. G.P.H. (Geert) Josten
> > Senior Developer
> >
> >
> >
> > Dayon B.V.
> > Delftechpark 37b
> > 2628 XJ Delft
> >
> > T +31 (0)88 26 82 570
> >
> > [email protected]
> > www.dayon.nl
> >
> > The information sent in or with this e-mail message originates from
> > Dayon BV and is intended solely for the addressee. If you have received
> > this message unintentionally, we request that you delete it. No rights
> > can be derived from this message.
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general
> >
>
