Danny, Mike,

Thanks to both of you for your suggestions. Let me elaborate on my case a bit more. Now that the XMLPrague demojam is in the past, I can reveal more details. ;)

I am processing tweets with links, plain links to the internet. Most of them are shortened. I put tags around the links and put the shortened link in an attribute. In my original code I immediately went on to resolve the shortened link to the full one and put that in a second attribute. But resolving links can be extremely slow, so I decided to postpone the resolving and do it in background processing.

I must mention here that I use a persisted lookup 'table' (a collection of simple key/value kind of docs), which avoids having to access the internet to resolve the same url twice. Links that were resolved before don't need to be postponed. So, some are resolved, others are not. The unresolved ones have a short and a full attribute (both range indexed for facets and such) with equal values. Both attributes are currently always present. (Perhaps that should change, by the way.)

Even though this was just something for the demojam, the case got me intrigued. I initially wrote a bunch of not so well thought-through code to process incoming tweets, e.g. simply doing everything in one request. But pushing things to the background caused more trouble than I expected. Hence my mail about the stack button etc.

I guess the trouble is caused by the fact that the urls, shortened ones just as well, can occur multiple times. For the sake of scalability and performance, I created batches of something like 100 to 500 tweets for background processing, but I guess the same shortened links could have occurred in different batches, causing the deadlock I seem to have faced. I should mention that I use the shortened url as the uri for the lookup key/value doc. Those non-finishing threads were difficult to debug. I had multiple background processes running (some injecting and enriching tweets, others resolving urls). Shutting injection down seemed to help, but did not solve the problem entirely.

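As an illustration of the lookup 'table' described above, here is a minimal sketch of how a single lookup could be read. The only details taken from the description are that the shortened url itself serves as the document uri and that an unresolved link keeps a full value equal to the short one. The <entry> element and its full attribute are a made-up layout, purely an assumption for the example.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumed layout, not the actual application code:
   each lookup doc has the shortened url as its uri and looks like
   <entry full="http://the/resolved/url"/>. :)
declare function local:full-url($short as xs:string) as xs:string
{
  (: fn:doc returns the empty sequence when no doc exists at that uri :)
  let $full := fn:doc($short)/entry/@full
  return
    if (fn:exists($full))
    then fn:string($full)
    (: not resolved yet: fall back to the short url, so short = full :)
    else $short
};

(: hypothetical shortened url, for illustration only :)
local:full-url("http://t.co/example")
```

Because every lookup doc lives at a uri derived from the short url, two tasks that handle the same short url will try to update the same document, which fits the guess above about identical links ending up in different batches.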
It would be interesting to reproduce the error in an isolated test case, but it won't be easy to track down and replicate. Instead, I decided to push the resolving of urls into a fully separate background process: get all available unique values first, make batches of those, resolve these urls batch-wise (in parallel without trouble), and update the lookup table with each batch. Subsequent background processes will pick up tweets with unresolved urls the same way, and insert the resolved urls from the lookup. Note that urls that have been resolved before are looked up entirely through value lexicons.

I could also have spawned every individual lookup update. But that sounded really inefficient, clogging the entire task server. It might make sense though: only a relatively low percentage of all ingested tweets contain links that are not yet in the lookup.

Anyhow, the problem with the batch approach was finding the unique values of unresolved shortened urls, and doing so in a scalable manner.

I guess adding a simple resolved=yes/no attribute makes most sense. It is by far the easiest way to isolate the non-resolved ones, and almost identical to Mike's first suggestion. It requires just one straight attr-values call with a bit of pagination to create batches.

The same might be achievable by dropping the full attributes that are identical to the short attribute, though the not-query is usually a bit tricky. And I would have to delete those full attributes. Should be doable from CQ though. (Glad this is just experimental and not production work!)

Second best might be the elem-attr-value-query with the nested elem-attr-values of the other attribute. But that doesn't sound very scalable. With the current database (300k tweets, 50k containing links, 14k unresolved links), that would mean passing 14k values into the value-query. The number of unresolved links depends on the backlog, but it could grow at some point. What would happen if it reached 100k, or beyond?

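The resolved=yes/no idea with "one straight attr-values call" and some batching could be sketched roughly as follows. All names are assumptions for the sake of the example: the element name link, the attributes short and resolved, and the module path /resolve-batch.xqy are hypothetical. A range index on the short attribute is assumed, as mentioned earlier.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumed names, none taken from the actual application:
   element <link>, attributes @short and @resolved, and the module
   /resolve-batch.xqy, which would have to declare
   "declare variable $urls as xs:string external;". :)
declare variable $batch-size as xs:integer := 500;

(: Unique shortened urls taken from fragments that contain
   at least one link marked resolved="no". :)
let $unresolved :=
  cts:element-attribute-values(
    xs:QName("link"), xs:QName("short"), (), (),
    cts:element-attribute-value-query(
      xs:QName("link"), xs:QName("resolved"), "no"))
let $batch-count :=
  xs:integer(fn:ceiling(fn:count($unresolved) div $batch-size))
for $i in 1 to $batch-count
let $batch :=
  fn:subsequence($unresolved, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: Hand each batch to the task server as one space-separated string,
     assuming the urls themselves contain no literal spaces. :)
  xdmp:spawn("/resolve-batch.xqy",
    (xs:QName("urls"), fn:string-join($batch, " ")))
```

Two caveats with this sketch. The query argument of the lexicon call selects fragments, not individual elements, so a tweet holding both a resolved and an unresolved link would contribute both short values. And fetching all values at once is fine for something like 14k urls; for a much larger backlog, the start argument together with a "limit=N" option on the same call could page through the lexicon instead.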
Not sure about the performance of Danny's approach. I think it would result in an or-query containing 14k and-queries, currently. Might work, but quick? ;-) The co-occurrence thing got me thinking though. Could the co-occurrence functions be of use?

A lot to think about. Thanks so far.

Kind regards,
Geert

> -----Original message-----
> From: [email protected] [mailto:general-[email protected]] On behalf of Michael Blakeley
> Sent: Sunday 12 February 2012 19:04
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Cts query for element attribute value matching another attribute value?
>
> No easy suggestions, but I have a couple.
>
> If this is a common query and needs to be very efficient, then I would create a
> node that represents the fact that elem/@attr1 eq elem/@attr2. Then you could
> use something like:
>
> cts:element-attribute-value-query(xs:QName('elem'), xs:QName('attrs-equal'),
>   '1').
>
> If you prefer fewer XML changes and more database round-trips, you could put
> a range index on elem/@attr1 (or elem/@attr2, but let's use attr1 for now).
> Then you could write:
>
> cts:element-attribute-value-query(
>   xs:QName('elem'), xs:QName('attr2'),
>   cts:element-attribute-values(
>     xs:QName('elem'), xs:QName('attr1')))
>
> You could speed that query up even further by configuring a range index on both
> element-attribute pairs, and then using cts:element-attribute-range-query with
> '=' instead of cts:element-attribute-value-query. But the 'attrs-equal' approach
> will still outperform it most of the time, and won't use as much memory or disk
> space.
>
> Why is any of this necessary? Because the cts:query constructors are designed
> to be searchable - that is, to allow efficient index lookups. But there is no index
> that would answer your question directly. There is an index entry for every value
> of //elem/@attr1[. eq VALUE] and for //elem/@attr2[. eq VALUE] but no index
> entry that says //elem/@attr1[. eq attr2]. So the most straightforward way to
> resolve the XPath is to retrieve every match for //elem[@attr1][@attr2] and
> then filter in memory to check the @attr1 eq @attr2 portion.
>
> When xdmp:plan says "Path is fully searchable", I think it's only talking about any
> node-name and node-type step(s), not any predicates. When I call it on your
> XPath expression, it shows only one constraint, and I get exactly the same plan if
> I simply drop the predicate. So I think it's really only searching on //elem, and
> ignoring both attributes until the filtering phase (that's with 5.0-2).
>
> Compare these plans:
>
> xdmp:plan(//elem[@att1 = @att2]),
> xdmp:plan(//elem),
> xdmp:plan(//elem[@attr1])
> =>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 = @att2]),&#13;&#10;xdmp:plan(//elem),&#1...", (), <options xmlns="xdmp:eval"><database>18400529833056734238</database><root>/Users/mblakele/S...</options>)</qry:info-trace>
>   <qry:info-trace>Analyzing path: fn:collection()/descendant::elem[@att1 = @att2]</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem[@att1 = @att2]</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:term-query weight="0">
>         <qry:key>7128167059298760147</qry:key>
>       </qry:term-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 = @att2]),&#13;&#10;xdmp:plan(//elem),&#1...", (), <options xmlns="xdmp:eval"><database>18400529833056734238</database><root>/Users/mblakele/S...</options>)</qry:info-trace>
>   <qry:info-trace>Analyzing path: fn:collection()/descendant::elem</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:term-query weight="0">
>         <qry:key>7128167059298760147</qry:key>
>       </qry:term-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>   <qry:info-trace>xdmp:eval("xdmp:plan(//elem[@att1 = @att2]),&#13;&#10;xdmp:plan(//elem),&#1...", (), <options xmlns="xdmp:eval"><database>18400529833056734238</database><root>/Users/mblakele/S...</options>)</qry:info-trace>
>   <qry:info-trace>Analyzing path: fn:collection()/descendant::elem[@attr1]</qry:info-trace>
>   <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
>   <qry:info-trace>Step 2 is searchable: descendant::elem[@attr1]</qry:info-trace>
>   <qry:info-trace>Path is fully searchable.</qry:info-trace>
>   <qry:info-trace>Gathering constraints.</qry:info-trace>
>   <qry:info-trace>Step 2 predicate 1 contributed 1 constraint: @attr1</qry:info-trace>
>   <qry:partial-plan>
>     <qry:term-query weight="0">
>       <qry:key>11100480210632785569</qry:key>
>     </qry:term-query>
>   </qry:partial-plan>
>   <qry:info-trace>Step 2 predicate 1 contributed 1 constraint: @attr1</qry:info-trace>
>   <qry:partial-plan>
>     <qry:term-query weight="0">
>       <qry:key>11100480210632785569</qry:key>
>     </qry:term-query>
>   </qry:partial-plan>
>   <qry:info-trace>Executing search.</qry:info-trace>
>   <qry:final-plan>
>     <qry:and-query>
>       <qry:or-query>
>         <qry:element-query>
>           <qry:key>7128167059298760147</qry:key>
>           <qry:and-query>
>             <qry:term-query weight="0">
>               <qry:key>11100480210632785569</qry:key>
>             </qry:term-query>
>           </qry:and-query>
>         </qry:element-query>
>         <qry:and-query>
>           <qry:term-query weight="0">
>             <qry:key>11397336598217694489</qry:key>
>           </qry:term-query>
>           <qry:term-query weight="0">
>             <qry:key>7128167059298760147</qry:key>
>           </qry:term-query>
>           <qry:term-query weight="0">
>             <qry:key>11100480210632785569</qry:key>
>           </qry:term-query>
>         </qry:and-query>
>       </qry:or-query>
>     </qry:and-query>
>   </qry:final-plan>
>   <qry:info-trace>Selected 0 fragments</qry:info-trace>
>   <qry:result estimate="0"/>
> </qry:query-plan>
>
> If attr1 or attr2 aren't ubiquitous, then I think the efficient way to write the
> XPath might be this odd-looking expression:
>
> //elem[@attr1][@attr2][@att1 = @att2]
>
> -- Mike
>
> On 12 Feb 2012, at 01:16 , Geert Josten wrote:
>
> > I am trying to isolate some specific element with two attributes whose
> > values are equal. I know I can use an expression like
> > doc()//elem[@att1 = @att2], which is even fully searchable according to
> > xdmp:plan, but I'd prefer a cts:query, which I could pass into
> > cts:element-attribute-values. I need unique values, and I am trying to
> > prevent using distinct-values on the above XPath expression..
> >
> > Any suggestions?
> >
> > Kind regards,
> > Geert
> >
> > drs. G.P.H. (Geert) Josten
> > Senior Developer
> >
> > Dayon B.V.
> > Delftechpark 37b
> > 2628 XJ Delft
> >
> > T +31 (0)88 26 82 570
> >
> > [email protected]
> > www.dayon.nl
> >
> > The information sent in or with this e-mail message originates from
> > Dayon BV and is intended solely for the addressee. If you have received
> > this message unintentionally, we request that you delete it. No rights
> > can be derived from this message.

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
