Re: [MarkLogic Dev General] Efficient iterating over values from a large lexicon

Damon Feldman Thu, 02 Jun 2011 10:39:59 -0700

Steve,

Yes, that's how it should work.  [1 to 100] is optimized like limit=100. 
However, [1000000 to 1000100] needs to do more work. Your use of "Z" to start 
at a particular value in the lexicon is accomplished by a binary search over 
the in-memory sorted list that is the lexicon, so it is fast.


Your examples are a bit apples-to-oranges because

   cts:element-values( xs:QName('lexi'), ("Z"), "limit=1000" )

should really be compared to

    cts:element-values( xs:QName('lexi'), ("Z"))[1 to 1000]

rather than [1700000 to 1701000].
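
One way to turn the start-value approach into something like a cursor is to 
pass the last value of the previous block as the start of the next one. This 
is only a sketch, not tested against your data: $last and $page-size are 
hypothetical external variables, it assumes a string lexicon on 'lexi' whose 
collation matches the default, and it treats the empty string as "first page".

```xquery
xquery version "1.0-ml";

(: Hypothetical inputs: $last is the final value of the previous block
   ("" for the first block), $page-size is the number of values per block. :)
declare variable $last as xs:string external;
declare variable $page-size as xs:integer external;

if ($last eq "") then
  (: First block: no start value, just a limit. :)
  cts:element-values(xs:QName("lexi"), (), fn:concat("limit=", $page-size))
else
  (: The start value is inclusive, so ask for one extra value and
     drop the one that was already returned in the previous block. :)
  cts:element-values(
    xs:QName("lexi"), $last, fn:concat("limit=", $page-size + 1)
  )[. ne $last]
```

The caller repeats the query with the last value returned until a block comes 
back empty. Each request then starts from a lexicon lookup rather than from a 
large positional offset.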

For 1.7MM+ items, I'd definitely consider CORB, but let us know if the task 
server approach works. Either the task server or CORB will run multithreaded, 
which is a benefit if you have multiple cores.



The warning in the docs about xdmp:spawn() and updates is just to say that once 
spawned, you have a separate, asynchronous transaction that can't be rolled 
back, so you should be aware of that. The other issue is that if the server is 
shut down during the overall batch, the queued tasks will be lost.
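
As a rough illustration of the task server approach, the sketch below queues 
one task per lexicon value in a block. The module path "/compute-stats.xqy" 
and its external variable "value" are made-up names for whatever module 
computes and inserts your stat document; the "Z" start and limit are taken 
from your example.

```xquery
xquery version "1.0-ml";

(: Queue one task per value in this block. Each spawned task runs later
   on the task server in its own transaction, so a failure or rollback
   here does not undo tasks that were already queued. :)
for $value in cts:element-values(xs:QName("lexi"), "Z", "limit=1000")
return
  xdmp:spawn(
    "/compute-stats.xqy",            (: hypothetical module :)
    (xs:QName("value"), $value)      (: bound to its external variable :)
  )
```

The task server queue size needs to be large enough to hold the tasks you 
spawn in one go, so spawning block by block is likely easier to manage than 
queuing all 1.7MM values at once.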



Yours,

Damon


________________________________________
From: Steve Mallen [[email protected]]
Sent: Thursday, June 02, 2011 10:37 AM
To: Damon Feldman
Cc: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Efficient iterating over values from a 
large lexicon

Hi Damon,

Having run a couple of tests, it seems that doing [$start to $end] is in
fact much slower than using limit from a start value.  Running:

     cts:element-values( xs:QName('lexi'), ("Z"), "limit=1000" )

takes 0.003 seconds

while

     cts:element-values( xs:QName('lexi') )[1700000 to 1701000]

takes 4 seconds.

So it seems that the second query is not optimised and in fact loads all
the values into memory first before doing the array slice.

Is this what you expect?

Cheers,
-Steve

On 02/06/2011 14:27, Damon Feldman wrote:

> Steve,
>
> xdmp:spawn() with a high task server queue size will work fine. You could 
> also use CORB, which is a Java utility.
>
> As for your existing approach, cts:element-values[$start to $end] will work 
> fine and return an empty sequence past the end of the values, and will be 
> optimized. To get the total number you can count them, since this is a 
> lexicon-only function and returns from the indexes without much overhead - no 
> estimate is necessary.
>
> Yours,
> Damon
> ________________________________________
> From: [email protected] 
> [[email protected]] On Behalf Of Steve Mallen 
> [[email protected]]
> Sent: Thursday, June 02, 2011 6:32 AM
> To: General MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] Efficient iterating over values from a large 
> lexicon
>
> Hi all,
>
> I'm having problems processing a large lexicon of values and wondered if
> anyone had done something similar or had any ideas of how best to deal
> with them.
>
> Basically, I've got a set of several million distinct values, and I want
> to precompute a bunch of statistics for each of them (so that I can then
> facet/sort values on the computed statistic).  So, my plan is to fetch
> all the values from the lexicon (storing them in a temp file, say), and
> then run an XQuery on each value and store the resulting information in
> a document (i.e. one stat document per value).  I cannot do this in a
> single query as it would take far too long to iterate over all values
> and for all the computations and inserts.
>
> But I can't seem to figure out the best way of fetching and iterating
> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
> values).  In SQL, I'd use a CURSOR to fetch the values one by one, and
> then close the cursor at the end.  There doesn't seem to be an analogous
> concept in XQuery or XCC.  I've tried something along the following lines:
>
>       (cts:element-values( xs:QName('lexi') ))[$start to $end]
>
> and fetching the values in blocks until I run out of values but I'm
> worried that this isn't very efficient, and I've got this nagging doubt
> that the above will never return the empty sequence when $start is past
> the end of the values.  I'm not even sure how I should get a count of
> the number of distinct values (xdmp:estimate doesn't work on the result
> of cts:element-values()).
>
> So - do you guys know of a way of efficiently iterating over a large set
> of lexicon values without timing out the query on the server?
>
> If I'm missing an obvious solution, please let me know...
>
> -Steve
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general

