Thanks Damon,

Good to know that the sequence slice method will be optimised.  I think 
I will do it that way to start with and see how it goes.

I'm not sure of the advantages of using xdmp:spawn(), though - I've 
never used it before.  Since I will be creating documents as I go (one 
per lexicon value), is this something I should be wary of?  The docs say:

     "use care or preferably avoid calling xdmp:spawn from a module that 
is performing an update transaction."

I was thinking of just having a controlling Java process which passes 
the start and end values to the (update) query, incrementing them for 
each invocation.  The query would return the number of values processed, 
and I would stop sending queries once it reported that none were (i.e. 
the slice returned the empty sequence).  Does that sound reasonable?
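To make the loop concrete, here is a minimal, self-contained Java sketch of that control flow.  The in-memory value list and fetchAndProcessBatch() are stand-ins (in the real version the method would submit the update query over XCC with $start/$end as external variables and return its count); the names are illustrative, not part of any MarkLogic API:

```java
import java.util.Arrays;
import java.util.List;

public class LexiconBatcher {

    // Stand-in for the lexicon; in the real version the values live in
    // MarkLogic and are paged with (cts:element-values(...))[$start to $end].
    static final List<String> ALL_VALUES =
            Arrays.asList("a", "b", "c", "d", "e", "f", "g");

    // Stand-in for submitting the update query over XCC; returns the
    // number of values actually processed in the window [start, end].
    static int fetchAndProcessBatch(int start, int end) {
        int processed = 0;
        for (int i = start - 1; i < end && i < ALL_VALUES.size(); i++) {
            // real version: compute the stats for ALL_VALUES.get(i)
            // and insert one stats document
            processed++;
        }
        return processed;
    }

    // The controlling loop: send fixed-size blocks, advance the window,
    // and stop as soon as the query reports that no values were processed.
    static int runAll(int batchSize) {
        int total = 0;
        int start = 1;
        while (true) {
            int processed = fetchAndProcessBatch(start, start + batchSize - 1);
            if (processed == 0) {
                break; // past the end of the lexicon
            }
            total += processed;
            start += batchSize;
        }
        return total;
    }
}
```

The only contract between the controller and the query is the processed count, so the controller never needs to know the total up front; it simply stops on the first zero.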

Many thanks also to all who responded for your suggestions.
-Steve

On 02/06/2011 14:27, Damon Feldman wrote:
> Steve,
>
> xdmp:spawn() with a high task server queue size will work fine. You could 
> also use CORB, which is a java utility.
>
> As for your existing approach, cts:element-values[$start to $end] will work 
> fine and return an empty sequence past the end of the values, and will be 
> optimized. To get the total number you can count them, since this is a 
> lexicon-only function and returns from the indexes without much overhead - no 
> estimate is necessary.
>
> Yours,
> Damon
> ________________________________________
> From: [email protected] 
> [[email protected]] On Behalf Of Steve Mallen 
> [[email protected]]
> Sent: Thursday, June 02, 2011 6:32 AM
> To: General MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] Efficient iterating over values from a large 
> lexicon
>
> Hi all,
>
> I'm having problems processing a large lexicon of values and wondered if
> anyone had done something similar or had any ideas of how best to deal
> with them.
>
> Basically, I've got a set of several million distinct values, and I want
> to precompute a bunch of statistics for each of them (so that I can then
> facet/sort values on the computed statistic).  So, my plan is to fetch
> all the values from the lexicon (storing them in a temp file, say), and
> then run an XQuery on each value and store the resulting information in
> a document (i.e. one stat document per value).  I cannot do this in a
> single query as it would take far too long to iterate over all values
> and for all the computations and inserts.
>
> But I can't seem to figure out the best way of fetching and iterating
> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
> values).  In SQL, I'd use a CURSOR to fetch the values one by one, and
> then close the cursor at the end.  There doesn't seem to be an analogous
> concept in XQuery or XCC.  I've tried something along the following lines:
>
>       (cts:element-values( xs:QName($lexi) ))[$start to $end]
>
> and fetching the values in blocks until I run out of values but I'm
> worried that this isn't very efficient, and I've got this nagging doubt
> that the above will never return the empty sequence when $start is past
> the end of the values.  I'm not even sure how I should get a count of
> the number of distinct values (xdmp:estimate doesn't work on the result
> of cts:element-values()).
>
> So - do you guys know of a way of efficiently iterating over a large set
> of lexicon values without timing out the query on the server?
>
> If I'm missing an obvious solution, please let me know...
>
> -Steve
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
