Thanks Damon,
Good to know that the sequence slice method will be optimised. I think
I will do it that way to start with and see how it goes.
I'm not sure of the advantages of using xdmp:spawn(), though; I've
never used it before. Since I will be creating documents as I go (one
per lexicon value), is this something I should be wary of? The docs say:
"use care or preferably avoid calling xdmp:spawn from a module that
is performing an update transaction."
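For reference, a spawn-based approach might look something like the
sketch below (the module path, element name, and batch size are all
hypothetical). Note that the controlling query itself is read-only; each
spawned task does its own updates in a separate transaction, so the
caution about spawning from within an update transaction wouldn't apply
to the controller:

```xquery
xquery version "1.0-ml";
(: Hypothetical controller: spawns one task per batch of lexicon values.
   "myElement", the module path, and the batch size are placeholders. :)
let $total := fn:count(cts:element-values(xs:QName("myElement")))
let $batch-size := 1000
for $start in (1 to $total)[. mod $batch-size = 1]
return
  xdmp:spawn(
    "/modules/process-batch.xqy",
    (xs:QName("start"), $start,
     xs:QName("end"),   $start + $batch-size - 1))
```

Each batch then sits in the task server queue and runs as its own
update transaction, which is where the queue-size setting Damon
mentions comes in.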
I was thinking of just having a controlling Java process which passed
the start and end values to the (update) query, incrementing the values
for each invocation. I would return the number of values processed from
the query, and stop sending queries once I received an empty sequence.
Does that sound reasonable?
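Concretely, the update query driven by the Java controller might look
something like this sketch (the element name, stats computation, and
document-URI scheme are placeholders, not my actual code):

```xquery
xquery version "1.0-ml";
(: Hypothetical batch module: $start and $end are supplied as external
   variables by the controlling Java process via XCC. :)
declare variable $start as xs:integer external;
declare variable $end   as xs:integer external;

let $values := (cts:element-values(xs:QName("myElement")))[$start to $end]
return (
  for $v in $values
  (: placeholder stat: number of documents containing this value :)
  let $stat := xdmp:estimate(
                 cts:search(fn:collection(),
                   cts:element-value-query(xs:QName("myElement"), $v)))
  return
    xdmp:document-insert(
      fn:concat("/stats/", fn:encode-for-uri(fn:string($v)), ".xml"),
      <stat><value>{$v}</value><count>{$stat}</count></stat>),
  (: the controller stops issuing queries when this returns 0 :)
  fn:count($values)
)
```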
Many thanks also to all who responded for your suggestions.
-Steve
On 02/06/2011 14:27, Damon Feldman wrote:
> Steve,
>
> xdmp:spawn() with a high task server queue size will work fine. You could
> also use CORB, which is a Java utility.
>
> As for your existing approach, cts:element-values()[$start to $end] will work
> fine, will return an empty sequence past the end of the values, and will be
> optimized. To get the total number you can count the values: since this is a
> lexicon-only function it returns from the indexes with little overhead, so
> no estimate is necessary.
>
> Yours,
> Damon
> ________________________________________
> From: [email protected]
> [[email protected]] On Behalf Of Steve Mallen
> [[email protected]]
> Sent: Thursday, June 02, 2011 6:32 AM
> To: General MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] Efficient iterating over values from a large
> lexicon
>
> Hi all,
>
> I'm having problems processing a large lexicon of values and wondered if
> anyone has done something similar or has any ideas on how best to deal
> with them.
>
> Basically, I've got a set of several million distinct values, and I want
> to precompute a bunch of statistics for each of them (so that I can then
> facet/sort values on the computed statistic). So, my plan is to fetch
> all the values from the lexicon (storing them in a temp file, say), and
> then run an XQuery on each value and store the resulting information in
> a document (i.e. one stat document per value). I cannot do this in a
> single query, as it would take far too long to iterate over all the values
> and perform all the computations and inserts.
>
> But I can't seem to figure out the best way of fetching and iterating
> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
> values). In SQL, I'd use a CURSOR to fetch the values one by one, and
> then close the cursor at the end. There doesn't seem to be an analogous
> concept in XQuery or XCC. I've tried something along the following lines:
>
> (cts:element-values( xs:QName($lexi) ))[$start to $end]
>
> and fetching the values in blocks until I run out of values, but I'm
> worried that this isn't very efficient, and I've got a nagging doubt
> that the above will never return the empty sequence once $start is past
> the end of the values. I'm not even sure how I should get a count of
> the number of distinct values (xdmp:estimate doesn't work on the result
> of cts:element-values()).
>
> So - do you guys know of a way of efficiently iterating over a large set
> of lexicon values without timing out the query on the server?
>
> If I'm missing an obvious solution, please let me know...
>
> -Steve
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general