Re: [MarkLogic Dev General] Efficient iterating over values from a large lexicon

Steve Mallen Thu, 02 Jun 2011 07:47:52 -0700

Hi Nuno,

Unfortunately, I don't think it's that simple. I have a set of documents, each of which contains a single integer statistic (let's call it "number of sales" for the sake of argument). This datum is not computed by our system, but is provided by an external party and matched to each document. Each document (item) also contains metadata about the item, such as title, color, flavour etc. These have been put into lexicons for fast faceting of search results.

But we now have a new requirement whereby we want to know the *total* number of sales for each facet, and show the top ranking (highest total sales) colors, titles, etc. So I effectively need to order by the sum of all "number of sales" of all items matching a facet. As far as I know there is no way to facet on a computed value. To add to the problem, some of the lexicons have millions of distinct values.

Therefore the only solution I can think of is to iterate over all distinct values and pre-compute these values. I can then add a range index on the computed value and order by this sum.
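As a rough sketch of that pre-computation (not a tested implementation), the query below processes one block of facet values and sums the sales for each. The element names "color" and "sales", the block size, and the $resume-after variable are all assumptions made for illustration; both elements are assumed to have element range indexes, and each document is assumed to hold exactly one sales value in a single fragment, so the default fragment frequency counts items.

```xquery
xquery version "1.0-ml";

(: Hypothetical names: "color" is the facet element, "sales" the per-item
   statistic. Both are assumed to have element range indexes. :)
declare variable $facet := xs:QName("color");
declare variable $sales := xs:QName("sales");

(: Resume point: () for the first block, otherwise the last value
   handled by the previous block. :)
declare variable $resume-after as xs:string? := ();
declare variable $block-size := 1000;

let $values :=
  if (empty($resume-after))
  then cts:element-values($facet, (), concat("limit=", $block-size))
  else
    (: The lexicon call returns values starting at $resume-after itself,
       so ask for one extra value and drop the one already processed. :)
    cts:element-values($facet, $resume-after,
      concat("limit=", $block-size + 1))[. ne $resume-after]
for $v in $values
let $total :=
  sum(
    (: Each distinct sales figure among the items matching this facet
       value, weighted by how many fragments carry that figure. :)
    for $n in cts:element-values($sales, (), (),
                cts:element-range-query($facet, "=", $v))
    return $n * cts:frequency($n)
  )
return
  <stat><value>{$v}</value><total-sales>{$total}</total-sales></stat>
```

The caller would feed the last value of each block back in as $resume-after and stop when a block comes back empty. Writing the totals to stat documents would be a separate update step.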


Hope I have explained this clearly...

-Steve

On 02/06/2011 15:35, Nuno Job wrote:

An example of what Michael said would be to use cts:element-values with the "frequency-order" option, together with cts:frequency. You might be assuming we can't do something that we are perfectly optimized to do.
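A minimal sketch of that suggestion, assuming a hypothetical "color" element with an element range index: it lists the ten most frequent values with their counts. Note that this ranks values by how many fragments contain them, not by a sum of some other statistic.

```xquery
xquery version "1.0-ml";

(: "color" is a hypothetical element name; it needs an element range index. :)
for $v in cts:element-values(
            xs:QName("color"), (),
            ("frequency-order", "descending", "limit=10"))
return
  (: cts:frequency reports how many fragments contain this value. :)
  <color count="{cts:frequency($v)}">{$v}</color>
```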
If that's not the case, the recommendations on spawn and corb make sense. As for spawn with an update: just remember that a spawn won't be rolled back, and that might be why the documentation says that. Ideally you should have a query statement that spawns update statements.
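The "query statement that spawns update statements" pattern could look roughly like the sketch below. The module path "/tasks/compute-stat.xqy" and its external variable $value are hypothetical; that module would be the update statement that computes and stores one stat document. The "color" element name and the block size of 1000 are likewise assumptions.

```xquery
xquery version "1.0-ml";

(: Read-only statement: it makes no updates itself, it only queues tasks.
   "/tasks/compute-stat.xqy" is a hypothetical update module that declares
   an external variable $value and writes one stat document for it. :)
for $v in cts:element-values(xs:QName("color"), (), "limit=1000")
return
  xdmp:spawn("/tasks/compute-stat.xqy", (xs:QName("value"), $v))
```

Each spawned task runs as its own transaction on the task server, so a failure in one value does not undo the others. Queuing one bounded block at a time avoids flooding the task queue when the lexicon holds millions of values.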
Makes sense?

Nuno
On Jun 2, 2011 3:06 PM, "Michael Blakeley" <[email protected]> wrote:
> Steve, the suggestions related to spawn and limits are good, but you might want to back up and reconsider the problem. Naturally you know the details of your problem best, but it might be possible to do some or all of the work more efficiently using existing product features.
>
> -- Mike
>
> On 2 Jun 2011, at 03:32 , Steve Mallen wrote:
>
>> Hi all,
>>
>> I'm having problems processing a large lexicon of values and wondered if
>> anyone had done something similar or had any ideas of how best to deal
>> with them.
>>
>> Basically, I've got a set of several million distinct values, and I want
>> to precompute a bunch of statistics for each of them (so that I can then
>> facet/sort values on the computed statistic). So, my plan is to fetch
>> all the values from the lexicon (storing them in a temp file, say), and
>> then run an XQuery on each value and store the resulting information in
>> a document (i.e. one stat document per value). I cannot do this in a
>> single query as it would take far too long to iterate over all values
>> and for all the computations and inserts.
>>
>> But I can't seem to figure out the best way of fetching and iterating
>> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
>> values). In SQL, I'd use a CURSOR to fetch the values one by one, and
>> then close the cursor at the end. There doesn't seem to be an analogous
>> concept in XQuery or XCC. I've tried something along the following lines:
>>
>> (cts:element-values( xs:QName(lexi) ))[$start to $end]
>>
>> and fetching the values in blocks until I run out of values but I'm
>> worried that this isn't very efficient, and I've got this nagging doubt
>> that the above will never return the empty sequence when $start is past
>> the end of the values. I'm not even sure how I should get a count of
>> the number of distinct values (xdmp:estimate doesn't work on the result
>> of cts:element-values()).
>>
>> So - do you guys know of a way of efficiently iterating over a large set
>> of lexicon values without timing out the query on the server?
>>
>> If I'm missing an obvious solution, please let me know...
>>
>> -Steve
>>
>> _______________________________________________
>> General mailing list
>> [email protected]<mailto:[email protected]>
>> http://developer.marklogic.com/mailman/listinfo/general
>>

