Steve, that makes sense to me. To use a range index on that sum, you would
have to put its value into the database in an XML element or attribute. You
could do that, and you could move that work to ingestion time using a CPF
pipeline. The pipeline would use triggers to update the relevant
meta-documents or summary documents for whatever values are in each update.
That would turn each summary document into a serialization point, but that
might be OK depending on your performance requirements and the frequency of
updates.
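As a rough sketch of that trigger approach (the element names, the summary-doc
URI scheme, and the frequency-weighted sum below are all illustrative
assumptions, not your actual schema):

(: illustrative trigger action module - adjust names to your schema :)
xquery version "1.0-ml";
declare namespace trgr = "http://marklogic.com/xdmp/triggers";
declare variable $trgr:uri as xs:string external ;

let $color := fn:string(fn:doc($trgr:uri)//color)
(: sum sales across all items with this color,
 : weighting each distinct lexicon value by its frequency :)
let $total := fn:sum(
  for $v in cts:element-values(
    xs:QName("sales"), (), (),
    cts:element-value-query(xs:QName("color"), $color))
  return $v * cts:frequency($v))
return xdmp:document-insert(
  fn:concat("/summaries/color/", $color, ".xml"),
  element summary {
    element color { $color },
    element total-sales { $total } })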
But if you stay on the path you are on now, I would recommend using xdmp:spawn.
That will let your code take better advantage of parallel processing. I would
structure this using at least two modules: one to process a single batch of N
values, and one to spawn off M batches of N values each. Here's an outline;
you could tweak the batch size and the element-values call to suit your needs.
You might need a larger batch size to avoid recursion limits, for example.
(: list all values and recursively spawn batches :)
declare variable $SIZE := 100 ;
(: placeholders - point these at your batch module and spawn options :)
declare variable $MODULE-BATCH := "/batch.xqy" ;
declare variable $OPTIONS := <options xmlns="xdmp:spawn"/> ;

declare function local:spawn(
  $values as xs:string* )
as empty-sequence()
{
  let $batch := subsequence($values, 1, $SIZE)
  let $rest := subsequence($values, 1 + $SIZE)
  let $log := xdmp:log(text {
    "batch count", count($batch),
    "rest", count($rest) },
    "info")
  let $spawn :=
    if (empty($batch)) then ()
    else xdmp:spawn(
      $MODULE-BATCH,
      (xs:QName("VALUES-JSON"), xdmp:to-json($batch)),
      $OPTIONS )
  where exists($rest)
  return local:spawn($rest)
};

local:spawn(cts:element-values(...))
(: process one batch :)
declare variable $VALUES-JSON as xs:string external ;

xdmp:log(text { "batch", $VALUES-JSON }, "info"),
for $v in xdmp:from-json($VALUES-JSON)
return () (: do stuff with $v :)
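For the "do stuff" part, one possibility along the lines you described (one
stat document per value; again, the element names and URI scheme are
assumptions, not your schema):

for $v in xdmp:from-json($VALUES-JSON)
let $total := fn:sum(
  for $s in cts:element-values(
    xs:QName("sales"), (), (),
    cts:element-value-query(xs:QName("color"), $v))
  return $s * cts:frequency($s))
return xdmp:document-insert(
  fn:concat("/stats/color/", $v, ".xml"),
  element stat {
    element value { $v },
    element total-sales { $total } })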
-- Mike
On 2 Jun 2011, at 07:47 , Steve Mallen wrote:
> Hi Nuno,
>
> Unfortunately, I don't think it's that simple. I have a set of documents,
> each of which contains a single integer statistic (let's call it "number of
> sales" for the sake of argument). This datum is not computed by our system,
> but is provided by an external party and matched to each document. Each
> document (item) also contains metadata about the item, such as title, color,
> flavour, etc. These have been put into lexicons for fast faceting of search
> results.
>
> But we now have a new requirement whereby we want to know the *total*
> number of sales for each facet, and show the top-ranking (highest total
> sales) colors, titles, etc. So I effectively need to order by the sum of
> all "number of sales" values of all items matching a facet. As far as I
> know, there is no way to facet on a computed value. To add to the problem,
> some of the lexicons have millions of distinct values.
>
> Therefore the only solution I can think of is to iterate over all distinct
> values and pre-compute these values. I can then add a range index on the
> computed value and order by this sum.
>
> Hope I have explained this clearly...
>
> -Steve
>
> On 02/06/2011 15:35, Nuno Job wrote:
>> An example of what Michael said would be to use cts:element-values with the
>> "frequency-order" option and cts:frequency. You might be assuming we can't
>> do something that we are perfectly optimized to do.
>>
>> If that's not the case the recommendations on spawn and corb make sense. As
>> for spawn with an update: just remember spawn won't be rolled back and that
>> might be why the documentation says that. Ideally you should have a query
>> statement that spawns update statements.
>>
>> Makes sense?
>>
>> Nuno
>>
>> On Jun 2, 2011 3:06 PM, "Michael Blakeley" <[email protected]> wrote:
>> > Steve, the suggestions related to spawn and limits are good, but you might
>> > want to back up and reconsider the problem. Naturally you know the details
>> > of your problem best, but it might be possible to do some or all of the
>> > work more efficiently using existing product features.
>> >
>> > -- Mike
>> >
>> > On 2 Jun 2011, at 03:32 , Steve Mallen wrote:
>> >
>> >> Hi all,
>> >>
>> >> I'm having problems processing a large lexicon of values and wondered if
>> >> anyone had done something similar or had any ideas of how best to deal
>> >> with them.
>> >>
>> >> Basically, I've got a set of several million distinct values, and I want
>> >> to precompute a bunch of statistics for each of them (so that I can then
>> >> facet/sort values on the computed statistic). So, my plan is to fetch
>> >> all the values from the lexicon (storing them in a temp file, say), and
>> >> then run an XQuery on each value and store the resulting information in
>> >> a document (i.e. one stat document per value). I cannot do this in a
>> >> single query as it would take far too long to iterate over all values
>> >> and for all the computations and inserts.
>> >>
>> >> But I can't seem to figure out the best way of fetching and iterating
>> >> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
>> >> values). In SQL, I'd use a CURSOR to fetch the values one by one, and
>> >> then close the cursor at the end. There doesn't seem to be an analogous
>> >> concept in XQuery or XCC. I've tried something along the following lines:
>> >>
>> >> (cts:element-values( xs:QName($lexi) ))[$start to $end]
>> >>
>> >> and fetching the values in blocks until I run out of values but I'm
>> >> worried that this isn't very efficient, and I've got this nagging doubt
>> >> that the above will never return the empty sequence when $start is past
>> >> the end of the values. I'm not even sure how I should get a count of
>> >> the number of distinct values (xdmp:estimate doesn't work on the result
>> >> of cts:element-values()).
>> >>
>> >> So - do you guys know of a way of efficiently iterating over a large set
>> >> of lexicon values without timing out the query on the server?
>> >>
>> >> If I'm missing an obvious solution, please let me know...
>> >>
>> >> -Steve
>> >>
>> >> _______________________________________________
>> >> General mailing list
>> >> [email protected]
>> >> http://developer.marklogic.com/mailman/listinfo/general
>> >>
>> >
>>
>