Steve, that makes sense to me. To use a range index on that sum, you would
have to put its value into the database in an XML element or attribute. You
could do that, and you could move that work to ingestion time using a CPF
pipeline. The pipeline would use triggers to update the relevant
meta-documents or summary documents for whatever values are in each update.
That would turn each summary document into a serialization point, but that
might be OK depending on your performance requirements and the frequency of
updates.
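As a rough sketch of that trigger approach (the element names, the summary-doc
URI scheme, and the frequency-weighted sum below are all illustrative
assumptions, not your actual schema):

(: illustrative trigger action module - adjust names to your schema :)
xquery version "1.0-ml";
declare namespace trgr = "http://marklogic.com/xdmp/triggers";
declare variable $trgr:uri as xs:string external ;

let $color := fn:string(fn:doc($trgr:uri)//color)
(: sum sales across all items with this color,
 : weighting each distinct lexicon value by its frequency :)
let $total := fn:sum(
  for $v in cts:element-values(
    xs:QName("sales"), (), (),
    cts:element-value-query(xs:QName("color"), $color))
  return $v * cts:frequency($v))
return xdmp:document-insert(
  fn:concat("/summaries/color/", $color, ".xml"),
  element summary {
    element color { $color },
    element total-sales { $total } })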
But if you stay on the path you are on now, I would recommend using xdmp:spawn.
That will let your code take better advantage of parallel processing. I would
structure this using at least two modules: one to process a single batch of N
values, and one to spawn off M batches of N values each. Here's an outline;
you could tweak the batch size and the element-values call to suit your needs.
You might need a larger batch size to avoid recursion limits, for example.
(: list all values and recursively spawn batches :)
declare variable $SIZE := 100 ;
(: placeholders - point these at your batch module and spawn options :)
declare variable $MODULE-BATCH := "/batch.xqy" ;
declare variable $OPTIONS := <options xmlns="xdmp:spawn"/> ;

declare function local:spawn(
  $values as xs:string* )
as empty-sequence()
{
  let $batch := subsequence($values, 1, $SIZE)
  let $rest := subsequence($values, 1 + $SIZE)
  let $log := xdmp:log(text {
    "batch count", count($batch),
    "rest", count($rest) },
    "info")
  let $spawn :=
    if (empty($batch)) then ()
    else xdmp:spawn(
      $MODULE-BATCH,
      (xs:QName("VALUES-JSON"), xdmp:to-json($batch)),
      $OPTIONS )
  where exists($rest)
  return local:spawn($rest)
};

local:spawn(cts:element-values(...))
(: process one batch :)
declare variable $VALUES-JSON as xs:string external ;

xdmp:log(text { "batch", $VALUES-JSON }, "info"),
for $v in xdmp:from-json($VALUES-JSON)
return () (: do stuff with $v :)
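For the "do stuff" part, one possibility along the lines you described (one
stat document per value; again, the element names and URI scheme are
assumptions, not your schema):

for $v in xdmp:from-json($VALUES-JSON)
let $total := fn:sum(
  for $s in cts:element-values(
    xs:QName("sales"), (), (),
    cts:element-value-query(xs:QName("color"), $v))
  return $s * cts:frequency($s))
return xdmp:document-insert(
  fn:concat("/stats/color/", $v, ".xml"),
  element stat {
    element value { $v },
    element total-sales { $total } })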
-- Mike
On 2 Jun 2011, at 07:47 , Steve Mallen wrote:
> Hi Nuno,
>
> Unfortunately, I don't think it's that simple. I have a set of documents,
> each of which contains a single integer statistic (let's call it "number of
> sales" for the sake of argument). This datum is not computed by our system,
> but is provided by an external party and matched to each document. Each
> document (item) also contains metadata about the item, such as title, color,
> flavour, etc. These have been put into lexicons for fast faceting of search
> results.
>
> But we now have a new requirement whereby we want to know the *total*
> number of sales for each facet, and show the top-ranking (highest total
> sales) colors, titles, etc. So I effectively need to order by the sum of
> all "number of sales" values of all items matching a facet. As far as I
> know, there is no way to facet on a computed value. To add to the problem,
> some of the lexicons have millions of distinct values.
>
> Therefore the only solution I can think of is to iterate over all distinct
> values and pre-compute these values. I can then add a range index on the
> computed value and order by this sum.
>
> Hope I have explained this clearly...
>
> -Steve
>
> On 02/06/2011 15:35, Nuno Job wrote:
>> An example of what Michael said would be to use cts:element-values with the
>> "frequency-order" option and cts:frequency. You might be assuming we can't
>> do something that we are perfectly optimized to do.
>>
>> If that's not the case the recommendations on spawn and corb make sense. As
>> for spawn with an update: just remember spawn won't be rolled back and that
>> might be why the documentation says that. Ideally you should have a query
>> statement that spawns update statements.
>>
>> Makes sense?
>>
>> Nuno
>>
>> On Jun 2, 2011 3:06 PM, "Michael Blakeley" <[email protected]> wrote:
>> > Steve, the suggestions related to spawn and limits are good, but you might
>> > want to back up and reconsider the problem. Naturally you know the details
>> > of your problem best, but it might be possible to do some or all of the
>> > work more efficiently using existing product features.
>> >
>> > -- Mike
>> >
>> > On 2 Jun 2011, at 03:32 , Steve Mallen wrote:
>> >
>> >> Hi all,
>> >>
>> >> I'm having problems processing a large lexicon of values and wondered if
>> >> anyone had done something similar or had any ideas of how best to deal
>> >> with them.
>> >>
>> >> Basically, I've got a set of several million distinct values, and I want
>> >> to precompute a bunch of statistics for each of them (so that I can then
>> >> facet/sort values on the computed statistic). So, my plan is to fetch
>> >> all the values from the lexicon (storing them in a temp file, say), and
>> >> then run an XQuery on each value and store the resulting information in
>> >> a document (i.e. one stat document per value). I cannot do this in a
>> >> single query as it would take far too long to iterate over all values
>> >> and for all the computations and inserts.
>> >>
>> >> But I can't seem to figure out the best way of fetching and iterating
>> >> over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
>> >> values). In SQL, I'd use a CURSOR to fetch the values one by one, and
>> >> then close the cursor at the end. There doesn't seem to be an analogous
>> >> concept in XQuery or XCC. I've tried something along the following lines:
>> >>
>> >> (cts:element-values( xs:QName($lexi) ))[$start to $end]
>> >>
>> >> and fetching the values in blocks until I run out of values but I'm
>> >> worried that this isn't very efficient, and I've got this nagging doubt
>> >> that the above will never return the empty sequence when $start is past
>> >> the end of the values. I'm not even sure how I should get a count of
>> >> the number of distinct values (xdmp:estimate doesn't work on the result
>> >> of cts:element-values()).
>> >>
>> >> So - do you guys know of a way of efficiently iterating over a large set
>> >> of lexicon values without timing out the query on the server?
>> >>
>> >> If I'm missing an obvious solution, please let me know...
>> >>
>> >> -Steve
>> >>
>> >> _______________________________________________
>> >> General mailing list
>> >> [email protected]
>> >> http://developer.marklogic.com/mailman/listinfo/general
>> >>
>> >
>>
>