This is a process that is performed almost constantly as new material is added 
to the corpus or the classification details are refined, both of which happen 
all the time.

One of the required features of the system is to produce a report of what the 
new classification would be if the full classification process were applied to 
the current content, so that those responsible for the classification can 
evaluate its correctness. That process takes so long that it risks delaying 
the publishing of updated content beyond the time allowed by the business 
process this system serves.

I think I have enough to go on now to explore a few possible avenues, as well 
as gather more precise profiling and performance info.

Cheers,

E.

--
Eliot Kimber
http://contrext.com
 


On 5/2/17, 1:04 AM, "Jason Hunter" <[email protected] on 
behalf of [email protected]> wrote:

    > By “which query” I mean which of the 125,000 separate query docs actually 
matched for a given cts:reverse-query() call. 
    
    cts:search(
      doc(),
      cts:reverse-query(doc("newdoc.xml"))
    )
    
    This will return all the docs containing any serialized queries which would 
match newdoc.xml.
    
    > I guess my question is: in the case where the reverse query is applied to 
an element that is not a full document, does the “brute force” have to be 
applied for every candidate query or only for those that match containing 
document of the input element? 
    
    In general I avoid putting any xpath in the first arg.  In the JavaScript 
API it's not even possible, because it gives a false sense of optimization.
    
    > If the brute force cost is applied to each query then doing a two-phase 
search would be faster: determine which reverse queries apply to the input 
document and then use those to find the elements within the input document that 
actually matched. But if the brute force cost only applies to those queries 
that match the containing doc then ML internally must produce the faster result 
than doing it in my own code. 
    > 
    > But as you say, that calls into question the use of reverse queries 
at all: why not simply run the 125,000 forward queries and update each element 
matched as appropriate?
    
    Yep.  If it's a one-time batch job and you're trying to minimize the time 
then this would be faster, I bet.
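    
    A minimal sketch of that forward-query batch, not a tested implementation. 
    It assumes the serialized queries are stored one per document in a 
    collection called "classification-queries" (a hypothetical name), and it 
    only lists which documents each query matches rather than updating them:
    
    xquery version "1.0-ml";
    
    (: Sketch only: the collection name is an assumption. :)
    for $query-doc in fn:collection("classification-queries")
    (: Rebuild a cts:query from its serialized XML form. :)
    let $query := cts:query($query-doc/*)
    for $match in cts:search(fn:doc(), $query)
    return
      (: Report the pair; the real job would apply the classification here. :)
      fn:concat(xdmp:node-uri($query-doc), " -> ", xdmp:node-uri($match))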
    
    > Or it may simply be that we need to do some horizontal scaling and invest 
in additional D-nodes.
    
    You're going to do this often?
    
    -jh-
    
    _______________________________________________
    General mailing list
    [email protected]
    Manage your subscription at: 
    http://developer.marklogic.com/mailman/listinfo/general
    

