This is a process that is performed almost constantly, since new material is added to the corpus and the classification details are refined all the time.
One of the required features of the system is to produce a report of what the new classification would be if the full classification process were applied to the current content, so that those responsible for the classification can evaluate its correctness. That process takes so long that it risks delaying publication of updated content beyond the time required by the business process this system serves.

I think I have enough to go on now to explore a few possible avenues, as well as gather more precise profiling and performance info.

Cheers,

E.

--
Eliot Kimber
http://contrext.com

On 5/2/17, 1:04 AM, "Jason Hunter" <[email protected] on behalf of [email protected]> wrote:

> By "which query" I mean which of the 125,000 separate query docs actually matched for a given cts:reverse-query() call.

    cts:search(
      doc(),
      cts:reverse-query(doc("newdoc.xml"))
    )

This will return all the docs containing any serialized queries which would match newdoc.xml.

> I guess my question is: in the case where the reverse query is applied to an element that is not a full document, does the "brute force" have to be applied for every candidate query, or only for those that match the containing document of the input element?

In general I avoid putting any XPath in the first arg. In the JavaScript API it's not even possible, because it gives a false sense of optimization.

> If the brute force cost is applied to each query then doing a two-phase search would be faster: determine which reverse queries apply to the input document and then use those to find the elements within the input document that actually matched. But if the brute force cost only applies to those queries that match the containing doc, then ML internally must produce the result faster than doing it in my own code.
>
> But as you say, that calls into question the use of reverse queries at all: why not simply run the 125,000 forward queries and update each element matched as appropriate?

Yep.
If it's a one-time batch job and you're trying to minimize the time, then this would be faster, I bet.

> Or it may simply be that we need to do some horizontal scaling and invest in additional D-nodes.

You're going to do this often?

-jh-

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
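[Editor's note] The two-phase idea discussed in the thread — first use a reverse-query-style pass to find which of the stored queries match the document at all, then run only those survivors forward against individual elements to see exactly what matched — can be sketched outside of MarkLogic. The plain-JavaScript sketch below is purely illustrative and is not the cts API: the predicate "queries", the document shape, and all names are invented stand-ins.

```javascript
// Conceptual sketch of the two-phase approach (plain JavaScript,
// NOT MarkLogic's cts API; queries are stand-in predicates and a
// "document" is just an array of elements with text content).

// A "query" here is a predicate over an element's text.
const queries = [
  { id: "q1", matches: (text) => text.includes("alpha") },
  { id: "q2", matches: (text) => text.includes("beta") },
  { id: "q3", matches: (text) => text.includes("gamma") },
];

// A hypothetical new document with three elements.
const newDoc = {
  uri: "newdoc.xml",
  elements: [
    { path: "/doc/p[1]", text: "alpha content" },
    { path: "/doc/p[2]", text: "beta content" },
    { path: "/doc/p[3]", text: "unrelated" },
  ],
};

// Phase 1 (the reverse-query step): find which stored queries match
// the document as a whole, i.e. match at least one of its elements.
function matchingQueries(doc, allQueries) {
  return allQueries.filter((q) =>
    doc.elements.some((el) => q.matches(el.text))
  );
}

// Phase 2: run only the surviving queries forward against each
// element to determine exactly which elements matched which query.
function elementMatches(doc, candidateQueries) {
  const result = [];
  for (const el of doc.elements) {
    for (const q of candidateQueries) {
      if (q.matches(el.text)) {
        result.push({ query: q.id, path: el.path });
      }
    }
  }
  return result;
}

const candidates = matchingQueries(newDoc, queries); // q1 and q2 survive
const hits = elementMatches(newDoc, candidates);
```

In MarkLogic itself, phase 1 would be the cts:search / cts:reverse-query call shown earlier in the thread, and phase 2 would run the surviving serialized queries forward against the input document's elements. The forward-only alternative Jason endorses for a one-time batch job is effectively phase 2 run with all 125,000 queries as candidates, skipping phase 1 entirely.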
