By “which query” I mean which of the 125,000 separate query docs actually matched for a given cts:reverse-query() call.
I guess my question is: in the case where the reverse query is applied to an element that is not a full document, does the "brute force" have to be applied for every candidate query, or only for those that match the containing document of the input element?

If the brute-force cost is paid for every query, then a two-phase search would be faster: first determine which reverse queries apply to the input document, then use only those to find the elements within the input document that actually matched. But if the brute-force cost applies only to the queries that match the containing document, then MarkLogic must internally produce the result faster than I could in my own code.

But as you say, that calls into question the use of reverse queries at all: why not simply run the 125,000 forward queries and update each matched element as appropriate? Or it may simply be that we need to do some horizontal scaling and invest in additional D-nodes.

Cheers,

E.

--
Eliot Kimber
http://contrext.com

On 5/1/17, 10:26 PM, "Jason Hunter" <[email protected] on behalf of [email protected]> wrote:

> Another question: having gotten a result from a reverse search at the full document level, is there a way to know *which* queries matched? If so, then it would be easy enough to apply those queries to the relevant elements to do additional filtering (although I suppose that might get us back to the same place).

I'm a little confused. You're putting multiple serialized queries into each document? If you have just one serialized query in a document, it's going to be obvious which query was the reverse match -- it was that one.

> In particular, if I have 125,000 reverse queries applied to a single document (assuming that total database volume doesn't affect query speed in this case) on a modern fast server with appropriate indexes in place, how long should I expect that query to take? 1 ms? 10 ms? 100 ms? 1 second?
If you have 125,000 documents, each with a serialized query in it, and you do a reverse query for one document against those serialized queries and there are no hits, it should be extremely fast. More hits will slow things down a little, because each hit involves a little work. The IMLS paper explains what the algorithm has to do. I suspect (but haven't measured) that it's a lot like forward queries, in that the timing depends heavily on the number of matches.

> Our corpus has about 25 million elements that would be fragments per the advice above (about 1.5 million full documents).

If you have 25 million elements you want to run against 125,000 serialized queries, wouldn't forward queries be faster? You'd only have to do 125,000 search calls instead of 25,000,000. :)

> I've never done much with fragments in MarkLogic, so I'm not sure what the full implications of making these subelements into fragments would be for other processing.

Yeah, fragmentation is not to be done lightly.

-jh-

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
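
For reference, the one-serialized-query-per-document pattern discussed above looks roughly like this in XQuery. This is only a sketch: the URIs, collection name, and element names are hypothetical, and it assumes the database's "fast reverse searches" index is enabled.

```xquery
xquery version "1.0-ml";

(: Store one serialized cts:query per document (hypothetical URI and
   collection). A cts:query serializes to XML automatically when it is
   embedded in an element constructor. :)
xdmp:document-insert(
  "/queries/query-00001.xml",
  <stored-query>{ cts:word-query("marklogic") }</stored-query>,
  (),
  "queries"
);

(: Given one input document, find every stored query that matches it.
   Because each document holds exactly one serialized query, the URI of
   each matching document identifies the query that matched. :)
for $hit in cts:search(fn:collection("queries"),
                       cts:reverse-query(fn:doc("/content/doc-1.xml")))
return fn:base-uri($hit)
```

The forward-query alternative Jason mentions would instead loop over the 125,000 stored queries and run each one with an ordinary cts:search against the content.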
