I think the key bit is here: “MarkLogic indexes work at the fragment/document level. So doing a reverse query 20 times against different subparts of a document is going to involve brute force filtering to see if the match was in the needed part or not.”
That suggests that our general approach to using reverse queries is flawed for this reason, which would explain the apparent poor performance. It's not possible to break the current docs into smaller docs, but it might be possible to configure fragmentation at a level where each fragment would have only one element we need to match on (e.g., titles). See the first sketch below.

Another question: having gotten a result from a reverse search at the full-document level, is there a way to know *which* queries matched? If so, it would be easy enough to apply those queries to the relevant elements to do additional filtering (although I suppose that might get us back to the same place). The second sketch below is what I have in mind.
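For the fragmentation idea, my understanding is that the knob here is a fragment root on the database, set through the Admin API. A minimal sketch of what I think the change looks like, assuming a database named "Documents" and title elements in no namespace (both placeholders for whatever we actually use):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
        at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    (: "Documents" is a placeholder database name :)
    let $db := admin:database-get-id($config, "Documents")
    (: make every <title> element (no namespace) the root of its own fragment :)
    let $config := admin:database-add-fragment-root(
                     $config, $db, admin:database-fragment-root("", "title"))
    return admin:save-configuration($config)

If I read the docs correctly, existing content only picks up new fragment boundaries on reindex, which for 1.5 million documents is not a small job in itself.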
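On the which-queries-matched question, unless I'm misreading the docs, cts:search with cts:reverse-query already gives us this: it returns the stored query documents whose queries match the input node, which is effectively the list of matching queries. It also doubles as Jason's "prove it in one pop" whole-document check from the earlier exchange below, though since in our case nearly every element matches something, the early exit itself won't save much. A sketch resting on several assumptions of mine: the rules live in a "rules" collection, each rule document's root element is a serialized cts:query, and the elements of interest are titles:

    xquery version "1.0-ml";

    (: hypothetical URI; one reverse query against the whole document
       tells us both whether anything matched and which rules matched :)
    let $doc := fn:doc("/content/example.xml")
    let $rules := cts:search(fn:collection("rules"), cts:reverse-query($doc))
    where fn:exists($rules)  (: no document-level hits: skip the enumeration :)
    return
      for $title in $doc//title   (: assumed element of interest :)
      for $rule in $rules
      (: re-test only the document-level matches against this one element;
         assumes the rule's root element is the serialized query :)
      where cts:contains($title, cts:query($rule/*))
      return <match element="{xdmp:path($title)}" rule="{xdmp:node-uri($rule)}"/>

Even if the per-element re-test reintroduces some filtering cost, narrowing 125,000 rules to the document-level matches first should make the per-element work far smaller.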
Unfortunately my current performance metrics are "it takes way too long now and needs to take no more than half as long". I need to do more work to get some useful measurements and do some calculations to determine what reasonable performance should be (e.g., we have X million cases to check at 100 ms each, so it should take about Y time, but it takes Y*n time. Why?). Ultimately I need to determine how fast this type of operation *should* be. If I can determine that, then I can determine whether the throughput requirements can be met simply by achieving that performance with the current server configuration, or conclude that they cannot and that we need to scale up, e.g., add additional D-nodes or something.

I realize that nobody can offer me solid numbers based on what little I can share about the project details, other than to suggest some bounds. In particular, if I have 125,000 reverse queries applied to a single document (assuming that total database volume doesn't affect query speed in this case) on a modern, fast server with appropriate indexes in place, how long should I expect that query to take? 1 ms? 10 ms? 100 ms? 1 second? Based on my experience with ML and the documentation I would expect something around 10 ms.

Our corpus has about 25 million elements that would become fragments per the advice above (about 1.5 million full documents). If we assume 10 ms per fragment for the full reverse-query check, that works out to about 250,000 seconds, or roughly 3 days, to process all of them. Currently it takes 9, so roughly a 3x slowdown over what I think we could expect, +/- a day (there's other overhead in this 9-day number that may or may not be reducible).

I've never done much with fragments in MarkLogic, so I'm not sure what the full implications of making these subelements into fragments would be for other processing.

Cheers,

Eliot

--
Eliot Kimber
http://contrext.com


On 5/1/17, 9:43 PM, "Jason Hunter" <[email protected] on behalf of [email protected]> wrote:

So what's the performance you're seeing? And what do you expect to be able to see?

Something to consider: MarkLogic indexes work at the fragment/document level. So doing a reverse query 20 times against different subparts of a document is going to involve brute force filtering to see if the match was in the needed part or not.

Might be better to have 20 documents instead of 1.

-jh-

> On May 2, 2017, at 01:29, Eliot Kimber <[email protected]> wrote:
>
> Actually, it's expected that every element will be matched by at least one query. This is a classification application and the intent of the application is that every element of interest will be classified. Many, if not most, of the queries depend on word-search features, e.g., stemmed matches, case insensitivity, etc.
>
> I'm new to this project so it may be that there is a better way to approach the problem in general. This is the system as currently implemented.
>
> My overall charge is to improve throughput performance, so my first task is to understand what the performance bottlenecks are and then identify possible solutions.
>
> It seems unlikely that we've done something silly in our queries or ML configuration, but I want to eliminate the easy-to-fix before exploring more complicated options.
>
> Cheers,
>
> Eliot
>
> --
> Eliot Kimber
> http://contrext.com
>
> On 5/1/17, 12:10 PM, "Jason Hunter" <[email protected] on behalf of [email protected]> wrote:
>
>> The processing is, for each document to be processed, examine on the order of 10-20 elements to see if they match the reverse query by getting the node to be looked up and then doing:
>
> Maybe you can reverse query on the document as a whole instead of running 20 reverse queries per document. Only bother with the enumeration of the 20 if there's a proven hit within the document.
>
> (I assume the vast majority of the time there's not going to be hits. If that's true then why not prove that in one pop instead of 20 pops.)
>
> -jh-

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
