We monitor an ingestion process by polling MarkLogic and displaying the status in a browser UI. Near the end of ingestion, and until the in-memory stands are written to disk, one of the polling queries typically takes 5+ seconds to run. Once the stands are written to disk, the same query runs in under 50 ms.
The ingestion process may insert and delete 10,000-50,000 documents. Of those, about a dozen contain metadata related to the import job, and they are marked up like this:

    <metadata-items>
      <type1>valueA</type1>
      <type2>valueB</type2>
      ...
    </metadata-items>

There are four types in total, each type is associated with an element range index, and each metadata document typically contains 10,000-80,000 of these type elements.

This is the query:

    cts:count-aggregate(
      cts:element-reference(xs:QName('type1'),...,xs:QName('type4')),
      "item-frequency",
      cts:directory-query($content-dir, 'infinity'))

To try to isolate the under-the-hood process responsible for counting index entries, we compared the performance of this query to cts:frequency(cts:values([same parameters as above])), but there was a disparity there too. During ingestion, the cts:frequency()-based query is more than 5x as fast as the count-aggregate() query; after the stands have been written to disk, count-aggregate() is 20x faster.

The MarkLogic documentation explains that in-memory stands are optimized for ingestion speed and on-disk stands are optimized for queries, but I am surprised to see a 100x difference in performance (5+ seconds versus under 50 ms for the same query). Is this expected? A pathological use case? Possibly a performance bug? In any case, I'm confused about why these two methods of counting the same index frequency values would perform so differently depending on the state of the stands.

Any thoughts?

-Will