We monitor an ingestion process by polling MarkLogic and display the status in 
a browser UI. Near the end of ingestion, and until the in-memory stands are 
written to disk, one of these polling queries typically takes 5+ seconds to 
run. However, once the stands are written to disk, the same query runs in 
under 50 ms. 

The ingestion process may insert and delete 10,000-50,000 documents. Of those, 
about a dozen contain metadata related to the import job, and they are marked 
up like:

<metadata-items>
  <type1>valueA</type1>
  <type2>valueB</type2>
  ...
</metadata-items>

There are four types in total, each associated with an element range index, 
and each metadata document typically contains 10,000-80,000 of these type 
elements. This is the query:

cts:count-aggregate(
  cts:element-reference(xs:QName('type1'),...,xs:QName('type4')), 
  "item-frequency", 
  cts:directory-query($content-dir, 'infinity'))

To try to isolate the under-the-hood process responsible for counting index 
entries, we compared the performance of this query with 
cts:frequency(cts:values([same parameters as above])), but there was a 
disparity there too. During ingestion, the cts:frequency()-based query is more 
than 5x as fast as the cts:count-aggregate() query; after the stands have been 
written to disk, cts:count-aggregate() is 20x faster.
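
For anyone who wants to reproduce the comparison, here is a minimal sketch of 
how the two variants could be timed side by side. It is illustrative only and 
not the exact code we ran: the "/content/" directory is a hypothetical stand-in 
for $content-dir, only the type1 index is measured, and summing cts:frequency() 
over cts:values() is an assumed form of the second variant. If the range index 
is a string index with a non-default collation, cts:element-reference() would 
also need the matching type and collation options.

```xquery
xquery version "1.0-ml";

(: Hypothetical directory; substitute the real value of $content-dir. :)
declare variable $content-dir as xs:string := "/content/";

(: Only one of the four range-indexed elements is timed in this sketch. :)
let $ref   := cts:element-reference(xs:QName("type1"))
let $query := cts:directory-query($content-dir, "infinity")

(: Variant 1: let the server count lexicon entries directly. :)
let $t0 := xdmp:elapsed-time()
let $aggregate := cts:count-aggregate($ref, "item-frequency", $query)
let $t1 := xdmp:elapsed-time()

(: Variant 2 (assumed form): fetch each value and sum its frequency. :)
let $summed := fn:sum(
  for $v in cts:values($ref, (), "item-frequency", $query)
  return cts:frequency($v))
let $t2 := xdmp:elapsed-time()

return (
  fn:concat("count-aggregate: ", $aggregate, " in ", $t1 - $t0),
  fn:concat("sum of cts:frequency: ", $summed, " in ", $t2 - $t1)
)
```

Running this once while in-memory stands are still present and again after 
they have been saved to disk should show whether the disparity reproduces. A 
single run is only indicative, since caching can skew the numbers.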

The MarkLogic documentation explains that in-memory stands are optimized for 
ingestion speed and on-disk stands are optimized for queries, but I am 
surprised to see a 100x difference in performance. Is this expected? A 
pathological use case? Possibly a performance bug? In any case, I'm confused 
about why these two methods of counting the same index frequency values would 
differ so much depending on the state of the stands.

Any thoughts?

-Will