Over the last couple of years I’ve developed the Mirabel system, which 
provides DITA link management and query features over large volumes of 
content (the ServiceNow product documentation source). In particular, it 
knows what links to what and lets you view the content with all link 
information available.

For a given version of the product docs we have about 60K DITA topics and 100 
root maps that organize those topics into publications.

The primary job of Mirabel is to capture all the hyperlink details as defined 
by the DITA source and enable queries about the element-to-element and 
document-to-document relationships established by those links.

My implementation approach for loading the link knowledge uses a multi-step 
process:

  1.  Load the entire source content into a database
  2.  Create a “key space” database that reflects the DITA key-to-resource 
mappings defined by each root DITA map. The key spaces are XQuery maps from 
key names to resources identified by their database node IDs (essentially, 
each use of a topic from a map has an associated unique key by which that use 
of the topic can be referenced). The key spaces are a prerequisite for 
resolving cross references to keys from one topic to other topics in the 
context of some root map (the same root map or a different one); a minimal 
sketch follows this list.
  3.  Create a “link record keeping” database that contains the “where used” 
index for the content.
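
Here is the minimal sketch of a step-2 key space promised above. The 
traversal is simplified (it does not recurse into submaps), and names like 
local:key-space and the 'content' database are stand-ins, not my actual code:

  declare function local:key-space($root-map as document-node()) as map(*) {
    map:merge(
      for $keydef in $root-map//*[@keys]
      for $key in tokenize($keydef/@keys)
      (: simplification: assume @href matches the topic's path in the
         content database :)
      let $topic := db:get('content', string($keydef/@href))
      return map:entry($key, db:node-id($topic))
    )
    (: map:merge defaults to 'use-first' for duplicate keys, which roughly
       matches DITA's rule that the earliest effective key definition wins :)
  };

  local:key-space(db:get('content', 'maps/install.ditamap'))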

The where-used index maps element node IDs to a record of every reference to 
that node (cross references, content references, topic references from maps). 
It is the core data for knowing where a given map or topic is used, and thus 
for answering questions like “what publications use this topic?” or “is this 
topic used at all?”. The where-used table is constructed as an XQuery map 
that is then converted to XML for storage (I implemented this before BaseX 
added direct storage of maps; given the size, I think storing it as XML still 
makes sense, but I could be wrong). Construction proceeds in three passes:
     *   Process all map-to-map and map-to-topic references and create the 
initial where-used entries, one for each map and topic.
     *   For topics referenced from maps, process all topic-to-topic references 
and update the records for each target topic to reflect the references to it. 
The map context of a given topic determines the targets of key references from 
that topic, so it is necessary to process the topics in the context of the root 
maps that use them (in DITA, root maps determine the key-to-resource bindings 
to which key references resolve).
     *   For topics not referenced from any maps, add entries for them to the 
where-used table and process any topic-to-topic references (key references 
cannot be resolved but direct URI references can be).

Finally, convert the XQuery map to a single XML document and store it in the 
link record keeping database. The resulting database takes about 150MB of 
storage.
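
For illustration, a stripped-down version of that conversion (the element 
names and the shape of the entries are simplified stand-ins, not my actual 
format):

  declare variable $where-used external;  (: the map built in the passes
    above; each entry maps a target node ID to the IDs that reference it :)

  declare function local:where-used-to-xml($where-used as map(*))
      as element(where-used) {
    <where-used>{
      map:for-each($where-used, function($target-id, $ref-ids) {
        <entry target-id="{$target-id}">{
          for $ref-id in $ref-ids
          return <ref node-id="{$ref-id}"/>
        }</entry>
      })
    }</where-used>
  };

  db:put('linkdb', local:where-used-to-xml($where-used), 'where-used.xml')

If storing the map directly turns out to be viable at this scale, 
db:put-value('linkdb', $where-used, 'where-used') would replace the 
conversion entirely; I just haven’t measured that.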

This third step can take two to three hours: 60K topics at roughly 0.2 
seconds per topic is 3.3 hours. 0.1 seconds per topic is about as fast as the 
link processing can go, based on my testing.

This is all done using temporary databases so as not to disturb the working 
databases used by the running Mirabel web application. The work is performed 
by a BaseX “worker” server, not the main server that serves the web site. I 
essentially have one BaseX HTTP server for each core on my server and 
allocate work to them based on load, so queries coming from the web app will 
never be allocated to a worker that is currently doing a content update.
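
For context, dispatching work to a worker amounts to something like the 
following (the port, credentials, and query text are stand-ins; the real 
dispatcher also does the load checking):

  declare namespace http = "http://expath.org/ns/http-client";

  (: POST an XQuery to the chosen worker's REST endpoint :)
  http:send-request(
    <http:request method="post" href="http://localhost:8985/rest"
                  username="admin" password="admin"
                  auth-method="Basic" send-authorization="true">
      <http:body media-type="application/xml">{
        <query xmlns="http://basex.org/rest">
          <text>(: the content-update job to run on this worker :)</text>
        </query>
      }</http:body>
    </http:request>
  )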

Once all the new link data is loaded, the temporary databases are swapped 
into production by renaming the production databases, renaming the temp 
databases to their production names, and then dropping the old databases. 
(Writing this just now, I realize that I don’t know how to pause or wait for 
active queries against the in-production databases to finish so I can swap 
the databases safely.)
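
Concretely, the swap boils down to three small updating queries run in 
sequence by the orchestration jobs (database names here are stand-ins):

  (: job 1: move current production out of the way :)
  db:alter('content', 'content-old'),
  db:alter('linkdb', 'linkdb-old')

  (: job 2: promote the temp databases :)
  db:alter('content-temp', 'content'),
  db:alter('linkdb-temp', 'linkdb')

  (: job 3: drop the old databases :)
  db:drop('content-old'),
  db:drop('linkdb-old')

For the pausing question, I have wondered whether polling job:list-details() 
for running read jobs before job 2 would be enough, but I haven’t tried it.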

Because all the index entries use node IDs, the content database and 
record-keeping database have to be put into production at the same time; 
otherwise the content node IDs will be out of sync with the indexed record 
IDs. I’m working on the assumption that renaming databases is essentially 
instantaneous, so I can use renames to swap the temp databases into 
production reliably.
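
To illustrate the coupling (the path here is hypothetical):

  (: a node ID captured at index-build time ... :)
  let $id := db:node-id(db:get('content', 'topics/install-overview.dita'))
  (: ... only resolves to the right node against the same build of 'content' :)
  return db:get-id('content', $id)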

I use my job orchestration module 
(https://github.com/ekimbernow/basex-orchestration) to manage the sequence of 
operations, where each job calls the next job in the sequence once it has 
finished.

This process works reliably for smaller volumes of content—for example, a 
content set with only a couple of thousand topics and four or five root maps.

But at full scale I’m consistently seeing that the link record keeping 
database, which contains only two large XML documents, never completes 
optimization: the databases overview shows the database with two resources in 
it, but when you open the database’s own page they do not show up, and the 
job that performs the optimization never completes, leaving the database in a 
locked state. This means the new where-used index can’t be put into 
production.

I suspect I’m going about this the wrong way to make best use of BaseX with 
very large databases, but I don’t see any obvious alternative approach. It 
feels like I’m missing something fundamental or making a silly error that I 
can’t see.

So my question:

How would you solve this problem?

In particular, how would you go about constructing the where-used index in a 
way that works best with BaseX?

Or maybe the question is “should I be updating the in-production database 
with the new data and doing the swap into production within the database 
itself?” (i.e., by renaming the where-used index document rather than the 
database itself).
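
That alternative would look something like the following (paths are 
stand-ins; I’m also assuming db:delete and db:rename sequence correctly when 
combined in one query, which is itself something I’d want to verify):

  (: query 1: load the new index alongside the live document :)
  (: $new-where-used-doc is the freshly built index document :)
  db:put('linkdb', $new-where-used-doc, 'where-used-new.xml')

  (: query 2, run after query 1 completes: swap it into place :)
  db:delete('linkdb', 'where-used.xml'),
  db:rename('linkdb', 'where-used-new.xml', 'where-used.xml')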

I am currently using BaseX 11.6 and can move to 12 once it is released.

Thanks,

Eliot
_____________________________________________
Eliot Kimber
Sr. Staff Content Engineer