Charles -

On Thu, May 19, 2022 at 12:46 PM Charles Bearden <cfbmd...@gmail.com> wrote:

> Thanks to Graydon, Tamara, and Christian for responding!
>
> I figured out a pretty fast way to exploit the infrastructure I had built
> (the files allocated out into many databases and a single index database
> generated from the databases).
>
> Here is a sample record from my index database:
>
> <entry>
>   <dbname>pmed_updates_b</dbname>
>   <pmid>34239076</pmid>
>   <version>1</version>
>   <path>pubmed22n1145.xml</path>
>   <date_revised>2022-01-09</date_revised>
> </entry>
>
> As it happens, there are eight versions of this record scattered across
> seven of the component databases and located in eight input files (two of
> the input files were allocated to one of the databases). Each of these
> instances has an entry in the index database.
>
> My approach has four steps:
>
>    1. retrieve all entries from the index database that have the desired
>    PMID;
>    2. convert the sequence of XML entries into a sequence of maps with
>    the same data, ordering by filename descending, so that the most recent
>    file is the first element of the sequence;
>    3. take the first item/map of the sequence;
>    4. look up all occurrences of records with that PMID in the database
>    specified in the first item and call *db:path()* on each item and
>    compare it to the filename specified in the most recent record; the record
>    whose *db:path()* matches the item/map taken in step three is the most
>    recent version of the record with that PMID.
>
> Files are allocated by modulo to the different databases, so it is
> conceivable that a database will have more than one record with a given
> PMID, hence the necessity of comparing each record's path with the one
> given in the map from step three to determine which is the most recent.
>

Very neat. I had a thought that `db:list-details()`, specifically the 2nd
signature, would be useful here, but now that I've 1) read your solution
and 2) tried to play with some examples, I don't think it would be a very
helpful fit.
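
For anyone skimming the thread, here is how I read your four steps, as a
rough XQuery sketch and not your actual query. The index database name
`pmed_index` and the `PubmedArticle/MedlineCitation/PMID` lookup path are my
assumptions; the `entry` element names come from your sample record. I am
also assuming that `db:path()` returns the same bare file name that is
stored in `<path>`.

```xquery
(: Sketch only. $index-db and the PubmedArticle lookup path are assumptions,
   not taken from the original setup. :)
declare variable $pmid     as xs:string external := '34239076';
declare variable $index-db as xs:string external := 'pmed_index';

(: Steps 1 and 2: fetch the index entries for the PMID and turn them into
   maps, ordered by file name descending so the newest file comes first.
   db:open is called db:get in newer BaseX releases. :)
let $candidates :=
  for $entry in db:open($index-db)//entry[pmid = $pmid]
  order by string($entry/path) descending
  return map {
    'dbname': string($entry/dbname),
    'path'  : string($entry/path)
  }

(: Step 3: the first map describes the most recent version. :)
let $latest := head($candidates)

(: Step 4: look in the database named by that map and keep only the record
   whose resource path matches the file name from the index. :)
return
  if (empty($latest)) then ()
  else
    db:open($latest?dbname)//PubmedArticle
      [MedlineCitation/PMID = $pmid]
      [db:path(.) = $latest?path]
```

The second predicate is what handles the case you describe, where one
database holds more than one record with the same PMID.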


> Given the above PMID (for which there are eight versions of the record, as
> noted above) it took less than half a second to retrieve the most recent
> instance of that record out of over 35 million records.
>
> I can post the XQuery if anyone wants to see it. It would take longer to
> document how I build the content & index databases, and I still have to
> work out the best way to keep it all up to date.
>

Selfishly, I'd be very interested in seeing examples, but don't put
yourself through any trouble.

> All the best,
> Chuck
>
> --
> Sr Systems Analyst
> University of Texas M.D. Anderson Cancer Center
>

Thanks for the interesting example.
Best,
Bridger
