Charles -

On Thu, May 19, 2022 at 12:46 PM Charles Bearden <cfbmd...@gmail.com> wrote:
> Thanks to Graydon, Tamara, and Christian for responding!
>
> I figured out a pretty fast way to exploit the infrastructure I had built
> (the files allocated out into many databases and a single index database
> generated from the databases).
>
> Here is a sample record from my index database:
>
> <entry>
>   <dbname>pmed_updates_b</dbname>
>   <pmid>34239076</pmid>
>   <version>1</version>
>   <path>pubmed22n1145.xml</path>
>   <date_revised>2022-01-09</date_revised>
> </entry>
>
> As it happens, there are eight versions of this record scattered across 7
> of the component databases and located in 8 input files (two of the input
> files were allocated to one of the databases). Each of these instances has
> an entry in the index database.
>
> My approach has four steps:
>
>    1. retrieve all entries from the index database that have the desired
>    PMID;
>    2. convert the sequence of XML entries into a sequence of maps with
>    the same data, ordering by filename descending, so that the most recent
>    file is the first element of the sequence;
>    3. take the first item/map of the sequence;
>    4. look up all occurrences of records with that PMID in the database
>    specified in the first item and call *db:path()* on each item and
>    compare it to the filename specified in the most recent record; the record
>    whose *db:path()* matches the item/map taken in step three is the most
>    recent version of the record with that PMID.
>
> Files are allocated by modulo to the different databases, so it is
> conceivable that a database will have more than one record with a given
> PMID, hence the necessity of comparing each record's path with the one
> given in the map from step three to determine which is the most recent.
>

Very neat. I had a thought that `db:list-details()`, specifically the 2nd
signature, would be useful here but now that I've 1) read your solution,
and 2) tried to play with some examples, I don't think it would be a very
helpful fit.
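
For anyone following along, here is how I read the four steps, as a rough Python sketch over in-memory stand-ins. This is not Chuck's XQuery and it does not touch BaseX. The names `IndexEntry`, `latest_entry`, `latest_record`, `INDEX`, and `DATABASES` are hypothetical, each database is modeled as a list of (path, record) pairs where the path plays the role of the value `db:path()` would return, and every sample row except the `pubmed22n1145.xml` entry quoted above is made up. Treating that quoted entry as the newest is also an assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One <entry> from the index database, as a plain record."""
    dbname: str
    pmid: str
    version: int
    path: str
    date_revised: str


def latest_entry(index, pmid):
    """Steps 1-3: filter by PMID, order by filename descending, take the first."""
    matches = [e for e in index if e.pmid == pmid]
    if not matches:
        return None
    # String ordering works here only because the filenames share one
    # zero-padded pattern (assumption based on the quoted sample).
    return sorted(matches, key=lambda e: e.path, reverse=True)[0]


def latest_record(index, databases, pmid):
    """Step 4: in the database named by the newest entry, keep the record
    whose path equals the path stored in that entry."""
    entry = latest_entry(index, pmid)
    if entry is None:
        return None
    for path, record in databases.get(entry.dbname, []):
        if record["pmid"] == pmid and path == entry.path:
            return record
    return None


# Sample data: only the pubmed22n1145.xml row comes from the thread.
INDEX = [
    IndexEntry("pmed_updates_a", "34239076", 1, "pubmed22n1101.xml", "2021-07-10"),
    IndexEntry("pmed_updates_b", "34239076", 1, "pubmed22n1120.xml", "2021-09-01"),
    IndexEntry("pmed_updates_b", "34239076", 1, "pubmed22n1145.xml", "2022-01-09"),
    IndexEntry("pmed_updates_a", "11111111", 1, "pubmed22n1150.xml", "2022-01-12"),
]

# Each database maps to (path, record) pairs; pmed_updates_b holds two
# records with the same PMID, which is the case step 4 has to handle.
DATABASES = {
    "pmed_updates_a": [
        ("pubmed22n1101.xml", {"pmid": "34239076", "date_revised": "2021-07-10"}),
        ("pubmed22n1150.xml", {"pmid": "11111111", "date_revised": "2022-01-12"}),
    ],
    "pmed_updates_b": [
        ("pubmed22n1120.xml", {"pmid": "34239076", "date_revised": "2021-09-01"}),
        ("pubmed22n1145.xml", {"pmid": "34239076", "date_revised": "2022-01-09"}),
    ],
}
```

Calling `latest_record(INDEX, DATABASES, "34239076")` walks only the one database named by the newest index entry, which I take to be the reason the lookup stays fast even with many component databases.
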
> Given the above PMID (for which there are eight versions of the record, as
> noted above) it took less than half a second to retrieve the most recent
> instance of that record out of over 35 million records.
>
> I can post the XQuery if anyone wants to see it. It would take longer to
> document how I build the content & index databases, and I still have to
> work out the best way to keep it all up to date.
>

Selfishly, I'd be very interested in seeing examples but don't put
yourself through any trouble.

> All the best,
> Chuck
> --
> Sr Systems Analyst
> University of Texas M.D. Anderson Cancer Center
>

Thanks for the interesting example.

Best,
Bridger