Christian,

That is helpful. You've confirmed my initial analysis: because BaseX databases are lightweight, keeping things simple is the most appropriate choice.
If I were doing things at scale, of course I'd do performance testing to see where the bottlenecks are, but that is not a concern for what I'm doing now.

Cheers,

E.
-----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com

On 5/16/15, 5:10 AM, "Christian Grün" <christian.gr...@gmail.com> wrote:

>Hi Eliot,
>
>As usual, there is no simple answer to such a question. However, I can
>say that it sounds like a good choice to use one BaseX database per git
>repository. In contrast to many other DBMSs, databases in BaseX are
>pretty lightweight containers, and in some of our own use cases we
>even create one database per document.
>
>If you have hundreds or thousands of databases, then it may be
>reasonable to merge them into single units, because it may take too
>much time to access the database directories in the file system. Some
>file systems are better than others at handling large numbers of files
>and directories on the same level. The same observation applies if you
>frequently write queries that access more than one database: it's
>always faster to open a single database (but usually you will only
>notice this when opening a larger number of databases).
>
>Hope this helps,
>Christian
>
>
>On Thu, May 14, 2015 at 3:57 PM, Eliot Kimber <ekim...@contrext.com>
>wrote:
>> In the discussion of adding metadata to a bunch of files, Christian
>> points out that you can either limit queries to directories within a
>> single database or apply a query to multiple databases.
>>
>> My question: when or why would you prefer one approach over the other?
>>
>> In my case I'm using BaseX to reflect the XML contents of git
>> repositories. My current approach is to create a separate database for
>> each repo/branch pair, my reasoning being that this makes it easiest
>> to limit queries to just that branch. Because the BaseX data is
>> intended to be a read-only reflection of the git-managed source, it
>> also makes it easy to clear the data for a branch if it's gotten out
>> of sync (or I suspect it's gotten out of sync) by simply dropping the
>> database.
>>
>> I have complete control over the queries (through a library of
>> functions that understand the git nature of the databases), so I could
>> just as easily use a single database with subdirectories that reflect
>> the repos and branches.
>>
>> In this scenario, as an example, is there any compelling reason to use
>> one approach or the other?
>>
>> I like having one database per branch because that seems like a
>> natural mapping that generally keeps things simple and more or less
>> obvious (e.g., doing "list" will show the list of databases, which
>> reflect the repo and branch names in their names).
>>
>> In this application the scale will usually be relatively small: 1000s
>> or 10s of 1000s of individual documents in any given branch, but the
>> querying and indexing, which supports maintaining knowledge of the
>> links within the XML content, could get intense.
>>
>> Cheers,
>>
>> Eliot
>>
>> -----
>> Eliot Kimber, Owner
>> Contrext, LLC
>> http://contrext.com
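
For readers comparing the two layouts discussed in this thread, here is a minimal sketch in Python of how a helper library might build the query and command strings for each one. It is not Eliot's actual library. The function names, the "gitmirror" database name, the double-underscore separator, and the set of characters treated as safe in a database name are all assumptions made for illustration; check the BaseX documentation for the characters actually allowed. The sketch only builds strings and does not talk to a BaseX server.

```python
import re

# Hypothetical name for the single-database layout.
SINGLE_DB = "gitmirror"


def _clean(part):
    # Assumption: keep only letters, digits, "_" and "-" in database names.
    return re.sub(r"[^A-Za-z0-9_-]", "_", part)


def db_name(repo, branch):
    # Layout 1: one database per repo/branch pair.
    return f"{_clean(repo)}__{_clean(branch)}"


def per_branch_query(repo, branch, xpath):
    # Query scoped to the database that mirrors one branch.
    return f'collection("{db_name(repo, branch)}"){xpath}'


def single_db_query(repo, branch, xpath):
    # Layout 2: one database, with repo/branch as a path prefix.
    return f'collection("{SINGLE_DB}/{repo}/{branch}"){xpath}'


def drop_command(repo, branch):
    # Layout 1 only: resync a branch by dropping its database.
    return f"DROP DB {db_name(repo, branch)}"


print(per_branch_query("docs", "master", "//topic"))
print(single_db_query("docs", "master", "//topic"))
print(drop_command("docs", "feature/x"))
```

Because only these helpers know which layout is in use, switching from one database per branch to a single database would mean changing the helpers and not the queries built on top of them. Note that the cleaning step can map two different branch names to the same database name (for example "a/b" and "a_b"), so a real library would need a collision-free encoding.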