Christian,

That is helpful. Basically, you've confirmed my initial analysis: because
BaseX databases are lightweight, keeping things simple is the most
appropriate choice.

If I were doing things at scale, I would of course do performance testing
to see where the bottlenecks are, but that is not a concern for what I'm
doing now.

Cheers,

E.
—————
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 5/16/15, 5:10 AM, "Christian Grün" <christian.gr...@gmail.com> wrote:

>Hi Eliot,
>
>As usual, there is no simple answer to such a question. However, I can
>say that it sounds like a good choice to use one BaseX database per git
>repository. In contrast to many other DBMSs, databases in BaseX are
>pretty lightweight containers, and in some of our own use cases we
>even create one database per document.
>
>If you have hundreds or thousands of databases, then it may be
>reasonable to merge them into single units, because it may take too
>much time to access the database directories in the file system. Some
>file systems are better than others at handling large numbers of files
>and directories on the same level. The same observation applies if you
>frequently write queries that access more than one database: It's
>always faster to open a single database (but usually you will only
>notice this when opening a larger number of databases).
>
>Hope this helps,
>Christian
>
>
>On Thu, May 14, 2015 at 3:57 PM, Eliot Kimber <ekim...@contrext.com>
>wrote:
>> In the discussion of adding metadata to a bunch of files, Christian points
>> out that you can either limit queries to directories within a single
>> database or apply a query to multiple databases.
>>
>> My question: when or why would you prefer one approach over the other?
>>
>> In my case I'm using BaseX to reflect the XML contents of git
>> repositories. My current approach is to create a separate database for
>> each repo/branch pair, my reasoning being that this makes it easiest to
>> limit queries to just that branch. Because the BaseX data is intended to
>> be a read-only reflection of the git-managed source, it also makes it
>> easy to clear the data for a branch if it's gotten out of sync (or I
>> suspect it's gotten out of sync) by simply dropping the database.
>>
>> I have complete control over the queries (through a library of functions
>> that understand the git nature of the databases), so I could just as
>> easily use a single database with subdirectories that reflect the repos
>> and branches.
>>
>> In this scenario, as an example, is there any compelling reason to use
>> one approach or the other?
>>
>> I like having one database per branch because that seems like a natural
>> mapping that generally keeps things simple and more or less obvious
>> (e.g., doing "list" will show the list of databases, which reflect the
>> repo and branch names in their names).
>>
>> In this application the scale will usually be relatively small: 1000s or
>> 10s of 1000s of individual documents in any given branch, but the
>> querying and indexing, which supports maintaining knowledge of the links
>> within the XML content, could get intense.
>>
>> Cheers,
>>
>> Eliot
>>
>> —————
>> Eliot Kimber, Owner
>> Contrext, LLC
>> http://contrext.com
>>
>>
>>
>

