Hi James,
as to my experience the approach works.
I've tried to use N Slaves BaseX (DB only) instances coordinated by one Master BaseX including HTTP server. On the latter I implemented a RestXQ [1] receiving a HTTP request which was then turned into a XQuery and executed on the different backend instances through the client module [2]. I have been able to achieve a scalability of nearly 80% on some particular queries. The results were encouraging indeed but a few details related to parallel computing in general have to be considered. 1) If the query is memory only then you can exploit multi-cores to actually achieve parallelism. If the query has to heavily access DBs on disk (most of the cases actually are io-bound) then you should ensure that your distributed Slaves access different disks or reside on different PCs otherwise all the gain of  parallel computation is lost by overhead on disk access. 2) Uploading of documents from a single source usually doesn't exploit data-parallelism because all the data has to be sequentialized through a distributor node first. So if you are able to generate the data from the beginning in a distributed way, this could improve the distribution step too. 3) Another thing to keep in mind is that you should try to perform as much of the work on the Slaves (for instance query and transformation) and limit the work on the coordinator to a bare minimum of computing and memory usage in order to avoid the coordinator to be flooded with the results produced by all your slaves. I achieved this by injecting functions as external variables in the XQ to be executed on the Slaves.

Hope this is useful for your work and I'm really looking forward to know how it will proceed!
Regards,
Marco.

[1] http://docs.basex.org/wiki/RESTXQ
[2] http://docs.basex.org/wiki/Client_Module

On 03/01/2018 19:50, James Sears wrote:

In the past I hit a scalability limit with BaseX - a billion+ nodes kind of a made querying it a bit slower than I liked.

I thought I'd try and address this, so I’ve written some code and placed it in GitHub: https://github.com/jameshnsears/xqa-documentation

What I've done is proof of concept, that's all - no way "finished". I'm emailing the list in the hope that what I've done so far might generate some constructive criticism. Maybe my approach has potential, maybe it doesn't?

There are only four components so far, the first three are Docker containers:

* an ActiveMQ instance

* a load balancer

* a shard

* a command line client exists to load the XML, from file, into an ActiveMQ queue.

The software requires close to zero configuration. For example, each shard you start will automatically receive XML from the load balancer. And the load balancer distributes XML so that each shard holds the same # of documents.

There's a Travis project associated with the above link - it shows how easy it is to run the software end to end.

So far my effort is all about ingesting the XML, before I move further I thought I'd canvass some feedback - so if anyone has any then please give it :-)

Thanks.


Reply via email to