Hi James,
as to my experience the approach works.
I've tried to use N Slaves BaseX (DB only) instances coordinated by one
Master BaseX including HTTP server. On the latter I implemented a RestXQ
[1] receiving a HTTP request which was then turned into a XQuery and
executed on the different backend instances through the client module
[2]. I have been able to achieve a scalability of nearly 80% on some
particular queries.
The results were encouraging indeed but a few details related to
parallel computing in general have to be considered.
1) If the query is memory only then you can exploit multi-cores to
actually achieve parallelism. If the query has to heavily access DBs on
disk (most of the cases actually are io-bound) then you should ensure
that your distributed Slaves access different disks or reside on
different PCs otherwise all the gain of parallel computation is lost by
overhead on disk access.
2) Uploading of documents from a single source usually doesn't exploit
data-parallelism because all the data has to be sequentialized through a
distributor node first. So if you are able to generate the data from the
beginning in a distributed way, this could improve the distribution step
too.
3) Another thing to keep in mind is that you should try to perform as
much of the work on the Slaves (for instance query and transformation)
and limit the work on the coordinator to a bare minimum of computing and
memory usage in order to avoid the coordinator to be flooded with the
results produced by all your slaves. I achieved this by injecting
functions as external variables in the XQ to be executed on the Slaves.
Hope this is useful for your work and I'm really looking forward to know
how it will proceed!
Regards,
Marco.
[1] http://docs.basex.org/wiki/RESTXQ
[2] http://docs.basex.org/wiki/Client_Module
On 03/01/2018 19:50, James Sears wrote:
In the past I hit a scalability limit with BaseX - a billion+ nodes
kind of a made querying it a bit slower than I liked.
I thought I'd try and address this, so I’ve written some code and
placed it in GitHub: https://github.com/jameshnsears/xqa-documentation
What I've done is proof of concept, that's all - no way "finished".
I'm emailing the list in the hope that what I've done so far might
generate some constructive criticism. Maybe my approach has potential,
maybe it doesn't?
There are only four components so far, the first three are Docker
containers:
* an ActiveMQ instance
* a load balancer
* a shard
* a command line client exists to load the XML, from file, into an
ActiveMQ queue.
The software requires close to zero configuration. For example, each
shard you start will automatically receive XML from the load balancer.
And the load balancer distributes XML so that each shard holds the
same # of documents.
There's a Travis project associated with the above link - it shows how
easy it is to run the software end to end.
So far my effort is all about ingesting the XML, before I move further
I thought I'd canvass some feedback - so if anyone has any then please
give it :-)
Thanks.