Re: [basex-talk] BaseX Sharding with ActiveMQ & Docker

2018-01-05 Thread Marco Lettere

Hi James,
as to my experience the approach works.
I've tried to use N Slaves BaseX (DB only) instances coordinated by one 
Master BaseX including HTTP server. On the latter I implemented a RestXQ 
[1] receiving a HTTP request which was then turned into a XQuery and 
executed on the different backend instances through the client module 
[2]. I have been able to achieve a scalability of nearly 80% on some 
particular queries.
The results were encouraging indeed but a few details related to 
parallel computing in general have to be considered.
1) If the query is memory only then you can exploit multi-cores to 
actually achieve parallelism. If the query has to heavily access DBs on 
disk (most of the cases actually are io-bound) then you should ensure 
that your distributed Slaves access different disks or reside on 
different PCs otherwise all the gain of  parallel computation is lost by 
overhead on disk access.
2) Uploading of documents from a single source usually doesn't exploit 
data-parallelism because all the data has to be sequentialized through a 
distributor node first. So if you are able to generate the data from the 
beginning in a distributed way, this could improve the distribution step 
too.
3) Another thing to keep in mind is that you should try to perform as 
much of the work on the Slaves (for instance query and transformation) 
and limit the work on the coordinator to a bare minimum of computing and 
memory usage in order to avoid the coordinator to be flooded with the 
results produced by all your slaves. I achieved this by injecting 
functions as external variables in the XQ to be executed on the Slaves.


Hope this is useful for your work and I'm really looking forward to know 
how it will proceed!

Regards,
Marco.

[1] http://docs.basex.org/wiki/RESTXQ
[2] http://docs.basex.org/wiki/Client_Module

On 03/01/2018 19:50, James Sears wrote:


In the past I hit a scalability limit with BaseX - a billion+ nodes 
kind of a made querying it a bit slower than I liked.


I thought I'd try and address this, so I’ve written some code and 
placed it in GitHub: https://github.com/jameshnsears/xqa-documentation


What I've done is proof of concept, that's all - no way "finished". 
I'm emailing the list in the hope that what I've done so far might 
generate some constructive criticism. Maybe my approach has potential, 
maybe it doesn't?


There are only four components so far, the first three are Docker 
containers:


* an ActiveMQ instance

* a load balancer

* a shard

* a command line client exists to load the XML, from file, into an 
ActiveMQ queue.


The software requires close to zero configuration. For example, each 
shard you start will automatically receive XML from the load balancer. 
And the load balancer distributes XML so that each shard holds the 
same # of documents.


There's a Travis project associated with the above link - it shows how 
easy it is to run the software end to end.


So far my effort is all about ingesting the XML, before I move further 
I thought I'd canvass some feedback - so if anyone has any then please 
give it :-)


Thanks.





[basex-talk] BaseX Sharding with ActiveMQ & Docker

2018-01-03 Thread James Sears
In the past I hit a scalability limit with BaseX - a billion+ nodes kind of a 
made querying it a bit slower than I liked.

I thought I'd try and address this, so I’ve written some code and placed it in 
GitHub: https://github.com/jameshnsears/xqa-documentation

What I've done is proof of concept, that's all - no way "finished". I'm 
emailing the list in the hope that what I've done so far might generate some 
constructive criticism. Maybe my approach has potential, maybe it doesn't?

There are only four components so far, the first three are Docker containers:
* an ActiveMQ instance
* a load balancer
* a shard
* a command line client exists to load the XML, from file, into an ActiveMQ 
queue.

The software requires close to zero configuration. For example, each shard you 
start will automatically receive XML from the load balancer. And the load 
balancer distributes XML so that each shard holds the same # of documents.

There's a Travis project associated with the above link - it shows how easy it 
is to run the software end to end.

So far my effort is all about ingesting the XML, before I move further I 
thought I'd canvass some feedback - so if anyone has any then please give it :-)

Thanks.