[basex-talk] BaseX Sharding with ActiveMQ & Docker

2018-01-03 Thread James Sears
In the past I hit a scalability limit with BaseX - a billion+ nodes made querying it a bit slower than I liked.

I thought I'd try and address this, so I’ve written some code and placed it in 
GitHub: https://github.com/jameshnsears/xqa-documentation

What I've done is a proof of concept, that's all - in no way "finished". I'm
emailing the list in the hope that what I've done so far might generate some 
constructive criticism. Maybe my approach has potential, maybe it doesn't?

There are only four components so far; the first three are Docker containers:
* an ActiveMQ instance
* a load balancer
* a shard
* a command line client, which loads the XML from file into an ActiveMQ queue.

The software requires close to zero configuration. For example, each shard you 
start will automatically receive XML from the load balancer. And the load 
balancer distributes XML so that each shard holds the same number of documents.
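As a flavour of the shard side, the sort of XQuery a shard might run against its BaseX database is roughly the following - a sketch only, not what the actual code does, with the database name, document path and payload variable all assumed:

(: these are two separate queries: db:add() is updating, count() is read-only :)

(: 1) ingest one XML payload taken off the ActiveMQ queue - names assumed :)
declare variable $payload external;
db:add('shard', $payload, 'docs/' || random:uuid() || '.xml')

(: 2) report this shard's document count, so the balancer can keep the shards even :)
count(db:open('shard'))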

There's a Travis project associated with the above link - it shows how easy it 
is to run the software end to end.

So far my effort is all about ingesting the XML. Before I move any further I thought I'd canvass some feedback - so if anyone has any then please give it :-)

Thanks.



Re: [basex-talk] Improving performance in a 40GB database

2016-07-11 Thread James Sears
Hi Max - you have given me food for thought.

I was not aware of the selective indexing and will investigate.
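From a quick look at the wiki page, the kind of thing I'll be looking at is passing index options to db:optimize - a sketch only, with the database and element names assumed:

(: rebuild the indexes, but only index the text of a few named elements - names assumed :)
db:optimize('books', true(), map {
  'textinclude': 'title author',  (: restrict the text index to these elements :)
  'ftindex': true(),
  'ftinclude': 'article'          (: restrict the full-text index :)
})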

Partitioning might well be the answer - I'll ask around and see if the people 
close to the data have some ideas about how data might be logically 
partitioned. Maybe having a master database - without the OCR'd word elements - "pointing" at various database instance partitions is the answer?

Thanks for the ideas.

> From: 
> Date: Fri, 8 Jul 2016 15:24:16 +0200
> Subject: Re: [basex-talk] Improving performance in a 40GB database
> To: james.hn.se...@outlook.com
> CC: basex-talk@mailman.uni-konstanz.de
> 
> Hi James,
> 
> did you see: http://docs.basex.org/wiki/Index#Selective_Indexing
> 
> Otherwise for large amounts of data I found it helpful to create
> another database with only contents that need to be full text
> searchable or otherwise help to limit the scope of the data. Then,
> with more specific information (e.g. title or category names etc.)
> fetch the data from the main database.
> 
> As well, databases in BaseX are really lightweight, so you could
> create multiple dbs and spread the 40 GB over several dbs. Then, query
> multiple dbs at once. Could be faster than querying one large db (and
> updating is much faster as well).
> 
> 
> Regards,
> 
> Max

  

Re: [basex-talk] Improving performance in a 40GB database

2016-07-07 Thread James Sears
Hi Liam - at the moment, I'm expecting our most common use case will be to search across all the books. It's not possible to be more specific until later on in the query.

In fact, it's likely that specific filtering would be done via an export of the XML that gets imported into Excel - where standard filters would be applied by the (non-technical) user.
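The export itself would be something like the sketch below - element and attribute names are guesses, just to show the shape of it - flattening the hits into CSV that Excel can open:

(: flatten matches into rows and serialize them as CSV - names are guesses :)
let $rows :=
  <csv>{
    for $article in //book/page/article
    return
      <record>
        <book>{ string($article/ancestor::book/@title) }</book>
        <article>{ string($article/@heading) }</article>
      </record>
  }</csv>
return csv:serialize($rows, map { 'header': 'yes' })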

> Subject: Re: [basex-talk] Improving performance in a 40GB database
> From: l...@w3.org
> To: james.hn.se...@outlook.com
> CC: basex-talk@mailman.uni-konstanz.de
> Date: Wed, 6 Jul 2016 14:56:43 -0400
> 
> On Wed, 2016-07-06 at 16:02 +0100, James Sears wrote:
> > 
> > let $void := prof:void(for $book in //book
> 
> Are you able to make that //book more specific? or can book elements
> occur at any level? BaseX is possibly fetching every element node in
> the database to see if it's an element of type book or not.
> 
> or can you look at the generated query plan in more detail?
> 
> Liam
> 
> -- 
> Liam R. E. Quin 
> The World Wide Web Consortium (W3C)
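
PS. To make Liam's suggestion concrete to myself - assuming books only ever sit directly under a single root element, which I'd still need to confirm - a rooted path would look like this:

(: assumed structure: a single corpus root with book children :)
for $book in /corpus/book   (: rooted path, rather than //book :)
return $book/page/article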
  

Re: [basex-talk] Improving performance in a 40GB database

2016-07-06 Thread James Sears
Hi Christian - I've built a new database, using the same data except that this time I stripped out the OCR'd word elements.

My estimate of the OCR'd word elements representing 85% of the data was wrong: they actually represent 96.5% of the data. This means the database files have shrunk from 40GB to 1.5GB.

Instead of the database having ~1.5 billion nodes it now has ~78 million.

Reducing the problem space means the following XQuery - run in the BaseX 8.5 GUI - has gone from an average of 148000ms to 3900ms:

let $start := prof:current-ns()
let $void := prof:void(
  (: element constructors below are assumed - adjust to the intended output structure :)
  for $book in //book
  return
    <book>
    {
      for $page in $book/page
      return
        <page>
        {
          for $article in $page/article
          return <article/>
        }
        </page>
    }
    </book>
)
let $end := prof:current-ns()
(: nanoseconds to milliseconds :)
let $ms := ($end - $start) div 1000000
return $ms || ' ms'

This is good news. However, doesn't this show an issue in how BaseX maintains its indexes? What I mean is that the OCR'd word elements are nested a few levels down inside each book. If my XQuery doesn't care about the word elements - why is it still affected by them?

Thanks.

> From: christian.gr...@gmail.com
> Date: Tue, 5 Jul 2016 17:52:40 +0200
> Subject: Re: [basex-talk] Improving performance in a 40GB database
> To: james.hn.se...@outlook.com
> CC: basex-talk@mailman.uni-konstanz.de
> 
> Hi James,
> 
> > Individual OCR'd words on pages maybe comprise around 85% of the data - and 
> > I don't actually care about this data. So maybe if I just don't load these 
> > OCR'd words it will help? I haven't tried that yet, but ideally I'd like 
> > not to have to do it.
> 
> Some (more or less obvious) questions back:
> 
> * How large is the resulting XML document (around 15% of the original 
> document)?
> * How do you measure the time?
> * Do you store the result on disk?
> * How long does your query take if you wrap it into a count(...) or
> prof:void(...) function?
> 
> Thanks in advance,
> Christian
  

[basex-talk] Improving performance in a 40GB database

2016-07-05 Thread James Sears
Hi - I'm wondering what sort of recommendations anyone has for making my 
queries "go faster".

I created a database by loading 21GB of XML data - representing the OCR'd contents of 5000 books - into a new BaseX 8.4.4 instance.

I run optimize after the import.
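In XQuery terms the setup amounts to roughly the following - the database name and source path are placeholders:

(: step 1 - create the database from the OCR'd XML ('books' and the path are placeholders) :)
db:create('books', '/data/ocr/')

(: step 2 - run afterwards as a separate query: rebuild all index structures :)
db:optimize('books', true())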

basexgui> Properties> Information says it's sized at 40GB and contains around 
1.5 billion nodes; there are no binaries.

The sorts of query I'm interested in running are those that search each article 
on each page of each book - for example:

for $book in //book
return
  (: element constructors are assumed - the real query builds an element per book/page/article :)
  <book>
  {
    for $page in $book/page
    return
      <page>
      {
        for $article in $page/article
        return <article/>
      }
      </page>
  }
  </book>

On a reasonably powerful i5 laptop with SSD + plenty of RAM, this query takes around 148,550 ms - I'd like to significantly reduce this.

Individual OCR'd words on pages maybe comprise around 85% of the data - and I 
don't actually care about this data. So maybe if I just don't load these OCR'd 
words it will help? I haven't tried that yet, but ideally I'd like not to have 
to do it.
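
If I do end up stripping them before loading, I imagine it would look something like this - the word element name ('w') is a guess I'd need to check against the real data:

(: add each file with the OCR'd word elements removed - 'w' is a guessed tag name,
   and the target database 'books-slim' is assumed to exist already :)
for $file in file:list('/data/ocr/', false(), '*.xml')
let $doc := doc('/data/ocr/' || $file)
return db:add('books-slim',
  copy $c := $doc
  modify delete node $c//w
  return $c,
  $file)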

So, if anyone has any tips or ideas for reducing the query time then I'd be 
very interested in hearing what you have to say.