Hi Deepak,
My two cents:

Here is what I have found efficient:

  *   Use an append-only approach: instead of replacing a document, just add it
to your collection. You will then need a way to retrieve the latest version of
a given logical document, based on a business key you identify. For that
purpose, you can create a kind of buffer collection (or set of collections)
that receives all incoming documents.
  *   On a schedule, build new collections containing only the latest versions
taken from the append-only collections (a new baseline), empty your buffers,
and restart (see the sketch below).
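
A minimal XQuery sketch of that pattern, as three separate queries (the
database names 'buffer' and 'baseline', the @key/@ts attributes on the root
elements, and the placeholders $doc, $path and $bk are all hypothetical;
db:add and db:get are standard BaseX functions, db:get being named db:open
before BaseX 10):

  (: 1. Append-only insert: never replace, just add the incoming document. :)
  db:add('buffer', $doc, $path)

  (: 2. Latest version of one logical document, by business key. :)
  (for $v in db:get('buffer')/*[@key = $bk]
   order by xs:dateTime($v/@ts) descending
   return $v)[1]

  (: 3. On schedule: copy the latest version of each key into a fresh DB,
     then drop and recreate 'buffer' to restart the cycle. :)
  for $key in distinct-values(db:get('buffer')/*/@key)
  let $latest := (for $v in db:get('buffer')/*[@key = $key]
                  order by xs:dateTime($v/@ts) descending
                  return $v)[1]
  return db:add('baseline', $latest, $key || '.xml')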

Do you think this could be applied to your use case?

Best regards from the French west coast,
Fabrice.


From: BaseX-Talk <basex-talk-boun...@mailman.uni-konstanz.de> On Behalf Of
Deepak Dinakara
Sent: Thursday, December 28, 2023 09:39
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Help - Regarding Performance Improvement


Hi,

I am reaching out for suggestions on improving performance.
I am using BaseX to store and analyze around 350,000 to 500,000 XML documents.
The size of each XML varies from a few KB to 5 MB, and around 10,000 XMLs are
added or patched each day.
I have the following questions:
1) What is the optimal size or number of documents per DB? Earlier I had one
DB with different collections, but inserts were too slow: it took more than
30 s just to replace a document. So I split the data up by category into
around 30 DBs. Inserts are fine now, but if a category contains too many
documents, patching that DB slows down, and querying across all DBs also gets
slower. Is there an optimal number of DBs? Could I create many DBs, say one
for every 10,000 XMLs? I read through
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06310.html,
which suggests that having hundreds of DBs degrades query performance. Is
there a better solution?
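
For illustration, my current cross-DB querying looks roughly like this (a
sketch; the element and attribute names are placeholders, and db:list and
db:get are standard BaseX functions):

  (: iterate over every database and collect matching elements :)
  for $name in db:list()
  return db:get($name)//record[@status = 'open']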
2) Query performance has degraded as the number of documents per DB has
grown. I also noticed that, with or without the token/attribute index, there
is not much difference in query performance (the queries are plain XML
attribute lookups). Running OPTIMIZE after inserts to recreate the indexes
takes too much time and memory, so I am not running it now, since my tests
showed no significant improvement with or without the indexes. Any
suggestions for improving this?
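
For context, the optimize step I mention boils down to this (a sketch; the
DB name and the query shape are placeholders; db:optimize and db:get are
standard BaseX functions):

  (: full optimize after inserts: rebuilds all indexes, slow on large DBs :)
  db:optimize('mydb', true())

  (: the analysis queries are plain attribute lookups of this shape :)
  db:get('mydb')//item[@code = '12345']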
3) Is it possible to run queries against only specific XMLs? I will have a
pre-filter based on user selection, and the queries need to run against only
those XMLs. There are a number of filters users can apply, and each
combination can yield a different set of XMLs to analyze (hence it is not
feasible to create a collection per combination). Right now I am querying
against all XMLs, even though I am interested in only a subset, and doing
post-filtering. I did go through
https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000495.html,
but again, a regex that includes all the file paths of interest (sometimes
the entire set of documents) will slow things down.
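
To make the goal concrete, the pre-filtered access I am after would look
roughly like this (a sketch; the DB name and paths are hypothetical, and
db:get with a path argument is a standard BaseX call):

  (: open only the documents selected by the user's filters,
     instead of the whole DB :)
  let $paths := ('orders/2023/a.xml', 'orders/2023/b.xml')
  for $p in $paths
  return db:get('mydb', $p)//item[@code = '12345']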

Thank you,
Deepak
