Re: [basex-talk] Text Index just over some elements

Fabrice Etanchaud Thu, 25 Sep 2014 00:41:49 -0700

Dear Oscar,

From what I read, I am not sure you have looked at the underlying BaseX data 
structure yet.


XML files in BaseX are digested into a binary format:

http://docs.basex.org/wiki/Node_Storage

but ‘stored’ raw files are simply copied to the filesystem.

You can only index digested data.

Best regards,
Fabrice
Questel/Orbit


From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Oscar Herrera
Sent: Wednesday, 24 September 2014 19:41
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Text Index just over some elements

== The Scenario ==
What we have is a dynamic collection with information about people who register 
on the site. The information comes from third-party companies that provide it 
as XML via web service calls, and we request each person's information from 
them at the moment that person registers. That is how we got into BaseX: we 
consider it inconvenient to store large XML files in an RDBMS, and I don't see 
the point of parsing and reorganizing everything we receive, mostly because, 
from my point of view, the information is already well structured in these 
large XML files.

These XML files average 2 MB each. Of course, some are very small (80 KB), and 
we have been advised that some might reach 500 MB.

Of all the information we receive, I estimate that at this moment we only need 
around 25%. I thought about having separate databases with full and partial 
information, but on one hand the requirements are not entirely defined, and on 
the other, there is information that we use in queries and other information 
that we still need to display to its owner, which we do using XSLT.

== Question 1: Indexes are only required for some fields ==
We usually need to locate records by some id, or to query over some of the 
elements in the XML files, and those are pretty much always the same, so those 
are the elements I'd like to have indexed. I don't see a reason for indexing 
the contents of all elements, since it is unlikely (at least right now) that 
we'll use those indexes, and they consume a lot of hard drive space.

== Question 2: to store files on the filesystem or as raw on BaseX? ==
Right now we store the information we receive as XML files on the filesystem, 
on a RAID 10. What is your advice: keep the files stored directly on the 
filesystem, or let BaseX handle them? (I think this is the difference between 
the add/replace and store commands, right?) Is there any article you could 
point me to for reference? As I see it, BaseX currently handles the queries and 
the index information but depends on the filesystem to retrieve the entire 
document. Am I right?

== Question 3: dynamic optimize and index updates? ==
As you can imagine, I'll need to keep the indexes updated, since "data mining" 
will be done on the information from the people registered. I've seen that it 
is not possible to run the "optimize" command while the app is up, and I'm not 
sure whether the indexes get updated in real time either. This is troubling me, 
since the idea is to have the app running 24x7, and if we get a lot of 
registered users, updating the indexes or optimizing the db will take a long 
time, won't it? Are there any strategies for this?

== Question 4: connection pooling ==
I have only found XQJ-Pool for use with BaseX. Does anybody know of any other 
pooling mechanism available for BaseX?

Thank you so much for your help with this subject, and sorry for the very long 
email ;)

Oscar H




2014-09-24 3:21 GMT-05:00 Fabrice Etanchaud <fetanch...@questel.com>:
Hi Oscar,

You will have to maintain a separate collection in order to do that.

That separate collection will contain the node-pre or node-id of each value to 
be indexed.
Storing the node-pre is faster, but it requires an append-only main collection 
if you do not want to recreate the entire separate collection after each update 
of the main collection.


When new documents are added to the main collection:

1. Add the new map entries (value to be indexed, node-pre or node-id) to the 
separate collection

2. Reindex the separate collection

An even faster solution is to store the values in text nodes and the node-pre 
or node-id in attributes, so that only a text index has to be created (or vice 
versa). That will speed up reindexing.

To use this custom index:

1. Use the db:attribute or db:text function on the separate collection to 
obtain the list of node-id or node-pre values associated with a given value,

2. For each node-xx, use the corresponding db:open-xx function (db:open-pre or 
db:open-id) on the main collection to obtain the real node.
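
As a rough illustration of the two steps above, here is a minimal sketch in 
plain Python. It does not use BaseX or its API at all: a dictionary stands in 
for the separate collection, and integers assigned in document order stand in 
for node-pre values. The sample records and every name in it (MAIN_DOCS, 
INDEXED_ELEMENTS, load_main, build_side_index, lookup) are hypothetical and 
only serve to show the idea of indexing some elements and not others.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical main collection: one small <person> record per document.
MAIN_DOCS = [
    "<person><id>A17</id><name>Ana</name><notes>long text nobody searches</notes></person>",
    "<person><id>B42</id><name>Luis</name><notes>more unsearched text</notes></person>",
    "<person><id>C99</id><name>Ana</name><notes>yet more text</notes></person>",
]

# Assumption for this sketch: only these element names are worth indexing.
INDEXED_ELEMENTS = {"id", "name"}


def load_main(docs):
    """Parse the documents and number every element in document order.

    The integers play the role of node-pre values in this model.
    """
    nodes = {}
    next_pre = 0
    for xml in docs:
        for elem in ET.fromstring(xml).iter():
            nodes[next_pre] = elem
            next_pre += 1
    return nodes


def build_side_index(nodes):
    """Model of the separate collection: indexed value -> list of node numbers."""
    index = defaultdict(list)
    for pre, elem in nodes.items():
        if elem.tag in INDEXED_ELEMENTS and elem.text:
            index[elem.text].append(pre)
    return index


def lookup(index, nodes, value):
    """Step 1: value -> node numbers. Step 2: node numbers -> real nodes."""
    return [nodes[pre] for pre in index.get(value, [])]


main_nodes = load_main(MAIN_DOCS)
side_index = build_side_index(main_nodes)
hits = lookup(side_index, main_nodes, "Ana")
print([(hit.tag, hit.text) for hit in hits])
```

Because the numbers are assigned in document order, appending a document only 
adds new entries at the end, whereas inserting or deleting in the middle would 
shift the numbers after it. That is the reason the node-pre variant wants an 
append-only main collection.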

If you are familiar with CouchBase/CouchDB, it’s a little like creating a 
view ;-)

But such a built-in feature would be great!

Best regards,
Fabrice Etanchaud
Questel/Orbit


From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Oscar Herrera
Sent: Wednesday, 24 September 2014 03:30
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Text Index just over some elements

Hi,
I'm trying to tune my BaseX with a text index over certain elements only. Is 
that possible? What I have found so far is how to create a text index, but I 
have plenty of nodes in my documents that don't need to be indexed, since it is 
very unlikely that a search over their values will occur.
So, is there any way to create a text index over certain elements only, and not 
all of them? If not, is this planned for the near future?

Thank you,

Oscar H
