Hello Fabrice

That’s brilliant, thank you very much. I will keep it in mind for future 
reference.

No, I did not set the DEBUG option and, yes, the database was created from directory content.

Once I find some time, I will rerun the “offending” import with DEBUG enabled
and send more information in case this is indeed a bug. That said, the database
may simply have hit one of its natural limits, which is fine.

All the best



From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: 14 September 2017 09:26
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Possible Bug in BaseX 8.2.3 when importing XML (Was 
RE: A few general questions about BaseX)

Hi Athanasios,

Did you set the DEBUG option to get detailed information?

Could you confirm that you are creating a database from the contents of a directory?
If so, as suggested, you should generate a command script that forces the 
loading order, and use that script to load the data in a known order so you can 
detect where it fails.
You can easily generate such a .bxs file in XQuery with a for loop over file:list().

It should look like this:

<commands>
  <set option="ADDCACHE"/>
  <set option="DEBUG"/>
  <create-db name="mydb"/>
  <add>myphysicalpath</add>
  <add>myphysicalpath</add>
  ..
  <close/>
</commands>
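
A script of this shape can also be generated outside BaseX. The Python sketch below is one possible way to do it, not the method used in this thread: it lists the XML files under a directory, sorts them so the loading order is reproducible, and emits one <add> element per file. The function name build_command_script, the *.xml filter, and the explicit "true" option values are assumptions made for illustration; the root element is written as <commands>, which is what the BaseX XML command syntax uses for multiple commands.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def build_command_script(source_dir, db_name="mydb"):
    """Return the text of a BaseX command script that adds files in sorted order.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Sort the paths so that every run loads the files in the same order.
    files = sorted(p for p in Path(source_dir).rglob("*.xml") if p.is_file())

    lines = [
        "<commands>",
        '  <set option="ADDCACHE">true</set>',
        '  <set option="DEBUG">true</set>',
        f"  <create-db name={quoteattr(db_name)}/>",
    ]
    # One <add> per file; escape() protects '&', '<' and '>' in path names.
    lines += [f"  <add>{escape(str(p.resolve()))}</add>" for p in files]
    lines += ["  <close/>", "</commands>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script for the current directory; redirect it to a .bxs file.
    print(build_command_script(".", "mydb"), end="")
```

Because the order is fixed, the last <add> reported before a failure identifies the file to inspect.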

Best regards,
Fabrice Etanchaud

From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Wednesday, 13 September 2017 11:23
To: basex-talk@mailman.uni-konstanz.de
Cc: 'Alexander Holupirek'; 'Michael Seiferle'; Fabrice ETANCHAUD; 'Bridger 
Dyson-Smith'
Subject: Possible Bug in BaseX 8.2.3 when importing XML (Was RE: [basex-talk] A 
few general questions about BaseX)

Hello everyone

Many thanks to Alexander, Bridger, Fabrice, Michael for getting back to me with 
very detailed responses, these have been really helpful.

A few notes:


1)      The name is Athanasios :D. Sorry, just couldn’t help it, it seemed 
incredibly formal to be addressed via the surname in our communications.
Our mail server advertises the “Surname. Initial” pattern, so I can see where 
the confusion came from.

2)      I think there is scope for adding some sort of “logging” to all 
actions of the server in general. I suspect I have hit a bug, but without logs I 
cannot provide any more illuminating details. Here is what is happening:

a.      During import, I get an error that file somethingsomething140.xml has 
an incredibly long element that cannot be imported at line (blahblah). The 
whole process just dies there.

b.      This is a bug, because if I import JUST the offending file on its 
own, a single-file database is created without any problems and I can query 
it. So the error may be caused by the previous file OR by the way the files 
are loaded. But I have absolutely no way of knowing the “load history” of the 
files, the exception that was caught, or anything else. In fact, once you press 
“OK” in the error dialog box, any database files that have been created are 
lost. In addition, the XML files to import are enumerated in a random order. 
So I had to run the import again and watch each file load, to see that the 
system “breaks” after 254 files (which is suspiciously close to 256). None of 
the files in the vicinity of the offending file caused any problems, so this 
may be a bug that is harder to catch (it is thrown with both the internal and 
external parsers). Following this, I created smaller databases of 250 XML 
files each and then got “predictable” errors about running out of memory and 
not creating indexes, which I can solve more easily.

3)      It’s good to know that I don’t need the original files because that’s a 
lot of space I can get rid of. Thank you.

4)      Seems like the ADDCACHE would have saved me some trouble here, many 
thanks for that, but of course, if you don’t know the file enumeration order, 
you are still stuck in not knowing which files have already been imported.

5)      Michael, logging won’t help with the internal import procedure, unless 
of course you were suggesting writing a quick script to do the import 
“manually”?

6)      Michael, the fork-join and “client connect” are really interesting and 
worth a try before I start connecting things together via Hadoop. Are these 
modules already available to BaseX? Do I simply import their namespace or is it 
not even needed?

Many thanks again.

All the best






From: Bridger Dyson-Smith [mailto:bdysonsm...@gmail.com]
Sent: 12 September 2017 16:53
To: Anastasiou A.
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] A few general questions about BaseX

Hi Anastasiou,
Hopefully some of these answers are somewhat helpful.

On Tue, Sep 12, 2017 at 4:54 AM, Anastasiou A. 
<a.anastas...@swansea.ac.uk> wrote:
Hello everyone

I am trying to load a large number of XML files (~500) into BaseX, each a 
few hundred MB in size.
BaseX fails with a message along the lines of “This is too big for one database”.

Can I please ask:


1)      Are there any logs, beyond the DB logs? If yes, where can I find them?

a.      The reason I am asking is that once basexgui shows the message, 
there is no further indication of what caused the error.
Ideally, I would like to know whether this is a limit on the amount of memory 
or on the number of items (?).
I'm not sure how to enable more verbose logging with the GUI -- hopefully one 
of the devs or power users can weigh in on this.

2)      The parser options include reading XML files from archives, which is 
very convenient, but once the file has been
parsed, does BaseX require the “originals” for queries / returning results?
AFAIK, no it does not. BaseX will query and return results from the internal 
database(s).

3)      Is it possible to do federation with BaseX? In other words, let’s say I 
split a database into two large parts (as per #1),
is it possible to launch two BaseX servers and have them talk to each 
other so that ultimately I query just one of
them and get back unified results?
AFAIK, the preferred method is to split your files across many databases, then 
query multiple databases from a single expression[1]. Others will be able to 
speak to this better, but I don't think there's a straightforward way to run 
multiple BaseX servers in a single JVM.
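
As a rough illustration of splitting files across several databases, the Python sketch below groups a sorted file list into fixed-size batches and assigns each batch a database name. It only plans the split and does not talk to BaseX. The function name, the "part" prefix, and the zero-padded numbering are assumptions rather than BaseX conventions; the default batch size of 250 mirrors the figure mentioned earlier in this thread.

```python
def split_into_batches(paths, batch_size=250, prefix="part"):
    """Map hypothetical database names to batches of file paths.

    The paths are sorted first so that the assignment is reproducible.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(paths)
    return {
        # Batches are numbered from 1: part001, part002, ...
        f"{prefix}{start // batch_size + 1:03d}": ordered[start:start + batch_size]
        for start in range(0, len(ordered), batch_size)
    }


if __name__ == "__main__":
    # Example with made-up file names: 501 files become three batches.
    names = [f"file{i:04d}.xml" for i in range(501)]
    for db_name, batch in split_into_batches(names).items():
        print(db_name, len(batch))
```

Each batch can then be loaded into its own database, and the resulting databases queried together from a single expression as described in [1].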


All the best

Best,
Bridger

[1] http://docs.basex.org/wiki/Databases
