Re: [basex-talk] "Out of Memory" when inserting data from one DB to another

2019-10-29 Thread BIRKNER Michael
Hi Christian,


it's been a while now and I have tested quite a few things. Your solution of 
writing an intermediate XML file to disk and importing it was the fastest and 
easiest one. Thank you for that.


Best regards,

Michael


Mag. Michael Birkner
AK Wien - Bibliothek
1040, Prinz Eugen Straße 20-22
T: +43 1 501 65 12455
F: +43 1 501 65 142455
M: +43 664 88957669

michael.birk...@akwien.at
wien.arbeiterkammer.at





Re: [basex-talk] "Out of Memory" when inserting data from one DB to another

2019-10-01 Thread Christian Grün
Hi Michael,

Your query looks pretty straightforward. As you have already guessed, it’s
simply the big number of inserted nodes that causes the memory error.

Is there any chance to assign more memory to your BaseX Java process? If
not, you may need to write an intermediate XML document with the desired
structure to disk and reimport this file in a second step. You could also
call your function multiple times, and insert only parts of your source data in a single 
in a single run.

Hope this helps,
Christian
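
A minimal sketch of the intermediate-file approach, assuming the database and 
element names from the simplified examples below; the target path and the name 
of the merged database are hypothetical:

  (: build one merged document and serialize it to disk; note that this
     still materializes the merged document in main memory once before
     writing, so the server process needs enough heap for that step :)
  let $infoRecs := db:open('db-with-data')/collection/record
  return file:write(
    '/tmp/merged.xml',
    <collection>{
      for $mainRec in db:open('db-to-insert-data')/collection/record
      let $infoRec := $infoRecs[id = $mainRec/id]
      return element record {
        $mainRec/*,
        $infoRec/*[not(name() = 'id')]
      }
    }</collection>
  )

The file can then be re-imported in a second step, e.g. with 
db:create('db-merged', '/tmp/merged.xml', 'merged.xml') or a CREATE DB command. 
For the batching alternative, a positional predicate such as 
db:open('db-with-data')/collection/record[position() = 1 to 100000] would 
restrict each run to one slice of the source records.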


On Fri, Sep 27, 2019 at 12:05 PM BIRKNER Michael 
wrote:

> Hi to all,
>
> I get an "Out of Memory" error (using the BaseX GUI on Ubuntu Linux) when
> I try to insert quite a lot of data into a BaseX database. The use case: I
> have a database (size is about 2600 MB, 13718400 nodes) with information in
>  elements that should be added to  elements in another
> database. The s have a 1 to 1 connection identified by an ID that
> is available in both databases.
>
> An example (simplified) of the DB with the information I want to add to
> the other DB:
>
> 
>   
> 1
> Some data
> More data
> More data
> ...
>   
>   
> 2
> Some data
> More data
> More data
> ...
>   
>   
> 3
> Some data
> More data
> More data
> ...
>   
>   ... many many more s
> 
>
> Here an example (simplified) of the DB to which the above  elements
> should be added:
>
> 
>   
> 1
> Main data
> More main data
> More main data
> ...
> 
>   
>   
> 2
> Main data
> More main data
> More main data
> ...
> 
>   
>   
> 3
> Main data
> More main data
> More main data
> ...
> 
>   
>   ... many many more s
> 
>
> This is the XQuery I use to insert the given  elements from the
>  in one database to the corresponding  in the other
> database. It results in an "Out of Memory" error:
>
> let $infoRecs := db:open('db-with-data')/collection/record
> let $mainRecs := db:open('db-to-insert-data')/collection/record
> for $infoRec in $infoRecs
>   let $id := data($infoRec/id)
>   let $mainRec := $mainRecs[id=$id]
>   let $dataToInsert := $infoRec/*[not(name()='id')]
>   return insert node ($dataToInsert) into $mainRec
>
> I assume that the error is a result of the large amount of data that is
> processed. My question is if a strategy exists to work with such an amount
> of data without getting an "Out of Memory" error?
>
> Thanks very much to everyone in advance for any hint and advice. If you
> need more information about DB setup or options just let me know.
>
> Best regrads,
> Michael
>
>
>


[basex-talk] "Out of Memory" when inserting data from one DB to another

2019-09-27 Thread BIRKNER Michael
Hi to all,

I get an "Out of Memory" error (using the BaseX GUI on Ubuntu Linux) when I try 
to insert quite a lot of data into a BaseX database. The use case: I have a 
database (size is about 2600 MB, 13718400 nodes) with information in  
elements that should be added to  elements in another database. The 
s have a 1 to 1 connection identified by an ID that is available in 
both databases.

An example (simplified) of the DB with the information I want to add to the 
other DB:

<collection>
  <record>
    <id>1</id>
    <info>Some data</info>
    <info>More data</info>
    <info>More data</info>
    ...
  </record>
  <record>
    <id>2</id>
    <info>Some data</info>
    <info>More data</info>
    <info>More data</info>
    ...
  </record>
  <record>
    <id>3</id>
    <info>Some data</info>
    <info>More data</info>
    <info>More data</info>
    ...
  </record>
  ... many many more <record>s
</collection>

Here is an example (simplified) of the DB to which the above <info> elements 
should be added:

<collection>
  <record>
    <id>1</id>
    <maindata>Main data</maindata>
    <maindata>More main data</maindata>
    <maindata>More main data</maindata>
    ...
  </record>
  <record>
    <id>2</id>
    <maindata>Main data</maindata>
    <maindata>More main data</maindata>
    <maindata>More main data</maindata>
    ...
  </record>
  <record>
    <id>3</id>
    <maindata>Main data</maindata>
    <maindata>More main data</maindata>
    <maindata>More main data</maindata>
    ...
  </record>
  ... many many more <record>s
</collection>

This is the XQuery I use to insert the given <info> elements from the <record>s 
in one database into the corresponding <record>s in the other database. It 
results in an "Out of Memory" error:

let $infoRecs := db:open('db-with-data')/collection/record
let $mainRecs := db:open('db-to-insert-data')/collection/record
for $infoRec in $infoRecs
  let $id := data($infoRec/id)
  let $mainRec := $mainRecs[id=$id]
  let $dataToInsert := $infoRec/*[not(name()='id')]
  return insert node ($dataToInsert) into $mainRec

I assume that the error is a result of the large amount of data that is 
processed. My question is whether a strategy exists for working with such an 
amount of data without getting an "Out of Memory" error.

Thanks very much to everyone in advance for any hint and advice. If you need 
more information about DB setup or options just let me know.

Best regards,
Michael




[basex-talk] Out of memory getting document from Client API

2016-03-29 Thread buddyonweb-software
Using the client API I am trying to retrieve a 600 MB file from BaseX. The 
version I am using is BaseX 8.3.
I have successfully stored this document using the client API without issue.
As an aside, I have updated the basexserver script to give the Java process 
4 GB of memory via a command-line option. Here is the snippet from that script:

# Options for virtual machine (can be extended by global options)
BASEX_JVM="-Xmx4g $BASEX_JVM"

In my Java client-side application I have memory set to 3 GB (script snippet 
below):

exec "$JAVACMD" -Xms1g -Xmx3g ...rest of args...

Here is a snippet of my Java code. Note that this works just fine when 
operating on smaller documents, but I wanted to stress test the code.

String query = "doc('theCollectionName/extract.xml')";
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ClientSession session = new ClientSession(host, port, user, pwd);
session.setOutputStream(baos);
ClientQuery cq = session.query(query);
cq.execute();

However, I keep getting out of memory with the below stack trace. I would 
think that the amount of space I am providing in both the basexserver script 
(4GB) and my client-side app (3GB) would be plenty to read in a 600MB file from 
BaseX. Is there something I am overlooking or something I am not doing 
correctly? Any insights/feedback is appreciated.

Exception in thread "queue" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
    at java.lang.StringCoding.decode(StringCoding.java:193)
    at java.lang.StringCoding.decode(StringCoding.java:254)
    at java.lang.String.<init>(String.java:534)
    at java.io.ByteArrayOutputStream.toString(ByteArrayOutputStream.java:221)
    at org.basex.api.client.ClientSession.exec(ClientSession.java:261)
    at org.basex.api.client.ClientQuery.execute(ClientQuery.java:102)
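
A possible workaround, sketched under the assumption that the server may write 
to its local file system: skip the client-side buffer entirely and let the 
server serialize the document to disk (the target path is hypothetical):

  (: serialize the large document on the server instead of collecting it in
     the client's ByteArrayOutputStream, whose toString() call is where the
     trace above runs out of heap :)
  file:write('/tmp/extract.xml', doc('theCollectionName/extract.xml'))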



Re: [basex-talk] Out Of Memory

2015-01-06 Thread Christian Grün
Hi Mansi,

 curl -ig
 'http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::D/@name/string()'
 | cut -d: -f1 | cut -d. -f1-3 | sort | uniq -c | sort -n -r

I guess you will get your result much faster by avoiding the post
processing steps and doing everything with XQuery instead:

 (for $n in distinct-values(/Archives/descendant::D/@name)
  group by ...
  order by ... descending
  return ...)[position() = 1 to ..]

Hope this helps,
Christian
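
A concrete, hedged variant of that sketch, run in the context of the opened 
database as in the REST call above; the top-10 cutoff is an arbitrary example:

  (: count the occurrences of each distinct @name and return the most
     frequent values, one "count value" item per result :)
  (for $name in /Archives/*/descendant::D/@name
   let $key := string($name)
   group by $key
   order by count($name) descending
   return count($name) || ' ' || $key
  )[position() = 1 to 10]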




Re: [basex-talk] Out Of Memory

2014-12-30 Thread Mansi Sheth
Hello,

Wanted to get back to this email chain and share my experience.

I got this running beautifully (including all post processing of results),
using the below command:

curl -ig '
http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::D/@name/string()'
| cut -d: -f1 | cut -d. -f1-3 | sort | uniq -c | sort -n -r

I am using a BaseX 8.0 beta 763cc93 build, running this on an i7 2.7 GHz MBP 
and giving 8 GB to the basexhttp process. It took around 34 min on 41 GB of 
data. I think a lot of the time went into post-processing (sorting) the result 
set rather than actually extracting the results from the BaseX DB.

When I tried a similar query on a much smaller database (3 GB) on a much more 
powerful Amazon instance, giving 20 GB RAM to the basexhttp process, I got 
results with post-processing within 4 min.

Thanks for all your inputs guys,

Keep BaseXing... !!!
- Mansi


Re: [basex-talk] Out Of Memory

2014-12-30 Thread Florent Gallaire
For my uses, string() seems to be extremely slow at processing big data; you
should try without it.

Best regards

Florent
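
One way to drop the per-node string() calls while keeping the same output is 
to atomize the attribute sequence instead:

  (: atomize the attributes with data() rather than calling string() on
     each node; for untyped attribute values the result is the same :)
  /Archives/*/descendant::D/@name/data()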


Re: [basex-talk] Out Of Memory

2014-11-07 Thread Fabrice Etanchaud
Hi Mansi,

From what I can see, for each pqr value you could use db:attribute-range to 
retrieve all the file names, then group by/count to obtain statistics.
You could also create a new collection from an extraction of only the data you 
need, changing @name into an element, and use full-text fuzzy matching.

Hoping it helps

Cordialement
Fabrice
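
A hedged sketch of the first suggestion; the database name is hypothetical, 
and the upper bound is a crude stand-in for a starts-with() filter on one pqr 
value:

  (: fetch @name attributes in a value range straight from the attribute
     index, then count them per input file :)
  for $a in db:attribute-range('your-db', 'pqr', 'pqrzzzz', 'name')
  group by $path := db:path($a)
  order by count($a) descending
  return $path || ' ' || count($a)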



Re: [basex-talk] Out Of Memory

2014-11-07 Thread Christian Grün
Hi Mansi,

 Once, above query works and doesn't go Out Of Memory, I also need 
 corresponding file names too:

Sorry, I skipped this one. Here is one way to do it:

declare option output:item-separator "&#xa;";
for $db in db:open('')
let $path := db:path($db)
for $name in $db//E/@name
return $path || out:tab() || $name

I was surprised to hear that you are getting OOM errors on
command-line, because the query you mentioned should then be evaluated
in a streaming fashion (i. e., it should require very low and constant
memory).

Could you try the above query? If it fails, could you possibly send me
the query plan? On command line, it can be retrieved via the -x flag.

I just remember that you have been using xquery:eval, right? My guess
is that it occurs in combination with this function, because it may
require all results to be cached before they are being sent back to
the client. Do you think you can alternatively put your queries into
files, or do you need more flexibility?

Christian
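
A hedged extension of the query above with the whitelist Mansi mentions 
(picking up only values for starts-with(@name, pqr)); the prefix list is a 
hypothetical stand-in for the ~150 real values, and the database name is again 
left open:

  declare option output:item-separator "&#xa;";
  let $prefixes := ('pqr1', 'pqr2')
  for $db in db:open('')
  let $path := db:path($db)
  for $name in $db//E/@name[some $p in $prefixes satisfies starts-with(., $p)]
  return $path || out:tab() || $name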



On Thu, Nov 6, 2014 at 8:58 PM, Mansi Sheth mansi.sh...@gmail.com wrote:
 Briefly explaining: I am trying to extract these values per xml file (where
 the .xml file names are IDs), to map them to their corresponding values.

 Imagine you have 100s of customers, and each customer uses/needs 1000s of
 different @name. These @name would be similar across customers, but some
 customers would be using certain values and others different ones. I am
 trying to collect all this information and find which @name is used by the
 most customers, and so on and so forth. There are a few such use cases, this
 one being the most generic.



Re: [basex-talk] Out Of Memory

2014-11-07 Thread Christian Grün
 do you need more flexibility?

To partially answer my own question, it might be interesting for you
to hear that you have various ways of specifying queries via REST [1]:

* You can store your query server-side and use the ?run=... argument
to evaluate this query file
* You can send a POST request, which contains the query to be evaluated.

In both cases, intermediate results won't be cached, but directly
streamed back to the client.

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/REST
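
As a hedged illustration of the first option: a server-side query file (say 
get_query.xq, matching the rest?run=get_query.xq&n=... calls elsewhere in this 
thread) can receive the path expression through an external variable and 
evaluate it with xquery:eval; the database name is hypothetical:

  (: contents of a hypothetical get_query.xq; the REST interface binds
     non-reserved URL parameters such as n to external variables :)
  declare variable $n external;
  xquery:eval($n, map { '': db:open('your-db') })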





Re: [basex-talk] Out Of Memory

2014-11-07 Thread Mansi Sheth
This email chain is extremely helpful. Thanks a ton, guys. Certainly some of
the most helpful folks here :)

I have to try a lot of these suggestions but currently I am being pulled
into something else, so I have to pause for the time being.

Will get back to this email thread, after trying a few things and my
relevant observations.

- Mansi



Re: [basex-talk] Out Of Memory

2014-11-06 Thread Mansi Sheth
This would need a lot of details, so bear with me below:

Briefly my XML files look like:

<A name="...">
  <B name="...">
    <C name="...">
      <D name="...">
        <E name="..."/>

<A> can contain <B>, <C> or <D>, and <B>, <C> or <D> can contain <E>. We have
1000s (currently 3000 in my test data set) of such xml files, of 50 MB average
size. It's tons of data! Currently, my database is ~18 GB in size.

Query: /A/*//E/@name/string()

This query was going OOM within a few minutes.

I tried a few ways of whitelisting, with a contains clause, to truncate the
result set. That didn't help either. So now I am out of ideas. This is with
the JVM given 10 GB of dedicated memory.

Once the above query works and doesn't go Out Of Memory, I also need the
corresponding file names too:

XYZ.xml //E/@name
PQR.xml //E/@name

Let me know if you need more details to appreciate the issue.
- Mansi

On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün christian.gr...@gmail.com
wrote:

 Hi Mansi,

 I think we need more information on the queries that are causing the
 problems.

 Best,
 Christian



 On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth mansi.sh...@gmail.com wrote:
  Hello,
 
  I have a use case where I have to extract lots of information from each XML
  in each DB, something like the attribute values of most of the nodes in an
  XML. For such queries, it goes Out Of Memory with the below exception. I am
  giving it ~12GB of RAM on an i7 processor. Well, I can't complain here,
  since I am most definitely asking for loads of data, but is there any way I
  can get these kinds of data successfully?
 
  mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
  BaseX 8.0 beta b45c1e2 [Server]
  Server was started (port: 1984)
  HTTP Server was started (port: 8984)
  Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
      at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
      at java.lang.Thread.run(Thread.java:744)
 
 
  --
  - Mansi




-- 
- Mansi


Re: [basex-talk] Out Of Memory

2014-11-06 Thread Fabrice Etanchaud
Hi Mansi,

Here you have a natural partition of your data: the files you ingested.
So my first suggestion would be to query your data on a file basis:

for $doc in db:open('your_collection_name')
let $file-name := db:path($doc)
return
  file:write(
    $file-name,
    <names>{
      for $name in $doc//E/@name/data()
      return <name>{$name}</name>
    }</names>
  )

Is it for indexing?

Hope it helps,

Best regards,

Fabrice Etanchaud
Questel/Orbit



Re: [basex-talk] Out Of Memory

2014-11-06 Thread Mansi Sheth
Interesting idea. I thought of using a db partition but didn't pursue it
further, mainly due to the below thought process.

Currently I am ingesting ~3000 xml files, storing ~50 xml files per db, and
this will grow quickly. So the below approach would lead to ~3000 more files
(and increasing), increasing I/O operations considerably for further
pre-processing.

However, I don't really care if the process takes a few minutes to a few hours
(as long as it's not day(s) ;)). Given the situation and my options, I will
surely try this.

The database is currently indexed at the attribute level, as that's what I
will be querying the most. Do you think I should do anything differently?

Thanks,
- Mansi
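
For what it's worth, a hedged way to inspect what the attribute index
currently holds for one prefix (database name hypothetical):

  (: list attribute index entries starting with 'pqr', with occurrence
     counts, to sanity-check the index the queries will rely on :)
  index:attributes('your-db', 'pqr')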



Re: [basex-talk] Out Of Memory

2014-11-06 Thread Graydon Saunders
Hi Mansi --

Just out of habitual paranoia about the performance of *// in XPath, I
might try replacing /A/*//E/@name/string() with
//E[ancestor::A[not(parent::*)]]/@name and not worry about stringifying
the resulting sequence of attribute nodes until the next step,
whatever that might be. It might not matter to the optimizer at all,
but it might.

Also, from your description of the data, do you care where the tree is
rooted, or just that you've got an E? If it _is_ just an E, what you
want might look like:

for $x in //E/@name return (string($x), tokenize(base-uri($x), '/')[last()])

Do you need to worry about cases where @name is empty?

-- Graydon



Re: [basex-talk] Out Of Memory

2014-11-06 Thread Fabrice Etanchaud
The solution depends on the usage you will have of your extraction.
May I ask what your extraction is for?

Best regards,
Fabrice

De : Mansi Sheth [mailto:mansi.sh...@gmail.com]
Envoyé : jeudi 6 novembre 2014 17:11
À : Fabrice Etanchaud
Cc : Christian Grün; BaseX
Objet : Re: [basex-talk] Out Of Memory

Interesting idea, I thought of using db partition, but didn't pursue it 
further, mainly due to below thought process.

Currently, I am ingesting ~3000 xml files, storing ~50 xml files per db, which 
would be growing quickly. So, below approach would lead to ~3000 more files 
(which would be increasing), increasing I/O operations considerably for further 
pre-processing.

However, I don't really care if process takes few minutes to few hours (as long 
as its not day(s) ;)). Given the situation and my options, I would surely try 
this.

Database, is currently indexed at attribute level, as thats what I would be 
querying the most. Do you think, I should do anything differently ?

Thanks,
- Mansi

On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud 
fetanch...@questel.commailto:fetanch...@questel.com wrote:
Hi Mansi,

Here you have a natural partition of your data : the files you ingested.
So my first suggestion would be to query your data on a file basis:

for $doc in db:open(‘your_collection_name’)
let $file-name := db:path($doc)
return
file:write(
$file-name,
names
   {
   for $name in $doc//E/@name/data()
   return
   
name{$name}/name
}
/names
)

Is it for indexing ?

Hope it helps,

Best regards,

Fabrice Etanchaud
Questel/Orbit

De : 
basex-talk-boun...@mailman.uni-konstanz.demailto:basex-talk-boun...@mailman.uni-konstanz.de
 
[mailto:basex-talk-boun...@mailman.uni-konstanz.demailto:basex-talk-boun...@mailman.uni-konstanz.de]
 De la part de Mansi Sheth
Envoyé : jeudi 6 novembre 2014 16:33
À : Christian Grün
Cc : BaseX
Objet : Re: [basex-talk] Out Of Memory

This would need a lot of details, so bear with me below:

Briefly my XML files look like:

A name=
B name=
   C name=
D name=
 E name=/

A can contain B, C, or D, and B, C, or D can contain E. We have 1000s 
(currently 3000 in my test data set) of such XML files, 50MB in size on 
average. It's tons of data! Currently, my database is ~18GB in size.

Query: /A/*//E/@name/string()

This query was going OOM within a few minutes.

I tried a few whitelisting approaches with a contains clause to truncate the 
result set. That didn't help either. So now I am out of ideas. This is with 
the JVM given 10GB of dedicated memory.

Once the above query works and doesn't go Out Of Memory, I also need the 
corresponding file names:

XYZ.xml //E/@name
PQR.xml //E/@name

Let me know if you need more details to appreciate the issue.
- Mansi

On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün 
christian.gr...@gmail.com wrote:
Hi Mansi,

I think we need more information on the queries that are causing the problems.

Best,
Christian



On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth 
mansi.sh...@gmail.com wrote:
 Hello,

 I have a use case where I have to extract lots of information from each XML
 in each DB: something like the attribute values of most of the nodes in an
 XML. Such queries go Out Of Memory with the exception below. I am giving it
 ~12GB of RAM on an i7 processor. Well, I can't complain here, since I am
 most definitely asking for loads of data, but is there any way I can get
 these kinds of data successfully?

 mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
 BaseX 8.0 beta b45c1e2 [Server]
 Server was started (port: 1984)
 HTTP Server was started (port: 8984)
 Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
 at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
 at java.lang.Thread.run(Thread.java:744)


--
- Mansi


Re: [basex-talk] Out Of Memory

2014-11-06 Thread Mansi Sheth
I would be doing tons of post-processing. I never use the GUI; I either use
REST through cURL or the command line.

I would basically need data in the format below:

XML File Name, @name

I am trying to whitelist, picking up values only where
starts-with(@name, 'pqr') holds, where 'pqr' is one of a list of 150-odd
values.

My file names are essentially IDs/keys, which I would need to map further to
some values using SQLite, and maybe group by them, etc.

So basically I am trying to visualize some data based on which XML files it
occurs in. So yes, count(query) would run fine, but it won't serve much
purpose, since I still need the value 'pqr'.
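
A minimal sketch of the output I am after, to be run from the command line so
results are streamed (the database name 'BigData' and the prefix 'pqr' are
placeholders):

(: emit one "file name, @name" line per matching attribute :)
for $doc in db:open('BigData')
for $name in $doc//E/@name[starts-with(., 'pqr')]
return concat(db:path($doc), ',', string($name))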

- Mansi


On Thu, Nov 6, 2014 at 11:19 AM, Christian Grün christian.gr...@gmail.com
wrote:

  Query: /A/*//E/@name/string()

 In the GUI, all results will be cached, so you could think about
 switching to the command line.

 Do you really need to output all results, or do you do some further
 processing with the intermediate results?

 For example, the query count(/A/*//E/@name/string()) will probably
 run without getting stuck.


 
  This query was going OOM within a few minutes.
 
  I tried a few whitelisting approaches with a contains clause to truncate
  the result set. That didn't help either. So now I am out of ideas. This
  is with the JVM given 10GB of dedicated memory.
 
  Once the above query works and doesn't go Out Of Memory, I also need the
  corresponding file names:
 
  XYZ.xml //E/@name
  PQR.xml //E/@name
 
  Let me know if you need more details to appreciate the issue.
  - Mansi
 
  On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün 
 christian.gr...@gmail.com
  wrote:
 
  Hi Mansi,
 
  I think we need more information on the queries that are causing the
  problems.
 
  Best,
  Christian
 
 
 
  On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth mansi.sh...@gmail.com
 wrote:
   Hello,
  
   I have a use case where I have to extract lots of information from each
   XML in each DB: something like the attribute values of most of the nodes
   in an XML. Such queries go Out Of Memory with the exception below. I am
   giving it ~12GB of RAM on an i7 processor. Well, I can't complain here,
   since I am most definitely asking for loads of data, but is there any
   way I can get these kinds of data successfully?
  
   mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
   BaseX 8.0 beta b45c1e2 [Server]
   Server was started (port: 1984)
   HTTP Server was started (port: 8984)
   Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
   at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
   at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
   at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
   at java.lang.Thread.run(Thread.java:744)
  
  
-- 
- Mansi


Re: [basex-talk] Out Of Memory

2014-11-06 Thread Graydon Saunders
Hi Mansi --

If you use

for $x in //E/@name[starts-with(.,'pqr')] return
(tokenize(base-uri($x),'/')[last()], string($x))

for each of the 150-odd values (you may want to generate the query :)
it will be more likely to work.  It's not just the size of the database,
it's the size of the result, too; keeping the individual query results
small gives the optimizer a chance to recognize it's done with some
data and free up some memory.  I've had to work pretty hard before at
keeping the intermediate stages small enough to fit in memory, for
queries where simple queries on a ~4 GB database were quite fast.  It
was large intermediate data structures that would exhaust the
available memory.
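
Something along these lines, say (an untested sketch; the database name
'BigData' and the two prefixes are placeholders), with one pass per prefix
and each result set written to its own file, so nothing large has to stay
in memory at once:

for $p in ('pqr', 'xyz')  (: generate this list from your 150-odd values :)
return file:write-text(
  $p || '.csv',
  string-join(
    for $x in db:open('BigData')//E/@name[starts-with(., $p)]
    return concat(tokenize(base-uri($x), '/')[last()], ',', string($x)),
    '&#10;'
  )
)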

-- Graydon

On Thu, Nov 6, 2014 at 2:58 PM, Mansi Sheth mansi.sh...@gmail.com wrote:
 Briefly explaining: I am trying to extract these values per XML file (where
 the .xml file names are IDs), to map them to their corresponding values.

 Imagine you have 100s of customers, and each customer uses/needs 1000s of
 different @name values. These @name values are similar across customers, but
 some customers use certain values and other customers use others. I am
 trying to collect all this information and find which @name is used by the
 most customers, and so on and so forth. There are a few such use cases, this
 one being the most generic.
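
 A sketch of the ranking I mean (the database name 'BigData' is a
 placeholder; this counts in how many files each @name occurs):

 for $doc in db:open('BigData')
 for $n in distinct-values($doc//E/@name)
 group by $n
 order by count($doc) descending
 return $n || ',' || count($doc)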


 On Thu, Nov 6, 2014 at 11:23 AM, Fabrice Etanchaud fetanch...@questel.com
 wrote:

 The solution depends on how you will use your extraction.

 May I ask what your extraction is for?



 Best regards,

 Fabrice



 From: Mansi Sheth [mailto:mansi.sh...@gmail.com]
 Sent: Thursday, 6 November 2014 17:11
 To: Fabrice Etanchaud
 Cc: Christian Grün; BaseX
 Subject: Re: [basex-talk] Out Of Memory



 Interesting idea. I had thought of using DB partitioning, but didn't pursue
 it further, mainly due to the following reasoning.

 Currently I am ingesting ~3000 XML files, storing ~50 XML files per DB, and
 that number is growing quickly. So the approach below would produce ~3000
 more files (and counting), considerably increasing I/O operations for
 further pre-processing.

 However, I don't really care if the process takes a few minutes or a few
 hours (as long as it's not day(s) ;)). Given the situation and my options,
 I will surely try this.

 The database is currently indexed at the attribute level, as that's what I
 will be querying most. Do you think I should do anything differently?



 Thanks,

 - Mansi



 On Thu, Nov 6, 2014 at 10:48 AM, Fabrice Etanchaud
 fetanch...@questel.com wrote:

 Hi Mansi,



 Here you have a natural partition of your data: the files you ingested.

 So my first suggestion would be to query your data on a file basis:



 for $doc in db:open('your_collection_name')
 let $file-name := db:path($doc)
 return
   file:write(
     $file-name,
     <names>{
       for $name in $doc//E/@name/data()
       return <name>{$name}</name>
     }</names>
   )



 Is it for indexing?



 Hope it helps,



 Best regards,



 Fabrice Etanchaud

 Questel/Orbit



 From: basex-talk-boun...@mailman.uni-konstanz.de
 [mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Mansi
 Sheth
 Sent: Thursday, 6 November 2014 16:33
 To: Christian Grün
 Cc: BaseX
 Subject: Re: [basex-talk] Out Of Memory



 This would need a lot of details, so bear with me below:



 Briefly my XML files look like:



 <A name="">
   <B name="">
     <C name="">
       <D name="">
         <E name=""/>



 A can contain B, C, or D, and B, C, or D can contain E. We have 1000s
 (currently 3000 in my test data set) of such XML files, 50MB in size on
 average. It's tons of data! Currently, my database is ~18GB in size.



 Query: /A/*//E/@name/string()



 This query was going OOM within a few minutes.



 I tried a few whitelisting approaches with a contains clause to truncate
 the result set. That didn't help either. So now I am out of ideas. This is
 with the JVM given 10GB of dedicated memory.



 Once the above query works and doesn't go Out Of Memory, I also need the
 corresponding file names:



 XYZ.xml //E/@name

 PQR.xml //E/@name



 Let me know if you need more details to appreciate the issue.

 - Mansi



 On Thu, Nov 6, 2014 at 8:48 AM, Christian Grün christian.gr...@gmail.com
 wrote:

 Hi Mansi,

 I think we need more information on the queries that are causing the
 problems.

 Best,
 Christian




 On Wed, Nov 5, 2014 at 8:48 PM, Mansi Sheth mansi.sh...@gmail.com wrote:
  Hello,
 
  I have a use case where I have to extract lots of information from each
  XML in each DB: something like the attribute values of most of the nodes
  in an XML. Such queries go Out Of Memory with the exception below. I am
  giving it ~12GB of RAM on an i7 processor. Well, I can't complain here,
  since I am most definitely asking for loads of data, but is there any way
  I can get these kinds of data successfully?
 
  mansi