[basex-talk] // versus /*/

2019-10-06 Thread thufir
Do these have the same meaning?  Might there be a subtle distinction, or 
are they just written differently but functionally identical?



>
> xquery /*/book[@id="bk112"]

<book id="bk112">
  <author>Galos, Mike</author>
  <title>Visual Studio 7: A Comprehensive Guide</title>
  <genre>Computer</genre>
  <price>49.95</price>
  <publish_date>2001-04-16</publish_date>
  <description>Microsoft Visual Studio 7 is explored in depth,
  looking at how Visual Basic, Visual C++, C#, and ASP+ are
  integrated into a comprehensive development
  environment.</description>
</book>

Query executed in 1.22 ms.
>
> xquery //book[@id="bk112"]/title
<title>Visual Studio 7: A Comprehensive Guide</title>
Query executed in 1.52 ms.
>
> xquery /*/book[@id="bk112"]/title
<title>Visual Studio 7: A Comprehensive Guide</title>
Query executed in 1.67 ms.
>


They're equally efficient, at least as used above?
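
To make it concrete, here is a small test that I would expect to show the
difference, if I understand correctly that // descends to any depth while /*/
only selects children of the document's root element (contrived document, not
the sample above):

(: contrived example: one book element nested a level deeper :)
let $doc := document {
  <catalog>
    <book id="bk112"/>
    <section>
      <book id="bk999"/>
    </section>
  </catalog>
}
return (
  count($doc//book),   (: 2: book elements at any depth        :)
  count($doc/*/book)   (: 1: only children of the root element :)
)

On the sample above, where every book is a direct child of the root, both
select the same nodes; whether // costs more presumably depends on how the
query processor rewrites the path.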





thanks,

Thufir


Re: [basex-talk] get list of databases

2019-10-06 Thread thufir



thanks, worked perfectly.

On 2019-10-06 3:42 p.m., Christian Grün wrote:

Try

new List().execute(ctx);

See [1] for details.

Cheers,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/core/cmd/List.java#L28





Re: [basex-talk] get list of databases

2019-10-06 Thread Christian Grün
Try

new List().execute(ctx);

See [1] for details.

Cheers,
Christian

[1]
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/core/cmd/List.java#L28
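
If you ever want the same overview from XQuery instead, something along these
lines should work (a quick sketch, assuming db:list-details() returns each
database with its resource count and size as attributes):

(: rough XQuery counterpart of the "list" command output :)
for $db in db:list-details()
return string-join(
  ($db, string($db/@resources), string($db/@size)),
  '  '
)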



thufir wrote on Mon, 7 Oct 2019, 00:35:

> Hi,
>
> pardon the silly question, but how do I get a list of databases exactly?
>   I'm looking at:
>
>
> https://github.com/BaseXdb/basex/blob/master/basex-examples/src/main/java/org/basex/examples/local/CreateCollection.java
>
> and want output like:
>
>  > list
> Name   Resources  Size  Input Path
> --
> books  0  4570
>
> 1 database(s).
>  >
>
>
> but using:
>
> http://docs.basex.org/javadoc/org/basex/core/cmd/List.html
>
>
>
>
> thanks,
>
> Thufir
>


[basex-talk] get list of databases

2019-10-06 Thread thufir

Hi,

pardon the silly question, but how do I get a list of databases exactly? 
 I'm looking at:


https://github.com/BaseXdb/basex/blob/master/basex-examples/src/main/java/org/basex/examples/local/CreateCollection.java

and want output like:

> list
Name   Resources  Size  Input Path
--
books  0  4570

1 database(s).
>


but using:

http://docs.basex.org/javadoc/org/basex/core/cmd/List.html




thanks,

Thufir


Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-06 Thread Christian Grün
Some complementary notes (others may be able to tell you more about their
experiences with large data sets):

> a GiST index would have to be built there, to allow full-text searches;
> PostgreSQL is picked

You could as well have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.

I assume you are referring to the SQL Module? Feel free to attach the OOM
stack trace; it might give us more insight.

I would recommend writing SQL commands or an SQL dump to disk (see the
BaseX File Module for more information) and running/importing that file in a
second step; this is probably faster than sending hundreds of thousands of
single SQL commands via JDBC, whether you are using XQuery or Java.
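
A rough sketch of what that could look like, reusing the shard and table names
from your attached script (untested; the file path is illustrative and the
escaping is deliberately minimal):

(: write one INSERT statement per thread page to a dump file instead of
   sending each row through JDBC :)
let $escape := function($s) { replace(string($s), "'", "''") }
let $lines :=
  for $db in db:list()[matches(., 'linuxquestions-shard-')]
  for $page in db:open($db)
  let $content := string-join($page//*[matches(@id, 'post_message_')]/text(), ' ')
  return
    "INSERT INTO thread(label,content,forum) VALUES ('"
    || $escape($page/fn:base-uri()) || "','"
    || $escape($content) || "','"
    || $escape($db) || "');"
return file:write-text-lines('/share/Public/archive/thread-dump.sql', $lines)

For millions of rows, it may also be worth writing one file per shard (or
appending with file:append-text-lines) so that the intermediate sequence
stays small.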


[basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-06 Thread first name last name
Hello,

This is essentially part 2 of trying to index large amounts of web data.
To summarize what happened before: the initial discussion started here [1],
Christian suggested some options, I dove into each of them, and I realized
that doing this on a low-memory system is harder than I initially thought.
At Christian's suggestion, I tried to split the big db into smaller dbs and
came up with a rudimentary sharding mechanism [3].
All my attempts to full-text index 30 GB of data in BaseX resulted in OOM
(do take into consideration that I only have 3.1 GB of memory to allocate
for BaseX).

Where to next? I decided to look more into what Christian said in [2] about
option 2: pick the exact values that I want and transfer them to PostgreSQL.
(After the transfer, a GiST index would have to be built there to allow
full-text searches; PostgreSQL was picked because it uses an in-memory
buffer for all large operations plus several files on disk, and if it needs
to combine results that exceed the available memory it spills to disk, never
exceeding the amount of memory it has been given.)

Variant 1 (see attached script pg-import.sh)
I basically started writing XQuery that would do the following:
- Open a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, along with the db
it belongs to
- Create a prepared statement for each such thread page, populate it, and
execute it
This ended in OOM after around 250k records. To be clear, 250k rows did make
it into PostgreSQL, which is nice, but eventually the run went OOM. (Perhaps
it has to do with how the GC works in Java; I don't know.)

Variant 2 (see attached script pg-import2.sh)
I did something similar to the above:
- Open a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message
content, the post id, and the BaseX db name (because we're going over all
shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above,
and execute it
This also ended in OOM, after around 340k records (my estimate is that there
are around 3M posts in the data).
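
The core loop of variant 2 looks roughly like this (a sketch only; the date
and author values below are placeholders, the real extraction depends on the
forum markup):

(: sketch of the variant 2 loop; the date/author values are placeholders :)
let $pgconn := sql:connect(
  "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres")
for $db in db:list()[matches(., 'linuxquestions-shard-')]
for $doc in db:open($db)
for $post in $doc//*[matches(@id, 'post_message_')]
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>unknown-date</sql:parameter>
    <sql:parameter type='string'>unknown-author</sql:parameter>
    <sql:parameter type='string'>{ string($post/@id) }</sql:parameter>
    <sql:parameter type='string'>{ $doc/fn:base-uri() }</sql:parameter>
    <sql:parameter type='string'>{ string-join($post//text(), ' ') }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn,
  "INSERT INTO thread2(date,author,post_id,doc_uri,content,basex_db) VALUES(?,?,?,?,?,?)")
return sql:execute-prepared($prep, $params)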

To summarize, I'm tempted to believe that there might be a leak in the
BaseX implementation of XQuery.
The relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory param value was  -Xmx3100m

I would be interested to know your thoughts.

Thanks,
Stefan

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html
#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
id SERIAL PRIMARY KEY,
content TEXT,
label VARCHAR(300),
forum VARCHAR(300)
);

-- CREATE INDEX idx_thread_content ON thread USING gist(content);
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);

EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql



cat << 'EOF' > tmp/import.xq

(: open a JDBC connection to PostgreSQL :)
let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)

(: process only the shard databases :)
let $dbs := fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label := $x/fn:base-uri()
let $content := $x//*[matches(@id, "post_message_")]/text()

(: bind label, content and shard name as prepared-statement parameters :)
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>{ $label }</sql:parameter>
    <sql:parameter type='string'>{ $content }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return
  try {
    sql:execute-prepared($prep, $params)
  } catch * {
    'Error [' || $err:code || ']: ' || $err:description || '--' || $params
  }

EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq


#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
id SERIAL PRIMARY KEY,
date VARCHAR(300),
author VARCHAR(300),
post_id VARCHAR(300),
doc_uri VARCHAR(300),
content TEXT,
basex_db VARCHAR(300)
);

CREATE INDEX idx_thread2_date ON thread2(date);
CREATE 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-06 Thread first name last name
Regarding selective full-text indexing, I just tried
XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex': true(), 'ftinclude': 'div table td a' })
and I got OOM on that; the exact stack trace is attached to this message.

I will open a separate thread regarding migrating the data from BaseX
shards to PostgreSQL (for the purpose of full-text indexing).

On Sun, Oct 6, 2019 at 10:19 AM Christian Grün 
wrote:

> The current full-text index builder provides a similar outsourcing
> mechanism to that of the index builder for the default index structures,
> but the metadata structures are kept in main memory, and they are more
> bulky. There are definitely ways to tackle this technically; it hasn't been
> of high priority so far, but this may change.
>
> Please note that you won't create an index over your whole data set in an
> RDBMS either. Instead, you'll usually create it for specific fields that you
> will query later on. It's a convenience feature in BaseX that you can build
> an index for all of your data. For large full-text corpora, however, it is
> advisable in most cases to restrict indexing to the relevant XML elements.
>
>
>
>
> first name last name wrote on Sat, 5 Oct 2019, 23:28:
>
>> Attached is a more complete output of ./bin/basexhttp. Judging from this
>> output, it would seem that everything was ok, except for the full-text
>> index.
>> I now realize that I have another question about full-text indexes. It
>> seems like the full-text index here is dependent on the amount of memory
>> available (in other words, the more data to be indexed, the more RAM memory
>> required).
>>
>> I was using a certain popular RDBMS, for full-text indexing, and I never
>> bumped into problems like it running out of memory when building such
>> indexes.
>> I think their model uses a certain buffer in memory, and it keeps
>> multiple files on disk where it stores data, and then it assembles
>> the results in-memory, but always under the constraint of using only
>> as much memory as it was allowed to use.
>> Perhaps the topic would be "external memory algorithms" or "full-text
>> search using secondary storage".
>> I'm not an expert in this field, but my question here would be if this
>> kind of thing is something that BaseX is looking to handle in the future?
>>
>> Thanks,
>> Stefan
>>
>>
>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
>> wrote:
>>
>>> The stack trace indicates that you enabled the full-text index as well.
>>> For this index, you definitely need more memory than available on your
>>> system.
>>>
>>> So I assume you didn't encounter trouble with the default index
>>> structures?
>>>
>>>
>>>
>>>
>>> first name last name wrote on Sat, 5 Oct 2019, 20:52:
>>>
 Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
 allocate on that system for BaseX) and I got OOM.

 On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <
 christian.gr...@gmail.com> wrote:

> About option 1: How much memory have you been able to assign to the
> Java VM?
>
>
>
>
>
> first name last name wrote on Sat, 5 Oct 2019, 01:11:
>
>> I had another look at the script I wrote and realized that it's not
>> working as it's supposed to.
>> Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I'm not doing them in this order (specifically with "OPTIMIZE" at
>> the end) the resulting db lacks all indexes.
>>
>>
>>
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>> randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out. I am currently diving deeper
>>> into option 4 in the list you wrote.
>>> Regarding the partitioning strategy, I agree. I did manage however
>>> to partition the files to be imported, into separate sets, with a
>>> constraint on max partition size (on disk) and max partition file count
>>> (the number of XML documents in each partition).
>>> The tool called fpart [5] made this possible (I can imagine more
>>> sophisticated bin-packing methods, involving pre-computed node count
>>> values, and other variables, can be achieved via glpk [6] but that
>>> might be too much work).
>>> So, currently I am experimenting with a max partition size of 2.4GB
>>> and a max file count of 85k files, and fpart seems to have split the
>>> file list into 11 partitions of 33k files each, with each partition
>>> being ~2.4GB.
>>> So, I wrote a script for this, it's called sharded-import.sh and
>>> attached here. I'm also noticing that the /dba/ BaseX web interface is 
>>> not

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-06 Thread Christian Grün
The current full-text index builder provides a similar outsourcing
mechanism to that of the index builder for the default index structures,
but the metadata structures are kept in main memory, and they are more
bulky. There are definitely ways to tackle this technically; it hasn't been
of high priority so far, but this may change.

Please note that you won't create an index over your whole data set in an
RDBMS either. Instead, you'll usually create it for specific fields that you
will query later on. It's a convenience feature in BaseX that you can build
an index for all of your data. For large full-text corpora, however, it is
advisable in most cases to restrict indexing to the relevant XML elements.
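
For example, the relevant options can also be set when the database is
created, so the restricted full-text index is built right away (a sketch;
the database name, input path, target path and element list are only
illustrative):

(: create a database with a full-text index restricted to selected elements :)
db:create(
  'linuxquestions-shard-1',
  '/share/Public/archive/tech-sites/linuxquestions.org/threads/',
  'threads',
  map { 'ftindex': true(), 'ftinclude': 'div td a' }
)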




first name last name wrote on Sat, 5 Oct 2019, 23:28:

> Attached is a more complete output of ./bin/basexhttp. Judging from this
> output, it would seem that everything was ok, except for the full-text
> index.
> I now realize that I have another question about full-text indexes. It
> seems like the full-text index here is dependent on the amount of memory
> available (in other words, the more data to be indexed, the more RAM memory
> required).
>
> I was using a certain popular RDBMS, for full-text indexing, and I never
> bumped into problems like it running out of memory when building such
> indexes.
> I think their model uses a certain buffer in memory, and it keeps multiple
> files on disk where it stores data, and then it assembles the results
> in-memory, but always under the constraint of using only as much memory
> as it was allowed to use.
> Perhaps the topic would be "external memory algorithms" or "full-text
> search using secondary storage".
> I'm not an expert in this field, but my question here would be if this
> kind of thing is something that BaseX is looking to handle in the future?
>
> Thanks,
> Stefan
>
>
> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
> wrote:
>
>> The stack trace indicates that you enabled the full-text index as well.
>> For this index, you definitely need more memory than available on your
>> system.
>>
>> So I assume you didn't encounter trouble with the default index
>> structures?
>>
>>
>>
>>
>> first name last name wrote on Sat, 5 Oct 2019, 20:52:
>>
>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>>> allocate on that system for BaseX) and I got OOM.
>>>
>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
>>> wrote:
>>>
 About option 1: How much memory have you been able to assign to the
 Java VM?





 first name last name wrote on Sat, 5 Oct 2019, 01:11:

> I had another look at the script I wrote and realized that it's not
> working as it's supposed to.
> Apparently the order of operations should be this:
> - turn on all the types of indexes required
> - create the db
> - apply the parser and filter settings
> - add all the files to the db
> - run "OPTIMIZE"
>
> If I'm not doing them in this order (specifically with "OPTIMIZE" at
> the end) the resulting db lacks all indexes.
>
>
>
> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
> randomcod...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> About option 4:
>> I agree with the options you laid out. I am currently diving deeper
>> into option 4 in the list you wrote.
>> Regarding the partitioning strategy, I agree. I did manage however to
>> partition the files to be imported, into separate sets, with a constraint
>> on max partition size (on disk) and max partition file count (the number
>> of XML documents in each partition).
>> The tool called fpart [5] made this possible (I can imagine more
>> sophisticated bin-packing methods, involving pre-computed node count
>> values, and other variables, can be achieved via glpk [6] but that might
>> be too much work).
>> So, currently I am experimenting with a max partition size of 2.4GB
>> and a max file count of 85k files, and fpart seems to have split the file
>> list into 11 partitions of 33k files each, with each partition being
>> ~2.4GB.
>> So, I wrote a script for this; it's called sharded-import.sh and
>> attached here. I'm also noticing that the /dba/ BaseX web interface is not
>> blocked anymore if I run this script, as opposed to running the previous
>> import where I run
>>   CREATE DB db_name /directory/
>> which allows me to see the progress or allows me to run queries
>> before the big import finishes.
>> Maybe the downside is that it's more verbose, and prints out a ton of
>> lines like
>>   > ADD /share/Public/archive/tech-sites/
>> linuxquestions.org/threads/viewtopic_9_356613.html
>>   Resource(s) added in 47.76 ms.
>> along the way, and maybe that's slower than before.
>>
>> About option 1:
>>