Re: [basex-talk] db:optimize

2020-01-29 Thread first name last name
But if you are willing to give up that GUI, you could instead start BaseX's
http server [1] and use the DBA interface [2] in your browser, writing all
your code there.
The DBA interface will not give you auto-complete (and other nice things you
see in the GUI), but you can still write code, run code, and save it to disk
there.

[1] http://docs.basex.org/wiki/Command-Line_Options#HTTP_Server
[2] http://docs.basex.org/wiki/DBA
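
For a minimal local setup (assuming a default installation and the default HTTP
port 8984), this boils down to:

  # start the BaseX HTTP server from the installation directory
  bin/basexhttp
  # then open the DBA in a browser:
  #   http://localhost:8984/dba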


On Wed, Jan 29, 2020 at 10:57 AM first name last name <
randomcod...@gmail.com> wrote:

> "For example, if you only read data, you can easily run several clients
> (standalone, GUI, database clients) at the same time. If you update your
> data, however, you shouldn’t use the GUI or a standalone instance at the
> same time."
> [1] http://docs.basex.org/wiki/Startup#Concurrent_Operations
>
> On Wed, Jan 29, 2020 at 9:25 AM Ветошкин Владимир 
> wrote:
>
>> Hi, everybody!
>>
>> I use basex+php.
>> When I call db:optimize('A') from basex gui I can read from db 'B'.
>> But if I call db:optimize('A') from php (php cli) I can't read 'B' while
>> the first query is working.
>> Why? And how can I solve it?
>> I wracked my brain...
>>
>> --
>> С уважением,
>> Ветошкин Владимир Владимирович
>>
>>
>


Re: [basex-talk] db:optimize

2020-01-29 Thread first name last name
"For example, if you only read data, you can easily run several clients
(standalone, GUI, database clients) at the same time. If you update your
data, however, you shouldn’t use the GUI or a standalone instance at the
same time."
[1] http://docs.basex.org/wiki/Startup#Concurrent_Operations

On Wed, Jan 29, 2020 at 9:25 AM Ветошкин Владимир 
wrote:

> Hi, everybody!
>
> I use basex+php.
> When I call db:optimize('A') from basex gui I can read from db 'B'.
> But if I call db:optimize('A') from php (php cli) I can't read 'B' while
> the first query is working.
> Why? And how can I solve it?
> I wracked my brain...
>
> --
> С уважением,
> Ветошкин Владимир Владимирович
>
>


Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-08 Thread first name last name
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün 
wrote:

>
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
>
Ok, so I finally managed to reach a compromise between BaseX's capabilities
and the hardware that I have at my disposal (for the time being).
This message will probably answer thread [1] as well (a separate thread, but
it basically asks the same question: how to use BaseX as a command-line
XQuery processor).
The attached script takes a large collection of HTML documents, packs them
into small "balanced" sets, and then runs XQuery on them using BaseX.
The result is a lot of SQL files ready to be imported into PostgreSQL (with
some small tweaks, the data could also be adapted for import into
Elasticsearch).
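
As a rough sketch of that "write SQL to disk, import it in a second step" idea
(the shard prefix and element selection follow the attached scripts; the SQL
escaping and the output path are simplifying assumptions):

cat << 'XQ' > dump-sql.xq
(: Sketch only: emit INSERT statements per shard db and write one .sql file each.
   Escaping is reduced to doubling single quotes. :)
for $db in db:list()[starts-with(., "linuxquestions-shard-")]
let $stmts :=
  for $x in db:open($db)
  let $text := replace(string-join($x//*[matches(@id, "post_message_")]/text(), " "), "'", "''")
  return "INSERT INTO thread(label,content,forum) VALUES ('" || $x/fn:base-uri()
         || "', '" || $text || "', '" || $db || "');"
return file:write-text("tmp/" || $db || ".sql", string-join($stmts, "&#10;"))
XQ
basex dump-sql.xq   # the resulting tmp/*.sql files can then be fed to psql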

I'm also including some benchmark data:

On system1 the following times were recorded: run with -j4, it does 200 forum
thread pages in 10 seconds, and there are about 5 posts on average per thread
page. So in 85000 seconds (almost a day) it would process ~1.7M posts (in
~340k forum thread pages) and have them prepared to be imported into
PostgreSQL. With -j4 the observed peak memory usage was 500MB.

I've tested the script attached on the following two systems:
system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
- memory: 4GB RAM DDR @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
#!/bin/bash
#
# This script leverages BaseX as an XQuery command line processor
# by using multiple small disposable BaseX databases, and parallelizing the entire processing.
# It will essentially run XQuery in batches on large data sets, and produce
# SQL insert statements, so the data can be imported into PostgreSQL.
#
# We're packing files for processing, and we're trying to balance them out in sets
# such that two constraints are met:
# - no more than 100 files per set
# - no more than 100*90k bytes per set
#
# Timing:
#
# On system1 the following times were recorded:
# If run with -j4 it does 200 thread pages in 10 seconds,
# and there are about 5 posts on average per thread page.
# So in 85000 seconds (almost a day) it would process ~1.7M posts
# (in ~340k forum thread pages) and have them prepared to be imported
# into PostgreSQL.
# Again, for -j4, the observed peak memory usage was 500MB.
#
# Notes:
#
# 1)
# The following error (found through strace) would manifest itself, mainly
# because of GNU Parallel:
# --- stopped by SIGTTOU ---
# It's also described here:
# https://notmuchmail.org/pipermail/notmuch/2019/028015.html
# It can be circumvented through the use of script
# (script - make typescript of terminal session)
#
# 2) --linebuffer is used for GNU Parallel so it can write to stdout as
# soon as possible.
#
#
# system1 config:
# - BaseX 9.2.4
# - script (from util-linux 2.31.1)
# - GNU Parallel 20161222
# - Ubuntu 18.04 LTS
# 
# system1 hardware:
# - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
# - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
# - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM 
#
# system2 config:
# - BaseX 9.2.4
# - GNU Parallel 20181222
# - script (from util-linux 2.34)
#
# system2 hardware:
# - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
# - memory: 4GB RAM DDR @ 1600MHz
# - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
#
#

BASEX="$HOME/basex-preloaded-forums/bin/basex"
mkdir meta-process
echo "Partitioning files into different sets ..."
#fpart -f 100 -s $((100 * 9)) `pwd`/threads/ -o meta-process/files-shard

proc() {
s="$1"
f="$2"
j="$3"

echo "$s -- $j -- $f"

# Build script to create temp db, and import all the html files
SHARD_IMPORT_SCRIPT=$(pwd)"/tmp/import.script.$s"
SHARD_PROCESS_SCRIPT=$(pwd)"/tmp/process.script.$s"
SHARD_SQL_OUT=$(pwd)"/tmp/partial.$j.sql"

cat << EOF > $SHARD_IMPORT_SCRIPT
DROP   DB tmp-$s
CREATE DB tmp-$s
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
EOF
cat $f | perl -pne 's{^}{ADD }g;' >> $SHARD_IMPORT_SCRIPT ;
 
script --return -c "$BASEX < $SHARD_IMPORT_SCRIPT >/dev/null ; echo 'Importing Done'; "
 
# Build processing script, to pull values and build SQL queries
echo "for \$x in 

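A hypothetical driver for proc() with GNU Parallel could look like this (the
argument mapping, shard-list location, and job count are assumptions, not the
original invocation):

export -f proc
# {%} = job slot (temp db reused per slot), {} = shard file list, {#} = job number
ls meta-process/files-shard.* | parallel -j4 --linebuffer proc {%} {} {#}
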
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-07 Thread first name last name
I'm currently at work and my setup is at home. In about 7 hours I'll get
home and I will send the stack trace.

Meanwhile, is there any way to write a FLWOR, a loop, in a batched style?

For example, in my case, the approach I described for migrating data from
BaseX to PostgreSQL uses BaseX as an XQuery processor and moves full-text
indexing over to PostgreSQL; that is what I'm trying to do.

However, in order to avoid OOM, I am thinking of batching the transfer into
chunks, and potentially restarting the BaseX server between the migration of
each chunk.
That's why I am asking how I could do that in BaseX. My hope is that the OOM
could be avoided this way, because not all the data would pass through main
memory at once and there would be less chance of the JVM GC having to deal
with all of it.
Restarting the BaseX server between chunk transfers would help ensure that
whatever memory was used is released.

So I wonder if something like a positional range predicate (selecting only
the items from one position to another) would work here. Of course, a count
would have to be done beforehand to know how many batches there will be. Or,
even without knowing how many batches there will be, a while-type loop could
be written in Bash, with the stop condition being a check whether the current
chunk is empty.

Would an approach like this work to mitigate the OOM? Are there
alternatives or work-arounds to this kind of OOM?
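
A sketch of that while-style loop (database name, element selection, chunk
size, and output file are placeholder assumptions; standalone basex is used
here instead of the client/server setup):

cat << 'XQ' > batch.xq
declare variable $start external;
declare variable $end   external;
string-join(
  (db:open("archive-shard")//*[matches(@id, "post_message_")])
    [position() = xs:integer($start) to xs:integer($end)]
    ! normalize-space(.),
  "&#10;")
XQ

START=1
CHUNK=10000
while : ; do
  RESULT=$(basex -bstart=$START -bend=$((START + CHUNK - 1)) batch.xq)
  [ -z "$RESULT" ] && break        # empty chunk: nothing left
  printf '%s\n' "$RESULT" >> batch-output.txt
  START=$((START + CHUNK))
done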

Thanks

On Mon, Oct 7, 2019, 1:13 AM Christian Grün 
wrote:

> Some complementary notes (others may be able to tell you more about their
> experiences with large data sets):
>
> a GiST index would have to be built there, to allow full-text searches;
>> PostgreSQL is picked
>>
>
> You could as well have a look at Elasticsearch or its predecessors.
>
> there might be a leak in the BaseX implementation of XQuery.
>>
>
> I assume you are referring to the SQL Module? Feel free to attach the OOM
> stack trace, it might give us more insight.
>
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
>
>
>
>


[basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-06 Thread first name last name
Hello,

This is essentially part 2 of trying to index large amounts of web data.
To summarize what happened before: the initial discussion started here [1],
Christian suggested some options, I dove into each of them, and I realized
that doing this on a low-memory system is harder than I initially thought.
At Christian's suggestion, I tried to split the big db into smaller dbs and
came up with a rudimentary sharding mechanism [3].
All attempts to full-text index 30GB of data in BaseX, for me, resulted in
OOM (do take into consideration that I only have 3.1GB of memory to allocate
for BaseX).

Where to next?
I decided to look more into what Christian said in [2] about option 2: pick
the exact values that I want and transfer them to PostgreSQL (after the
transfer, a GiST index would have to be built there to allow full-text
searches; PostgreSQL is picked because it uses an in-memory buffer for all
large operations, plus several files on disk, and if it needs to combine
results that exceed the available memory it goes to disk, but at no point
does it exceed the given amount of memory).

Variant 1 (see attached script pg-import.sh)
All good. So, I basically started writing XQuery that would do the
following:
- Open up a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, along with the db
it belonged to
- Create a prepared statement for one such thread page, populate the prepared
statement, and execute it
This ended up in OOM after around 250k records. To be clear, 250k rows did
make it into PostgreSQL, which is nice, but eventually it ended up in OOM.
(Perhaps it has to do with how the GC works in Java; I don't know.)

Variant 2 (see attached script pg-import2.sh)
I did something similar to the above:
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message
content, the post id, and the BaseX db name (because we're going over all
shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above,
and execute it
This also ended up in OOM, after around 340k records (my approximation would
be that there are around 3M posts in the data).

To summarize, I'm tempted to believe that there might be a leak in the BaseX
implementation of XQuery.
The relevant versions of the software used are:
- BaseX 9.2.4
- java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory parameter was -Xmx3100m

I would be interested to know your thoughts

Thanks,
Stefan

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html
#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
id SERIAL PRIMARY KEY,
content TEXT,
label VARCHAR(300),
forum VARCHAR(300)
);

-- CREATE INDEX idx_thread_content ON thread USING gist(content);
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);

EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql



cat << 'EOF' > tmp/import.xq

let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label := $x/fn:base-uri()
let $content := $x//*[matches(@id, "post_message_")]/text()
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>{ $label }</sql:parameter>
    <sql:parameter type='string'>{ $content }</sql:parameter>
    <sql:parameter type='string'>{ $db }</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return
  try {
    sql:execute-prepared($prep, $params)
  } catch * {
    'Error [' || $err:code || ']: ' || $err:description || '--' || $params
  }

EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq


#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";

\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;

DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
id SERIAL PRIMARY KEY,
date VARCHAR(300),
author VARCHAR(300),
post_id VARCHAR(300),
doc_uri VARCHAR(300),
content TEXT,
basex_db VARCHAR(300)
);

CREATE INDEX idx_thread2_date ON thread2(date);
CREATE 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-06 Thread first name last name
Regarding selective full-text indexing, I just tried
XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex': true(), 'ftinclude': 'div table td a' })
and I got OOM on that; the exact stack trace is attached to this message.

I will open a separate thread regarding migrating the data from BaseX
shards to PostgreSQL (for the purpose of full-text indexing).

On Sun, Oct 6, 2019 at 10:19 AM Christian Grün 
wrote:

> The current full text index builder provides a similar outsourcing
> mechanism to that of the index builder for the default index structures;
> but the meta data structures are kept in main-memory, and they are more
> bulky. There are definitely ways to tackle this technically; it hasn't been
> of high priority so far, but this may change.
>
> Please note that you won't create an index over your whole data set in
> RDBMS. Instead, you'll usually create it for specific fields that you will
> query later on. It's a convenience feature in BaseX that you can build an
> index for all of your data. For large full-text corpora, however, it's
> recommendable in most cases to restrict indexing to the relevant XML
> elements.
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 23:28:
>
>> Attached a more complete output of ./bin/basexhttp . Judging from this
>> output, it would seem that everything was ok, except for the full-text
>> index.
>> I now realize that I have another question about full-text indexes. It
>> seems like the full-text index here is dependent on the amount of memory
>> available (in other words, the more data to be indexed, the more RAM memory
>> required).
>>
>> I was using a certain popular RDBMS, for full-text indexing, and I never
>> bumped into problems like it running out of memory when building such
>> indexes.
>> I think their model uses a certain buffer in memory, and it keeps
>> multiple files on disk where it stores data, and then it assembles the
>> results in memory, always keeping to the constraint of using only as much
>> memory as it was declared to be allowed to use.
>> Perhaps the topic would be "external memory algorithms" or "full-text
>> search using secondary storage".
>> I'm not an expert in this field, but.. my question here would be if this
>> kind of thing is something that BaseX is looking to handle in the future?
>>
>> Thanks,
>> Stefan
>>
>>
>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
>> wrote:
>>
>>> The stack trace indicates that you enabled the full-text index as well.
>>> For this index, you definitely need more memory than available on your
>>> system.
>>>
>>> So I assume you didn't encounter trouble with the default index
>>> structures?
>>>
>>>
>>>
>>>
>>> first name last name  schrieb am Sa., 5. Okt.
>>> 2019, 20:52:
>>>
>>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>>>> allocate on that system for BaseX) and I got OOM.
>>>>
>>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <
>>>> christian.gr...@gmail.com> wrote:
>>>>
>>>>> About option 1: How much memory have you been able to assign to the
>>>>> Java VM?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> first name last name  schrieb am Sa., 5. Okt.
>>>>> 2019, 01:11:
>>>>>
>>>>>> I had another look at the script I wrote and realized that it's not
>>>>>> working as it's supposed to.
>>>>>> Apparently the order of operations should be this:
>>>>>> - turn on all the types of indexes required
>>>>>> - create the db
>>>>>> - the parser settings and the filter settings
>>>>>> - add all the files to the db
>>>>>> - run "OPTIMIZE"
>>>>>>
>>>>>> If I'm not doing them in this order (specifically with "OPTIMIZE" at
>>>>>> the end) the resulting db lacks all indexes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>>>>>> randomcod...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> About option 4:
>>>>>>> I agree with the options you laid out. I am currently diving deeper
>>

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-05 Thread first name last name
Attached a more complete output of ./bin/basexhttp . Judging from this
output, it would seem that everything was ok, except for the full-text
index.
I now realize that I have another question about full-text indexes. It
seems like the full-text index here is dependent on the amount of memory
available (in other words, the more data to be indexed, the more RAM memory
required).

I was using a certain popular RDBMS, for full-text indexing, and I never
bumped into problems like it running out of memory when building such
indexes.
I think their model uses a certain buffer in memory, and it keeps multiple
files on disk where it stores data, and then it assembles the results in
memory, always keeping to the constraint of using only as much memory as it
was declared to be allowed to use.
Perhaps the topic would be "external memory algorithms" or "full-text
search using secondary storage".
I'm not an expert in this field, but my question here would be whether this
kind of thing is something that BaseX is looking to handle in the future.

Thanks,
Stefan


On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
wrote:

> The stack trace indicates that you enabled the full-text index as well. For
> this index, you definitely need more memory than available on your system.
>
> So I assume you didn't encounter trouble with the default index structures?
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 20:52:
>
>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>> allocate on that system for BaseX) and I got OOM.
>>
>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
>> wrote:
>>
>>> About option 1: How much memory have you been able to assign to the Java
>>> VM?
>>>
>>>
>>>
>>>
>>>
>>> first name last name  schrieb am Sa., 5. Okt.
>>> 2019, 01:11:
>>>
>>>> I had another look at the script I wrote and realized that it's not
>>>> working as it's supposed to.
>>>> Apparently the order of operations should be this:
>>>> - turn on all the types of indexes required
>>>> - create the db
>>>> - the parser settings and the filter settings
>>>> - add all the files to the db
>>>> - run "OPTIMIZE"
>>>>
>>>> If I'm not doing them in this order (specifically with "OPTIMIZE" at
>>>> the end) the resulting db lacks all indexes.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>>>> randomcod...@gmail.com> wrote:
>>>>
>>>>> Hi Christian,
>>>>>
>>>>> About option 4:
>>>>> I agree with the options you laid out. I am currently diving deeper
>>>>> into option 4 in the list you wrote.
>>>>> Regarding the partitioning strategy, I agree. I did manage however to
>>>>> partition the files to be imported, into separate sets, with a constraint
>>>>> on max partition size (on disk) and max partition file count (the number 
>>>>> of
>>>>> XML documents in each partition).
>>>>> The tool called fpart [5] made this possible (I can imagine more
>>>>> sophisticated bin-packing methods, involving pre-computed node count
>>>>> values, and other variables, can be achieved via glpk [6] but that might 
>>>>> be
>>>>> too much work).
>>>>> So, currently I am experimenting with a max partition size of 2.4GB
>>>>> and a max file count of 85k files, and fpart seems to have split the file
>>>>> list into 11 partitions of 33k files each and the size of a partition 
>>>>> being
>>>>> ~ 2.4GB.
>>>>> So, I wrote a script for this, it's called sharded-import.sh and
>>>>> attached here. I'm also noticing that the /dba/ BaseX web interface is not
>>>>> blocked anymore if I run this script, as opposed to running the previous
>>>>> import where I run
>>>>>   CREATE DB db_name /directory/
>>>>> which allows me to see the progress or allows me to run queries before
>>>>> the big import finishes.
>>>>> Maybe the downside is that it's more verbose, and prints out a ton of
>>>>> lines like
>>>>>   > ADD /share/Public/archive/tech-sites/
>>>>> linuxquestions.org/threads/viewtopic_9_356613.html
>>>>>   Resource(s) added in 47.76 ms.
>>>>> along the way, and maybe that's slower than before.
>>>>>
>>>>

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-05 Thread first name last name
Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
allocate on that system for BaseX) and I got OOM.

On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
wrote:

> About option 1: How much memory have you been able to assign to the Java
> VM?
>
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 01:11:
>
>> I had another look at the script I wrote and realized that it's not
>> working as it's supposed to.
>> Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I'm not doing them in this order (specifically with "OPTIMIZE" at the
>> end) the resulting db lacks all indexes.
>>
>>
>>
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>> randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out. I am currently diving deeper into
>>> option 4 in the list you wrote.
>>> Regarding the partitioning strategy, I agree. I did manage however to
>>> partition the files to be imported, into separate sets, with a constraint
>>> on max partition size (on disk) and max partition file count (the number of
>>> XML documents in each partition).
>>> The tool called fpart [5] made this possible (I can imagine more
>>> sophisticated bin-packing methods, involving pre-computed node count
>>> values, and other variables, can be achieved via glpk [6] but that might be
>>> too much work).
>>> So, currently I am experimenting with a max partition size of 2.4GB and
>>> a max file count of 85k files, and fpart seems to have split the file list
>>> into 11 partitions of 33k files each and the size of a partition being ~
>>> 2.4GB.
>>> So, I wrote a script for this, it's called sharded-import.sh and
>>> attached here. I'm also noticing that the /dba/ BaseX web interface is not
>>> blocked anymore if I run this script, as opposed to running the previous
>>> import where I run
>>>   CREATE DB db_name /directory/
>>> which allows me to see the progress or allows me to run queries before
>>> the big import finishes.
>>> Maybe the downside is that it's more verbose, and prints out a ton of
>>> lines like
>>>   > ADD /share/Public/archive/tech-sites/
>>> linuxquestions.org/threads/viewtopic_9_356613.html
>>>   Resource(s) added in 47.76 ms.
>>> along the way, and maybe that's slower than before.
>>>
>>> About option 1:
>>> Re: increase memory, I am running these experiments on a low-memory,
>>> old, network-attached storage, model QNAP TS-451+ [7] [8], which I had to
>>> take apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory),
>>> and I can't seem to find around the house any additional memory sticks to
>>> take it up to 8GB (which is also the maximum memory it supports). And if I
>>> want to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
>>> match what it supports, I'm having trouble finding the exact one, Corsair
>>> says it has memory sticks that would work, but I'd have to wait weeks for
>>> them to ship to Bucharest which is where I live.
>>> It seems like buying an Intel NUC that goes up to 64GB of memory would
>>> be a bit too expensive at $1639 [9] but .. people on reddit [10] were
>>> discussing some years back about this supermicro server [11] which is only
>>> $668 and would allow to add up to 64GB of memory.
>>> Basically I would buy something cheap that I can jampack with a lot of
>>> RAM, but a hands-off approach would be best here, so if it comes
>>> pre-equipped with all the memory and everything, would be nice (would spare
>>> the trouble of having to buy the memory separate, making sure it matches
>>> the motherboard specs etc).
>>>
>>> About option 2:
>>> In fact, that's a great idea. But it would require me to write something
>>> that would figure out the XPath patterns where the actual content sits. I
>>> actually wanted to look for some algorithm that's designed to do that, and
>>> try to implement it, but I had no time.
>>> It would either have to detect the repetitive bloated nodes, and build
>>> XPaths for the rest of the nodes, where the actual content sits. I think
>>> this would be equivalent to computing the

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-04 Thread first name last name
I had another look at the script I wrote and realized that it's not working
as it's supposed to.
Apparently the order of operations should be this:
- turn on all the types of indexes required
- create the db
- the parser settings and the filter settings
- add all the files to the db
- run "OPTIMIZE"

If I'm not doing them in this order (specifically with "OPTIMIZE" at the
end) the resulting db lacks all indexes.
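
A sketch of that order as a BaseX command script (database name, input path,
and the chosen index options are placeholders):

cat << 'EOF' > create-db.bxs
SET TEXTINDEX true
SET ATTRINDEX true
SET FTINDEX true
CREATE DB forum-archive
SET PARSER html
SET CREATEFILTER *.html
ADD /path/to/html-files/
OPTIMIZE
EOF
basex create-db.bxs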



On Fri, Oct 4, 2019 at 11:32 PM first name last name 
wrote:

> Hi Christian,
>
> About option 4:
> I agree with the options you laid out. I am currently diving deeper into
> option 4 in the list you wrote.
> Regarding the partitioning strategy, I agree. I did manage however to
> partition the files to be imported, into separate sets, with a constraint
> on max partition size (on disk) and max partition file count (the number of
> XML documents in each partition).
> The tool called fpart [5] made this possible (I can imagine more
> sophisticated bin-packing methods, involving pre-computed node count
> values, and other variables, can be achieved via glpk [6] but that might be
> too much work).
> So, currently I am experimenting with a max partition size of 2.4GB and a
> max file count of 85k files, and fpart seems to have split the file list
> into 11 partitions of 33k files each and the size of a partition being ~
> 2.4GB.
> So, I wrote a script for this, it's called sharded-import.sh and attached
> here. I'm also noticing that the /dba/ BaseX web interface is not blocked
> anymore if I run this script, as opposed to running the previous import
> where I run
>   CREATE DB db_name /directory/
> which allows me to see the progress or allows me to run queries before the
> big import finishes.
> Maybe the downside is that it's more verbose, and prints out a ton of
> lines like
>   > ADD /share/Public/archive/tech-sites/
> linuxquestions.org/threads/viewtopic_9_356613.html
>   Resource(s) added in 47.76 ms.
> along the way, and maybe that's slower than before.
>
> About option 1:
> Re: increase memory, I am running these experiments on a low-memory, old,
> network-attached storage, model QNAP TS-451+ [7] [8], which I had to take
> apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory), and
> I can't seem to find around the house any additional memory sticks to take
> it up to 8GB (which is also the maximum memory it supports). And if I want
> to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
> match what it supports, I'm having trouble finding the exact one, Corsair
> says it has memory sticks that would work, but I'd have to wait weeks for
> them to ship to Bucharest which is where I live.
> It seems like buying an Intel NUC that goes up to 64GB of memory would be
> a bit too expensive at $1639 [9] but .. people on reddit [10] were
> discussing some years back about this supermicro server [11] which is only
> $668 and would allow to add up to 64GB of memory.
> Basically I would buy something cheap that I can jampack with a lot of
> RAM, but a hands-off approach would be best here, so if it comes
> pre-equipped with all the memory and everything, would be nice (would spare
> the trouble of having to buy the memory separate, making sure it matches
> the motherboard specs etc).
>
> About option 2:
> In fact, that's a great idea. But it would require me to write something
> that would figure out the XPath patterns where the actual content sits. I
> actually wanted to look for some algorithm that's designed to do that, and
> try to implement it, but I had no time.
> It would either have to detect the repetitive bloated nodes, and build
> XPaths for the rest of the nodes, where the actual content sits. I think
> this would be equivalent to computing the "web template" of a website,
> given all its pages.
> It would definitely decrease the size of the content that would have to be
> indexed.
> By the way, here I'm writing about a more general procedure, because it's
> not just this dataset that I want to import.. I want to import heavy, large
> amounts of data :)
>
> These are my thoughts for now
>
> [5] https://github.com/martymac/fpart
> [6] https://www.gnu.org/software/glpk/
> [7] https://www.amazon.com/dp/B015VNLGF8
> [8] https://www.qnap.com/en/product/ts-451+
> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
> [10]
> https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
> [11]
> https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>
>
> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün 
> wrote:
>
>> Exactly, it seems to be the final MERGE step during index creation
>> that blows up your system

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-04 Thread first name last name
g; it cannot be automated, though, as the
> partitioning strategy depends on the characteristics of your XML input
> data (some people have huge standalone documents, others have millions
> of small documents, …).
>
> [1] http://docs.basex.org/wiki/Indexes
> [2] A single CREATE call may be sufficient: CREATE DB database
> sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
>
>
>
>
> On Thu, Oct 3, 2019 at 8:53 AM first name last name
>  wrote:
> >
> > I tried again, using SPLITSIZE = 12 in the .basex config file
> > The batch(console) script I used is attached mass-import.xq
> > This time I didn't do the optimize or index creation post-import, but
> instead, I did it as part of the import similar to what
> > is described in [4].
> > This time I got a different error, that is,
> "org.basex.core.BaseXException: Out of Main Memory."
> > So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any
> difference here?
> >
> > Thanks
> >
> > [4] http://docs.basex.org/wiki/Indexes#Performance
> >
> >
> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <
> randomcod...@gmail.com> wrote:
> >>
> >> Hey Christian,
> >>
> >> Thank you for your answer :)
> >> I tried setting in .basex the SPLITSIZE = 24000 but I've seen the same
> OOM behavior. It looks like the memory consumption is moderate until when
> it reaches about 30GB (the size of the db before optimize) and
> >> then memory consumption spikes, and OOM occurs. Now I'm trying with
> SPLITSIZE = 1000 and will report back if I get OOM again.
> >> Regarding what you said, it might be that the merge step is where the
> OOM occurs (I wonder if there's any way to control how much memory is being
> used inside the merge step).
> >>
> >> To quote the statistics page from the wiki:
> >> Databases in BaseX are light-weight. If a database limit is
> reached, you can distribute your documents across multiple database
> instances and access all of them with a single XQuery expression.
> >> This to me sounds like sharding. I would probably be able to split the
> documents into chunks and upload them under a db with the same prefix, but
> varying suffix.. seems a lot like shards. By doing this
> >> I think I can avoid OOM, but if BaseX provides other, better, maybe
> native mechanisms of avoiding OOM, I would try them.
> >>
> >> Best regards,
> >> Stefan
> >>
> >>
> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <
> christian.gr...@gmail.com> wrote:
> >>>
> >>> Hi first name,
> >>>
> >>> If you optimize your database, the indexes will be rebuilt. In this
> >>> step, the builder tries to guess how much free memory is still
> >>> available. If memory is exhausted, parts of the index will be split
> >>> (i. e., partially written to disk) and merged in a final step.
> >>> However, you can circumvent the heuristics by manually assigning a
> >>> static split value; see [1] for more information. If you use the DBA,
> >>> you’ll need to assign this value to your .basex or the web.xml file
> >>> [2]. In order to find the best value for your setup, it may be easier
> >>> to play around with the BaseX GUI.
> >>>
> >>> As you have already seen in our statistics, an XML document has
> >>> various properties that may represent a limit for a single database.
> >>> Accordingly, these properties make it difficult to decide for the
> >>> system when the memory will be exhausted during an import or index
> >>> rebuild.
> >>>
> >>> In general, you’ll get best performance (and your memory consumption
> >>> will be lower) if you create your database and specify the data to be
> >>> imported in a single run. This is currently not possible via the DBA;
> >>> use the GUI (Create Database) or console mode (CREATE DB command)
> >>> instead.
> >>>
> >>> Hope this helps,
> >>> Christian
> >>>
> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
> >>> [2] http://docs.basex.org/wiki/Configuration
> >>>
> >>>
> >>>
> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
> >>>  wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > Let's say there's a 30GB dataset [3] containing most threads/posts
> from [1].
> >>> > After importing all of it, when I try to run /dba/db-optim

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-03 Thread first name last name
I tried again, using SPLITSIZE = 12 in the .basex config file
The batch(console) script I used is attached mass-import.xq
This time I didn't do the optimize or index creation post-import, but
instead, I did it as part of the import similar to what
is described in [4].
This time I got a different error, that is, "org.basex.core.BaseXException:
Out of Main Memory."
So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any
difference here?

Thanks

[4] http://docs.basex.org/wiki/Indexes#Performance
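
For reference, the line I append to the .basex configuration file looks like
this (the file location and the value are just an example; the file is read
on startup):

  echo 'SPLITSIZE = 12' >> "$HOME/.basex"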


On Wed, Oct 2, 2019 at 11:06 AM first name last name 
wrote:

> Hey Christian,
>
> Thank you for your answer :)
> I tried setting in .basex the SPLITSIZE = 24000 but I've seen the same OOM
> behavior. It looks like the memory consumption is moderate until when it
> reaches about 30GB (the size of the db before optimize) and
> then memory consumption spikes, and OOM occurs. Now I'm trying with
> SPLITSIZE = 1000 and will report back if I get OOM again.
> Regarding what you said, it might be that the merge step is where the OOM
> occurs (I wonder if there's any way to control how much memory is being
> used inside the merge step).
>
> To quote the statistics page from the wiki:
> Databases <http://docs.basex.org/wiki/Databases> in BaseX are
> light-weight. If a database limit is reached, you can distribute your
> documents across multiple database instances and access all of them with a
> single XQuery expression.
> This to me sounds like sharding. I would probably be able to split the
> documents into chunks and upload them under a db with the same prefix, but
> varying suffix.. seems a lot like shards. By doing this
> I think I can avoid OOM, but if BaseX provides other, better, maybe native
> mechanisms of avoiding OOM, I would try them.
>
> Best regards,
> Stefan
>
>
> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün 
> wrote:
>
>> Hi first name,
>>
>> If you optimize your database, the indexes will be rebuilt. In this
>> step, the builder tries to guess how much free memory is still
>> available. If memory is exhausted, parts of the index will be split
>> (i. e., partially written to disk) and merged in a final step.
>> However, you can circumvent the heuristics by manually assigning a
>> static split value; see [1] for more information. If you use the DBA,
>> you’ll need to assign this value to your .basex or the web.xml file
>> [2]. In order to find the best value for your setup, it may be easier
>> to play around with the BaseX GUI.
>>
>> As you have already seen in our statistics, an XML document has
>> various properties that may represent a limit for a single database.
>> Accordingly, these properties make it difficult to decide for the
>> system when the memory will be exhausted during an import or index
>> rebuild.
>>
>> In general, you’ll get best performance (and your memory consumption
>> will be lower) if you create your database and specify the data to be
>> imported in a single run. This is currently not possible via the DBA;
>> use the GUI (Create Database) or console mode (CREATE DB command)
>> instead.
>>
>> Hope this helps,
>> Christian
>>
>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>> [2] http://docs.basex.org/wiki/Configuration
>>
>>
>>
>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>>  wrote:
>> >
>> > Hi,
>> >
>> > Let's say there's a 30GB dataset [3] containing most threads/posts from
>> [1].
>> > After importing all of it, when I try to run /dba/db-optimize/ on it
>> (which must have some corresponding command) I get the OOM error in the
>> stacktrace attached. I am using -Xmx2g so BaseX is limited to 2GB of memory
>> (the machine I'm running this on doesn't have a lot of memory).
>> > I was looking at [2] for some estimates of peak memory usage for this
>> "db-optimize" operation, but couldn't find any.
>> > Actually it would be nice to know peak memory usage because.. of
>> course, for any database (including BaseX) a common operation is to do
>> server sizing, to know what kind of server would be needed.
>> > In this case, it seems like 2GB memory is enough to impor

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-02 Thread first name last name
Hey Christian,

Thank you for your answer :)
I tried setting in .basex the SPLITSIZE = 24000 but I've seen the same OOM
behavior. It looks like the memory consumption is moderate until it
reaches about 30GB (the size of the db before optimize), and
then memory consumption spikes and OOM occurs. Now I'm trying with
SPLITSIZE = 1000 and will report back if I get OOM again.
Regarding what you said, it might be that the merge step is where the OOM
occurs (I wonder if there's any way to control how much memory is being
used inside the merge step).

To quote the statistics page from the wiki:
Databases <http://docs.basex.org/wiki/Databases> in BaseX are
light-weight. If a database limit is reached, you can distribute your
documents across multiple database instances and access all of them with a
single XQuery expression.
This to me sounds like sharding. I would probably be able to split the
documents into chunks and upload them under a db with the same prefix, but
varying suffix.. seems a lot like shards. By doing this
I think I can avoid OOM, but if BaseX provides other, better, maybe native
mechanisms of avoiding OOM, I would try them.
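
As a concrete illustration of "a single XQuery expression over several
databases" (the shard name prefix here is hypothetical):

  basex 'for $db in db:list()[starts-with(., "linuxquestions-shard-")]
         return $db || ": " || count(db:open($db)//*[matches(@id, "post_message_")])'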

Best regards,
Stefan


On Tue, Oct 1, 2019 at 4:22 PM Christian Grün 
wrote:

> Hi first name,
>
> If you optimize your database, the indexes will be rebuilt. In this
> step, the builder tries to guess how much free memory is still
> available. If memory is exhausted, parts of the index will be split
> (i. e., partially written to disk) and merged in a final step.
> However, you can circumvent the heuristics by manually assigning a
> static split value; see [1] for more information. If you use the DBA,
> you’ll need to assign this value to your .basex or the web.xml file
> [2]. In order to find the best value for your setup, it may be easier
> to play around with the BaseX GUI.
>
> As you have already seen in our statistics, an XML document has
> various properties that may represent a limit for a single database.
> Accordingly, these properties make it difficult to decide for the
> system when the memory will be exhausted during an import or index
> rebuild.
>
> In general, you’ll get best performance (and your memory consumption
> will be lower) if you create your database and specify the data to be
> imported in a single run. This is currently not possible via the DBA;
> use the GUI (Create Database) or console mode (CREATE DB command)
> instead.
>
> Hope this helps,
> Christian
>
> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
> [2] http://docs.basex.org/wiki/Configuration
>
>
>
> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>  wrote:
> >
> > Hi,
> >
> > Let's say there's a 30GB dataset [3] containing most threads/posts from
> [1].
> > After importing all of it, when I try to run /dba/db-optimize/ on it
> (which must have some corresponding command) I get the OOM error in the
> stacktrace attached. I am using -Xmx2g so BaseX is limited to 2GB of memory
> (the machine I'm running this on doesn't have a lot of memory).
> > I was looking at [2] for some estimates of peak memory usage for this
> "db-optimize" operation, but couldn't find any.
> > Actually it would be nice to know peak memory usage because.. of course,
> for any database (including BaseX) a common operation is to do server
> sizing, to know what kind of server would be needed.
> > In this case, it seems like 2GB memory is enough to import 340k
> documents, weighing in at 30GB total, but it's not enough to run
> "dba-optimize".
> > Is there any info about peak memory usage on [2] ? And are there
> guidelines for large-scale collection imports like I'm trying to do?
> >
> > Thanks,
> > Stefan
> >
> > [1] https://www.linuxquestions.org/
> > [2] http://docs.basex.org/wiki/Statistics
> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
>


[basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-09-29 Thread first name last name
Hi,

Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
After importing all of it, when I try to run /dba/db-optimize/ on it (which
must have some corresponding command) I get the OOM error in the stacktrace
attached. I am using -Xmx2g so BaseX is limited to 2GB of memory (the
machine I'm running this on doesn't have a lot of memory).
I was looking at [2] for some estimates of peak memory usage for this
"db-optimize" operation, but couldn't find any.
Actually, it would be nice to know the peak memory usage because, of course,
for any database (including BaseX) a common task is server sizing, i.e.
knowing what kind of server would be needed.
In this case, it seems like 2GB memory is enough to import 340k documents,
weighing in at 30GB total, but it's not enough to run "dba-optimize".
Is there any info about peak memory usage on [2] ? And are there guidelines
for large-scale collection imports like I'm trying to do?
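
(I assume the corresponding console call would be db:optimize with all index
structures enabled, something like:

  basex 'db:optimize("linuxquestions.org", true())'

with the database name as a placeholder.)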

Thanks,
Stefan

[1] https://www.linuxquestions.org/
[2] http://docs.basex.org/wiki/Statistics
[3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
java.io.FileNotFoundException: /share/CACHEDEV1_DATA/Public/builds/basex/data/linuxquestions.org_938223018/inf.basex (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.basex.io.IOFile.outputStream(IOFile.java:158)
    at org.basex.io.out.DataOutput.<init>(DataOutput.java:47)
    at org.basex.io.out.DataOutput.<init>(DataOutput.java:36)
    at org.basex.data.DiskData.write(DiskData.java:137)
    at org.basex.data.DiskData.close(DiskData.java:160)
    at org.basex.core.cmd.OptimizeAll.optimizeAll(OptimizeAll.java:145)
    at org.basex.query.up.primitives.db.DBOptimize.apply(DBOptimize.java:124)
    at org.basex.query.up.DataUpdates.apply(DataUpdates.java:175)
    at org.basex.query.up.ContextModifier.apply(ContextModifier.java:120)
    at org.basex.query.up.Updates.apply(Updates.java:178)
    at org.basex.query.QueryContext.update(QueryContext.java:701)
    at org.basex.query.QueryContext.iter(QueryContext.java:332)
    at org.basex.http.restxq.RestXqResponse.serialize(RestXqResponse.java:73)
    at