Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-08 Thread Christian Grün
I was surprised to see the 16 GB RAM machine pop up in your setup. Did you
check how many gigabytes of XML data can be full-text indexed with BaseX (and a
large -Xmx value, maybe 15g) on that system?
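For instance, something along these lines could be used for a quick test (the
path and database name below are only placeholders; the standard bin/basex
launcher script picks up BASEX_JVM):

export BASEX_JVM="-Xmx15g"
~/basex/bin/basex <<'EOF'
SET FTINDEX true
CREATE DB fttest /data/xml-corpus
INFO DB
EOF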





first name last name  wrote on Tue, Oct 8, 2019, at 22:40:

> On Mon, Oct 7, 2019 at 1:13 AM Christian Grün 
> wrote:
>
>>
>> I would recommend you to write SQL commands or an SQL dump to disk (see
>> the BaseX File Module for more information) and run/import this file in a
>> second step; this is probably faster than sending hundreds of thousands of
>> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>>
>>
> Ok, so I finally managed to reach a compromise between BaseX's
> capabilities and the hardware that I have at my disposal (for the time
> being).
> This message will probably answer thread [1] as well (which is separate
> from this one but essentially asks the same question: how to use BaseX as
> a command-line XQuery processor).
> The attached script takes a large collection of HTML documents, packs them
> into small "balanced" sets, and then runs XQuery on them using BaseX.
> The result is a lot of SQL files ready to be imported into PostgreSQL
> (with some small tweaks, the data could be adapted for import into
> Elasticsearch).
>
> I'm also including some benchmark data:
>
> On system1 the following times were recorded: when run with -j4 it
> processes 200 forum thread pages in 10 seconds, and there are about 5
> posts per thread page on average. So in 85000 seconds (almost a day) it
> would process ~1.7M posts (in ~340k forum thread pages) and have them
> prepared to be imported into PostgreSQL. With -j4 the observed peak memory
> usage was 500MB.
>
> I've tested the attached script on the following two systems:
> system1 config:
> - BaseX 9.2.4
> - script (from util-linux 2.31.1)
> - GNU Parallel 20161222
> - Ubuntu 18.04 LTS
>
> system1 hardware:
> - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
> - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
> - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
>
> system2 config:
> - BaseX 9.2.4
> - GNU Parallel 20181222
> - script (from util-linux 2.34)
>
> system2 hardware:
> - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
> - memory: 4GB DDR RAM @ 1600 MHz
> - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
>
>


Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM

2019-10-08 Thread first name last name
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün 
wrote:

>
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
>
Ok, so I finally managed to reach a compromise between BaseX's capabilities
and the hardware that I have at my disposal (for the time being).
This message will probably answer thread [1] as well (which is separate
from this one but essentially asks the same question: how to use BaseX as a
command-line XQuery processor).
The attached script takes a large collection of HTML documents, packs them
into small "balanced" sets, and then runs XQuery on them using BaseX.
The result is a lot of SQL files ready to be imported into PostgreSQL
(with some small tweaks, the data could be adapted for import into
Elasticsearch).
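
If it helps, the generated files can afterwards be loaded into PostgreSQL one
by one with psql, roughly like this (the database name "forum" is just a
placeholder):

for f in tmp/partial.*.sql; do
    psql -d forum -f "$f"
done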

I'm also including some benchmark data:

On system1 the following times were recorded: when run with -j4 it processes
200 forum thread pages in 10 seconds, and there are about 5 posts per thread
page on average. So in 85000 seconds (almost a day) it would process ~1.7M
posts (in ~340k forum thread pages) and have them prepared to be imported
into PostgreSQL. With -j4 the observed peak memory usage was 500MB.

I've tested the attached script on the following two systems:
system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
- memory: 4GB DDR RAM @ 1600 MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
#!/bin/bash
#
# This script leverages BaseX as an XQuery command line processor
# by using multiple small disposable BaseX databases, and by parallelizing
# the entire processing.
# It will essentially run XQuery in batches on large data sets, and produce
# SQL insert statements, so the data can be imported into PostgreSQL.
#
# We're packing files for processing, and we're trying to balance them out
# in sets
# such that two constraints are met:
# - no more than 100 files per set
# - no more than 100*90k bytes per set
#
# Timing:
#
# On system1 the following times were recorded:
# If run with -j4 it processes 200 thread pages in 10 seconds,
# and there are about 5 posts per thread page on average.
# So in 85000 seconds (almost a day) it would process ~1.7M posts
# (in 340k forum thread pages) and have them prepared for import into PostgreSQL.
# Again, with -j4, the observed peak memory usage was 500MB.
#
# Notes:
#
# 1)
# The following error (found through strace) would manifest itself,
# mainly because of GNU Parallel:
# --- stopped by SIGTTOU ---
# It's also described here:
# https://notmuchmail.org/pipermail/notmuch/2019/028015.html
# It can be circumvented through the use of script
# (script - make typescript of terminal session)
#
# 2) --linebuffer is used for GNU Parallel so it can write to stdout as
# soon as possible.
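#
# For illustration only (a sketch, not necessarily the exact invocation used
# further down in this script): the shard lists produced by fpart are handed
# to the proc() function below through GNU Parallel, roughly like
#   export -f proc
#   ls meta-process/files-shard.* | parallel --linebuffer -j4 proc {/} {} {#}
# where {/} is the shard's basename, {} the shard file itself and {#} the
# job sequence number.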
#
#
# system1 config:
# - BaseX 9.2.4
# - script (from util-linux 2.31.1)
# - GNU Parallel 20161222
# - Ubuntu 18.04 LTS
# 
# system1 hardware:
# - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
# - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
# - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM 
#
# system2 config:
# - BaseX 9.2.4
# - GNU Parallel 20181222
# - script (from util-linux 2.34)
#
# system2 hardware:
# - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
# - memory: 4GB DDR RAM @ 1600 MHz
# - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
#
#

BASEX="$HOME/basex-preloaded-forums/bin/basex"
mkdir -p meta-process tmp
echo "Partitioning files into different sets ..."
#fpart -f 100 -s $((100 * 9)) `pwd`/threads/ -o meta-process/files-shard

proc() {
s="$1"
f="$2"
j="$3"

echo "$s -- $j -- $f"

# Build script to create temp db, and import all the html files
SHARD_IMPORT_SCRIPT=$(pwd)"/tmp/import.script.$s"
SHARD_PROCESS_SCRIPT=$(pwd)"/tmp/process.script.$s"
SHARD_SQL_OUT=$(pwd)"/tmp/partial.$j.sql"
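# The import script below creates a throw-away database tmp-$s and adds all
# HTML files of this shard to it, using BaseX's (TagSoup-based) HTML parser
# so that real-world HTML is turned into well-formed XML first.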

cat << EOF > $SHARD_IMPORT_SCRIPT
DROP   DB tmp-$s
CREATE DB tmp-$s
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
EOF
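# Turn every path listed in the shard file into an "ADD <path>" command.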
cat "$f" | perl -pne 's{^}{ADD }g;' >> "$SHARD_IMPORT_SCRIPT"
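# script(1) is only used here to work around the SIGTTOU stop described in
# note 1) above.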
 
script --return -c "$BASEX < $SHARD_IMPORT_SCRIPT >/dev/null ; echo 'Importing Done'; "
 
# Build processing script, to pull values and build SQL queries
echo "for \$x in