[basex-talk] // versus /*/
Do these have the same meaning? Might there be a subtle distinction, or might they be read differently but functionally identical?

> xquery /*/book[@id="bk112"]
Galos, Mike Visual Studio 7: A Comprehensive Guide Computer 49.95 2001-04-16 Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.
Query executed in 1.22 ms.

> xquery //book[@id="bk112"]/title
Visual Studio 7: A Comprehensive Guide
Query executed in 1.52 ms.

> xquery /*/book[@id="bk112"]/title
Visual Studio 7: A Comprehensive Guide
Query executed in 1.67 ms.

They're equally efficient, at least as used above?

thanks,

Thufir
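[Editor's aside: the two paths select the same nodes only while every book is a direct child of the root element; // (short for /descendant-or-self::node()/) also reaches elements nested deeper. A minimal XQuery sketch with a hypothetical document, not the data used above:

    let $d := document {
      <catalog>
        <book id="bk112"/>
        <section><book id="bk113"/></section>
      </catalog>
    }
    return (
      count($d//book),   (: 2: books at any depth :)
      count($d/*/book)   (: 1: children of the root element only :)
    )

On flat data the two are interchangeable, and BaseX tends to rewrite such paths into similar plans, so near-identical timings are plausible.]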
Re: [basex-talk] get list of databases
thanks, worked perfectly.

On 2019-10-06 3:42 p.m., Christian Grün wrote:
> Try new List().execute(ctx); see [1] for details.
>
> Cheers,
> Christian
>
> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/core/cmd/List.java#L28
Re: [basex-talk] get list of databases
Try new List().execute(ctx); see [1] for details.

Cheers,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/core/cmd/List.java#L28

thufir wrote on Mon., Oct. 7, 2019, 00:35:

> Hi,
>
> pardon the silly question, but how do I get a list of databases exactly? I'm looking at:
>
> https://github.com/BaseXdb/basex/blob/master/basex-examples/src/main/java/org/basex/examples/local/CreateCollection.java
>
> and want output like:
>
> > list
> Name   Resources  Size  Input Path
> ----------------------------------
> books  0          4570
>
> 1 database(s).
>
> but using:
>
> http://docs.basex.org/javadoc/org/basex/core/cmd/List.html
>
> thanks,
>
> Thufir
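[Editor's note, for completeness: the same listing is available from XQuery, which can be handy for cross-checking the Java results:

    db:list()          (: sequence of database names :)
    db:list-details()  (: per-database details such as resource count and size, as printed by LIST :)
]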
[basex-talk] get list of databases
Hi,

pardon the silly question, but how do I get a list of databases exactly? I'm looking at:

https://github.com/BaseXdb/basex/blob/master/basex-examples/src/main/java/org/basex/examples/local/CreateCollection.java

and want output like:

> list
Name   Resources  Size  Input Path
----------------------------------
books  0          4570

1 database(s).

but using:

http://docs.basex.org/javadoc/org/basex/core/cmd/List.html

thanks,

Thufir
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
Some complementary notes (others may be able to tell you more about their experiences with large data sets):

> a GiST index would have to be built there, to allow full-text searches; PostgreSQL is picked

You could as well have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.

I assume you are referring to the SQL Module? Feel free to attach the OOM stack trace, it might give us more insight.

I would recommend you to write SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and run/import this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
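[Editor's sketch of that dump-to-disk suggestion; the database names and element selectors are borrowed from the scripts later in this thread, and the target path is hypothetical:

    let $esc := function($s) { replace($s, "'", "''") }
    let $stmts :=
      for $db in db:list()[matches(., 'linuxquestions-shard-')]
      for $x in db:open($db)
      let $label   := $x/fn:base-uri()
      let $content := string-join($x//*[matches(@id, 'post_message_')]/text(), ' ')
      return "INSERT INTO thread(label,content,forum) VALUES ('"
        || $esc($label) || "','" || $esc($content) || "','" || $esc($db) || "');"
    return file:write-text-lines('/share/Public/tmp/dump.sql', $stmts)

A single psql -U postgres -d archive -f /share/Public/tmp/dump.sql then imports everything in one pass instead of millions of round trips over JDBC.]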
[basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
Hello,

This is essentially part 2 of trying to index large amounts of web data. To summarize what happened before: the initial discussion started here [1], Christian suggested some options, I dove into each of them, and I realized that doing this on a low-memory system is harder than I initially thought. At Christian's suggestion, I tried to split the big db into smaller dbs and came up with a rudimentary sharding mechanism [3]. All attempts to full-text index 30GB of data in BaseX, for me, resulted in OOM (do take into consideration that I only have 3.1GB of memory to allocate for BaseX).

Where to? I decided to look more into what Christian said in [2] about option 2: to pick the exact values that I want and to transfer them to PostgreSQL. (After transferring, a GiST index would have to be built there, to allow full-text searches; PostgreSQL is picked because it uses an in-memory buffer for all large operations, and several files on disk, and if it needs to combine results that exceed the available memory, it goes to disk, but at all times it never exceeds the given amount of memory.)

Variant 1 (see attached script pg-import.sh)

All good. So, I basically started writing XQuery that would do the following:
- Open up a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, and the db it belonged to
- Create a prepared statement for one such thread page, populate the prepared statement, and execute it

This ended up in OOM after around 250k records. So, just to be clear, 250k records became rows in PostgreSQL, which is nice, but eventually it ended up in OOM. (Perhaps it has to do with how the GC works in Java; I don't know.)

Variant 2 (see attached script pg-import2.sh)

I did something similar to the above:
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message content, the post id, and the BaseX db name (because we're going over all shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above, and execute it

This also ended up in OOM, after around 340k records (my approximation would be that there are around 3M posts in the data).

To summarize, I'm tempted to believe that there might be a leak in the BaseX implementation of XQuery. The relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
  Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory param value was -Xmx3100m

I would be interested to know your thoughts.

Thanks,
Stefan

[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html

#!/bin/bash
mkdir tmp
# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";
\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;
DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
    id SERIAL PRIMARY KEY,
    content TEXT,
    label VARCHAR(300),
    forum VARCHAR(300)
);
-- CREATE INDEX idx_thread_content ON thread USING gist(content);
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);
EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql

cat << 'EOF' > tmp/import.xq
let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x) { matches($x, "linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label := $x/fn:base-uri()
let $content := $x//*[matches(@id, "post_message_")]/text()
let $params :=
  <sql:parameters>
    <sql:parameter type='string'>{$label}</sql:parameter>
    <sql:parameter type='string'>{$content}</sql:parameter>
    <sql:parameter type='string'>{$db}</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return try {
  sql:execute-prepared($prep, $params)
} catch * {
  'Error [' || $err:code || ']: ' || $err:description || '--' || $params
}
EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq

#!/bin/bash
mkdir tmp
# TODO: work on the full-text part in PostgreSQL
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";
\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;
DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
    id SERIAL PRIMARY KEY,
    date VARCHAR(300),
    author VARCHAR(300),
    post_id VARCHAR(300),
    doc_uri VARCHAR(300),
    content TEXT,
    basex_db VARCHAR(300)
);
CREATE INDEX idx_thread2_date ON thread2(date);
CREATE
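[Editor's observation on the OOM, speculative rather than a confirmed diagnosis: both scripts call sql:prepare inside the per-document loop, so a new statement handle is created for every record and none is released until the query finishes. Hoisting the prepare out of the loop keeps a single handle alive for the whole run and is cheap to try:

    let $pgconn := sql:connect("jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres")
    (: one prepared statement, reused for every row :)
    let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
    for $db in db:list()[matches(., "linuxquestions-shard-")]
    for $x in db:open($db)
    return sql:execute-prepared($prep,
      <sql:parameters>
        <sql:parameter type='string'>{$x/fn:base-uri()}</sql:parameter>
        <sql:parameter type='string'>{string-join($x//*[matches(@id, "post_message_")]/text(), " ")}</sql:parameter>
        <sql:parameter type='string'>{$db}</sql:parameter>
      </sql:parameters>)
]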
Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/
Regarding selective full-text indexing, I just tried

XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex': true(), 'ftinclude': 'div table td a' })

and I got OOM on that; the exact stacktrace is attached in this message. I will open a separate thread regarding migrating the data from BaseX shards to PostgreSQL (for the purpose of full-text indexing).

On Sun, Oct 6, 2019 at 10:19 AM Christian Grün wrote:

> The current full-text index builder provides a similar outsourcing mechanism to that of the index builder for the default index structures, but the meta data structures are kept in main memory, and they are more bulky. There are definitely ways to tackle this technically; it hasn't been of high priority so far, but this may change.
>
> Please note that you won't create an index over your whole data set in an RDBMS. Instead, you'll usually create it for specific fields that you will query later on. It's a convenience feature in BaseX that you can build an index for all of your data. For large full-text corpora, however, it's recommendable in most cases to restrict indexing to the relevant XML elements.
>
> first name last name wrote on Sat., Oct. 5, 2019, 23:28:
>
>> Attached is a more complete output of ./bin/basexhttp . Judging from this output, it would seem that everything was ok, except for the full-text index.
>> I now realize that I have another question about full-text indexes. It seems like the full-text index here is dependent on the amount of memory available (in other words, the more data to be indexed, the more RAM required).
>>
>> I was using a certain popular RDBMS for full-text indexing, and I never bumped into problems like it running out of memory when building such indexes.
>> I think their model uses a certain buffer in memory, and it keeps multiple files on disk where it stores data, and then it assembles the results in memory, but always keeping the constraint of using only as much memory as it was allowed to use.
>> Perhaps the topic would be "external memory algorithms" or "full-text search using secondary storage".
>> I'm not an expert in this field, but my question here would be: is this kind of thing something that BaseX is looking to handle in the future?
>>
>> Thanks,
>> Stefan
>>
>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün wrote:
>>
>>> The stack trace indicates that you enabled the full-text index as well. For this index, you definitely need more memory than available on your system.
>>>
>>> So I assume you didn't encounter trouble with the default index structures?
>>>
>>> first name last name wrote on Sat., Oct. 5, 2019, 20:52:
>>>
>>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
>>>>
>>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>
>>>>> About option 1: How much memory have you been able to assign to the Java VM?
>>>>>
>>>>> first name last name wrote on Sat., Oct. 5, 2019, 01:11:
>>>>>
>>>>>> I had another look at the script I wrote and realized that it's not working as it's supposed to.
>>>>>> Apparently the order of operations should be this:
>>>>>> - turn on all the types of indexes required
>>>>>> - create the db
>>>>>> - the parser settings and the filter settings
>>>>>> - add all the files to the db
>>>>>> - run "OPTIMIZE"
>>>>>>
>>>>>> If I'm not doing them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> About option 4:
>>>>>>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>>>>>>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>>>>>>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node count values and other variables, can be achieved via glpk [6], but that might be too much work).
>>>>>>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files, and fpart seems to have split the file list into 11 partitions of 33k files each, the size of a partition being ~ 2.4GB.
>>>>>>> So, I wrote a script for this; it's called sharded-import.sh and attached here. I'm also noticing that the /dba/ BaseX web interface is not
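[Editor's sketch related to the ordering note quoted above; the names, paths and exact option set are assumptions: db:create accepts indexing and parsing options directly, which sidesteps the SET/CREATE/ADD/OPTIMIZE ordering pitfall because enabled indexes are built as part of creation:

    db:create(
      'linuxquestions-shard-1',
      '/share/Public/archive/tech-sites/linuxquestions.org/threads/',
      (),
      map { 'ftindex': true(), 'parser': 'html', 'createfilter': '*.html' }
    )
]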
Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/
The current full-text index builder provides a similar outsourcing mechanism to that of the index builder for the default index structures, but the meta data structures are kept in main memory, and they are more bulky. There are definitely ways to tackle this technically; it hasn't been of high priority so far, but this may change.

Please note that you won't create an index over your whole data set in an RDBMS. Instead, you'll usually create it for specific fields that you will query later on. It's a convenience feature in BaseX that you can build an index for all of your data. For large full-text corpora, however, it's recommendable in most cases to restrict indexing to the relevant XML elements.

first name last name wrote on Sat., Oct. 5, 2019, 23:28:

> Attached is a more complete output of ./bin/basexhttp . Judging from this output, it would seem that everything was ok, except for the full-text index.
> I now realize that I have another question about full-text indexes. It seems like the full-text index here is dependent on the amount of memory available (in other words, the more data to be indexed, the more RAM required).
>
> I was using a certain popular RDBMS for full-text indexing, and I never bumped into problems like it running out of memory when building such indexes.
> I think their model uses a certain buffer in memory, and it keeps multiple files on disk where it stores data, and then it assembles the results in memory, but always keeping the constraint of using only as much memory as it was allowed to use.
> Perhaps the topic would be "external memory algorithms" or "full-text search using secondary storage".
> I'm not an expert in this field, but my question here would be: is this kind of thing something that BaseX is looking to handle in the future?
>
> Thanks,
> Stefan
>
> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün wrote:
>
>> The stack trace indicates that you enabled the full-text index as well. For this index, you definitely need more memory than available on your system.
>>
>> So I assume you didn't encounter trouble with the default index structures?
>>
>> first name last name wrote on Sat., Oct. 5, 2019, 20:52:
>>
>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
>>>
>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün wrote:
>>>
>>>> About option 1: How much memory have you been able to assign to the Java VM?
>>>>
>>>> first name last name wrote on Sat., Oct. 5, 2019, 01:11:
>>>>
>>>>> I had another look at the script I wrote and realized that it's not working as it's supposed to.
>>>>> Apparently the order of operations should be this:
>>>>> - turn on all the types of indexes required
>>>>> - create the db
>>>>> - the parser settings and the filter settings
>>>>> - add all the files to the db
>>>>> - run "OPTIMIZE"
>>>>>
>>>>> If I'm not doing them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>>>>>
>>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> About option 4:
>>>>>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>>>>>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>>>>>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node count values and other variables, can be achieved via glpk [6], but that might be too much work).
>>>>>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files, and fpart seems to have split the file list into 11 partitions of 33k files each, the size of a partition being ~ 2.4GB.
>>>>>> So, I wrote a script for this; it's called sharded-import.sh and attached here. I'm also noticing that the /dba/ BaseX web interface is not blocked anymore if I run this script, as opposed to running the previous import where I run
>>>>>> CREATE DB db_name /directory/
>>>>>> which allows me to see the progress or allows me to run queries before the big import finishes.
>>>>>> Maybe the downside is that it's more verbose, and prints out a ton of lines like
>>>>>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>>>> Resource(s) added in 47.76 ms.
>>>>>> along the way, and maybe that's slower than before.
>>>>>>
>>>>>> About option 1:
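[Closing editor's note on the index-memory discussion above: BaseX exposes a splitsize indexing option that makes index construction flush partial data to disk more often; the documentation describes it for the value indexes, so treating it as a remedy for the full-text builder is an assumption, but it combines naturally with the selective ftinclude approach tried earlier in this thread:

    db:optimize(
      'linuxquestions.org-selective',
      true(),
      map { 'ftindex': true(), 'ftinclude': 'div table td a', 'splitsize': 12 }
    )
]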