On 16/11/17 07:53, Osma Suominen wrote:
> Hi all (especially Andy)
I am weeks behind on email.
> 4. Should there be a JIRA issue about the bad Content-Length values
> reported by Fuseki?
I don't see any connection to TDB2.
Please separate this out into another email - what is the problem and
does it apply to the current codebase?
(NB I have encountered some tools getting the Content-Length wrong.)
Andy
I'd appreciate some kind of response to the questions at the end of
the message below. I can open the JIRA issues if necessary. I realize
this was posted during the hectic release process.
I have done some further testing of TDB & TDB2 disk space usage and
will write a separate message about that, probably this time on users@.
-Osma
Osma Suominen wrote on 27.10.2017 at 13:44:
Hi,
As promised earlier, I took TDB2 for a little test drive, using the
3.5.0rc1 builds.
I tested two scenarios: a server running Fuseki, and the command line
tools operating directly on a database directory.
1. Server running Fuseki
First, the server (running in a VM). Until now I've been using Fuseki
with HDT support, from the hdt-java repository. I'm serving a dataset
of about 39M triples, which occasionally changes (eventually this will
be updated once per month, or perhaps more frequently, even once per
day). With HDT, I can simply rebuild the HDT file (less than 10
minutes) and then restart Fuseki. Downtime for the endpoint is only a
few seconds. But I'm worried about the state of the hdt-java project:
it is not actively maintained and it is still based on Fuseki1.
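For reference, the rebuild step is roughly the following (rdf2hdt.sh
is the conversion script shipped with hdt-java; file names are just
placeholders):

    # regenerate the HDT file from the source data (<10 min for 39M triples)
    rdf2hdt.sh dataset.nt dataset.hdt
    # then restart the Fuseki service so it picks up the new file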
So I switched (for now) to Fuseki2 with TDB2. It was rather smooth
thanks to the documentation that Andy provided. I usually create
Fuseki2 datasets via the API (using curl), but I noticed that, like
the UI, the API only supports "mem" and "tdb". So I created a "tdb"
dataset first, then edited the configuration file so it uses tdb2
instead.
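For the record, this is roughly what I did (the dataset name "ds" is a
placeholder):

    # create a "tdb" dataset via the admin API
    curl -XPOST --data 'dbName=ds&dbType=tdb' 'http://localhost:3030/$/datasets'

Then, in the generated configuration file (under
$FUSEKI_BASE/configuration/), I switched the dataset definition from
tdb: to tdb2:, ending up with something like this (node names may
differ from the generated template):

    @prefix tdb2: <http://jena.apache.org/2016/tdb#> .

    :dataset a tdb2:DatasetTDB2 ;
        tdb2:location "/path/to/databases/ds" .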
Loading the data took about 17 minutes. I used wget for this, per
Andy's example. This is a bit slower than regenerating the HDT, but
acceptable since I'm only doing it occasionally. I also tested
executing queries while reloading the data. This seemed to work OK
even though performance obviously did suffer. But at least the
endpoint remained up.
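The load itself was a single HTTP request replacing the default graph,
something along these lines (assuming a wget new enough to have
--method/--body-file; the URL and file name are placeholders):

    wget --method=PUT --body-file=dataset.ttl \
         --header='Content-Type: text/turtle' \
         -O - 'http://localhost:3030/ds/data?default'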
The TDB2 directory ended up at 4.6GB. In contrast, the HDT file +
index for the same data is 560MB.
I reloaded the same data, and the TDB2 directory grew to 8.5GB, almost
twice its original size. I understand that TDB2 needs to be compacted
regularly; otherwise it will keep growing. I'm OK with the large disk
space usage as long as it stays constant rather than growing over time
like TDB1.
2. Command line tools
For this I used an older version of the same dataset with 30M triples,
the same one I used for my HDT vs TDB comparison that I posted on the
users mailing list:
http://mail-archives.apache.org/mod_mbox/jena-users/201704.mbox/%3C90c0130b-244d-f0a7-03d3-83b47564c990%40iki.fi%3E
This was on my i3-2330M laptop with 8GB RAM and SSD.
Loading the data using tdb2.tdbloader took about 18 minutes (about 28k
triples per second). The TDB2 directory is 3.7GB. In contrast, using
tdbloader2, loading took 11 minutes and the TDB directory was 2.7GB.
So TDB2 is slower to load and takes more disk space than TDB.
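The exact commands, for reference (DB2 and DB are the target
directories; dataset.nt is a placeholder for the input file):

    # TDB2: ~18 min, ~28k triples/s
    tdb2.tdbloader --loc=DB2 dataset.nt

    # TDB, bulk loader: ~11 min
    tdbloader2 --loc=DB dataset.nt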
I ran the same example query I used before against the TDB2 database. The first time
was slow (33 seconds), but subsequent queries took 16.1-18.0 seconds.
I also re-ran the same query on TDB using tdbquery on Jena 3.5.0rc1.
The query took 13.7-14.0 seconds after the first run (24 seconds).
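The queries were run like this (query.rq contains the example query
from the earlier post):

    # against TDB2
    tdb2.tdbquery --loc=DB2 --query=query.rq

    # against TDB (Jena 3.5.0rc1)
    tdbquery --loc=DB --query=query.rq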
I also reloaded the same data into the TDB2 database to see the effect. Reloading
took 11 minutes and the database grew to 5.7GB. Then I compacted it
using tdb2.tdbcompact. Compacting took 18 minutes and the disk usage
just grew further, to 9.7GB. The database directory then contained
both Data-0001 and Data-0002 directories. I removed Data-0001 and disk
usage fell to 4.0GB. Not quite the same as the original 3.7GB, but close.
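The compact-and-clean sequence, for reference (removing the old
generation is the manual step):

    tdb2.tdbcompact --loc=DB2   # writes a new Data-000N generation; ~18 min
    du -sh DB2                  # 9.7GB: old generation + new generation
    rm -r DB2/Data-0001         # manually remove the stale generation
    du -sh DB2                  # 4.0GB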
My impressions so far: It works, but it's slower than TDB and needs
more disk space. Compaction seems to work, but initially it will just
increase disk usage. The stale data has to be manually removed to
actually reclaim any space. I didn't test subsequent load/compact
cycles, but I assume there may still be some disk space growth (e.g.
due to blank nodes, of which there are some in my dataset) even if the
data is regularly compacted.
For me, the crucial promise of TDB2 is that, unlike TDB, its disk
usage does not keep growing over time. Right now it's not clear whether
it entirely fulfills this promise, since compaction has to be triggered
manually and doesn't actually reclaim disk space by itself.
Questions/suggestions:
1. Is it possible to trigger a TDB2 compaction from within Fuseki? I'd
prefer not to take the endpoint down for compaction.
2. Should the stale data be deleted after compaction, at least as an
option?
3. Should there be a JIRA issue about UI and API support for creating
TDB2 datasets?
4. Should there be a JIRA issue about the bad Content-Length values
reported by Fuseki?
-Osma