On 28/01/15 12:23, Stian Soiland-Reyes wrote:
On 28 January 2015 at 10:34, Andy Seaborne <[email protected]> wrote:
tdbloader does not do better when it's an existing, non-empty database. It avoids some transactional scaling issues but otherwise uploading to the server live is much the same.
Uploading many files of several GB of data over HTTP sounds fragile to me, and anyway, at least in my case, I don't have those files on the machine with the browser :)
curl and GSP. If you are ultra clever/careful/... , rsync.
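For illustration, the same kind of Graph Store Protocol upload that curl would do can be sketched in Python. This is a rough sketch only: the endpoint URL `http://localhost:3030/ds/data`, the function names and the Turtle content type are assumptions made for the example, not something taken from this thread.

```python
import urllib.parse
import urllib.request


def build_gsp_request(data_url, payload, graph_uri=None, content_type="text/turtle"):
    """Build a Graph Store Protocol POST that adds `payload` to a graph."""
    # GSP addresses the default graph with ?default and a named graph
    # with ?graph=<uri>.
    if graph_uri is None:
        query = "default"
    else:
        query = urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        f"{data_url}?{query}",
        data=payload,
        method="POST",  # POST merges into the graph; PUT would replace it
        headers={"Content-Type": content_type},
    )


def upload_file(data_url, path, graph_uri=None):
    """Send one file and return the HTTP status code.

    The file is read into memory in one go, so multi-GB dumps would
    need to be split into smaller files first.
    """
    with open(path, "rb") as f:
        request = build_gsp_request(data_url, f.read(), graph_uri)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `upload_file("http://localhost:3030/ds/data", "part-001.ttl")` once per file (the file name is hypothetical) would add each file to the default graph, so a failure only affects the file being sent.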
tdbloader2 might be good for the biggest stuff, though, as for me at least it seems to give a big improvement in performance - but at the risk of not loading anything at all if something goes wrong with one of the files, as it does the indexing at the end. (?)
BTW - is it normal that tdbloader performance decreases as more triples go in? Is it getting more expensive to maintain the indexes or match existing identifiers?
INFO Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
INFO Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
INFO Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
INFO Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
INFO Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)
Yes. Both the node table and the SPO or GSPO indexes slow down over time. Mainly the node table; that also affects tdbloader2.
This is from before I understood JVM_ARGS so it's probably memory unbounded (my tdbloader didn't set -Xmx), using about 5 GB or so of heap.
Too much. On 64 bit - most of the work and index bytes are in memory mapped files, not the heap.
PID   USER PR NI VIRT    RES    SHR    S %CPU %MEM TIME+     COMMAND
20203 root 20 0  21.655g 5.066g 3.225g S 30.6 74.7 257:36.85 java
(it's IO-bound - I don't have those fancy SSD raids at home :)
Don't follow. config-tdb-dir is that minimal config isn't it? The templates take NAME and DIRectory. That's it.
Right, I can replace those variables with sed, so that should work - I guess from the dist I could just unzip the template from the fuseki-server.jar. (Or should it perhaps better be exposed in the dist?)
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
There is a timestamp and an incremental count (ATM "21"): jena-fuseki-dist-2.0.0-20150128.100051-21.zip
I already parse that with xpath: https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/Dockerfile#L52 .. but a more robust way would be to temporarily install mvn and have a dummy pom.xml whose <version> is updated from the parent. I went with xpath for now as that would need more cleanup: Maven would download many things that it won't really need, which then must be deleted from /root/.m2.
Would there be general interest in adding the Fuseki2 docker image to Jena, or have I made another novelty thing :) ? (Here is Virtuoso: https://registry.hub.docker.com/u/stain/virtuoso/ )
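As an aside on the snapshot naming mentioned above (timestamp plus incremental count), here is a rough sketch of deriving the dist file name from maven-metadata.xml. The sample document is an assumption: it is a cut-down illustration of the usual Maven snapshot metadata layout, filled in with the values quoted in this thread. The function name is likewise made up for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: a trimmed metadata document using the
# timestamp and build number quoted in this thread.
SAMPLE_METADATA = """\
<metadata>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-fuseki-dist</artifactId>
  <version>2.0.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20150128.100051</timestamp>
      <buildNumber>21</buildNumber>
    </snapshot>
  </versioning>
</metadata>
"""


def snapshot_filename(metadata_xml, extension="zip"):
    """Derive the timestamped snapshot file name from maven-metadata.xml text."""
    root = ET.fromstring(metadata_xml)
    artifact = root.findtext("artifactId")
    # "2.0.0-SNAPSHOT" -> "2.0.0"; the timestamp and count replace the suffix.
    base_version = root.findtext("version").replace("-SNAPSHOT", "")
    timestamp = root.findtext("versioning/snapshot/timestamp")
    build_number = root.findtext("versioning/snapshot/buildNumber")
    return f"{artifact}-{base_version}-{timestamp}-{build_number}.{extension}"


print(snapshot_filename(SAMPLE_METADATA))
```

This only covers the naming step; fetching the metadata and downloading the zip are left out.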
If it's universal. If there are any "decisions" (and given any need to reset shiro.ini before containerization there might well be), adding the instructions might be better.
Centralization of artifacts can be convenient but also a limitation on the team.
If it went in, it would need to add that kind of Maven polishing, of course, so that it works smoothly in releases. I don't think I would try to get Maven to actually build the Docker image, that would put in quite strong OS requirements. Presumably - if Jena was to upload such an image officially - it would then have to be voted over as (part of) a release, even though it would primarily contain the fuseki2 dist.
Yes.
Would there be licensing issues over the docker image depending on Linux and OpenJDK (or Oracle JDK)? Docker folks seem to just not worry much about licensing :-/ There could be some issues with layering - my image now contains one layer which adds both pwgen (GPL) and Fuseki (Apache 2.0) - but that's easy enough to split. (or FreeBSD to the rescue!?)
There's also this docker image by Tazro Inutano Ohta: https://registry.hub.docker.com/u/inutano/jena/dockerfile/ which simply provides the official Jena distribution under /apache-jena-2.12.1/. With a bit more work this could allow usage of the tdb commands as-is.
Andy
