On 28 January 2015 at 10:34, Andy Seaborne <[email protected]> wrote:

> tdbloader does not do better when it's an existing, non-empty database. It
> avoids some transactional scaling issues but otherwise uploading to the
> server live is much the same.

Uploading many files of several GB of data over HTTP sounds fragile to
me.. and anyway, at least in my case, I don't have those files on the
machine with the browser :)

tdbloader2 might be good for the biggest stuff, though, as for me at
least it seems to give a big improvement in performance - but at the
risk of not loading anything at all if something goes wrong with one
of the files, as it does the indexing at the end. (?)

BTW - is it normal that tdbloader performance decreases as more triples
go in? Is it getting more expensive to maintain the indexes or to match
existing identifiers?

INFO  Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
INFO  Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
INFO  Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
INFO  Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
INFO  Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)

This is from before I understood JVM_ARGS so it's probably memory
unbounded (my tdbloader didn't set -Xmx), using about 5 GB or so of
heap.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20203 root      20   0 21.655g 5.066g 3.225g S  30.6 74.7 257:36.85 java

(it's IO-bound - I don't have those fancy SSD raids at home :)

> Don't follow. config-tdb-dir is that minimal config isn't it?
> The templates take NAME and DIRectory. That's it.

Right, I can replace those variables with sed, so that should work - I
guess from the dist I could just unzip the template from the
fuseki-server.jar. (Or should it perhaps better be exposed in the
dist?)

>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>
>
> There is a timestamp and an incremental count (ATM "21")
> jena-fuseki-dist-2.0.0-20150128.100051-21.zip

I already parse that with xpath:

https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/Dockerfile#L52

..
but a more robust way would be to temporarily install mvn and have a
dummy pom.xml whose <version> is updated from the parent. I went with
xpath for now as the Maven route would need more cleanup.. Maven would
download many things that it won't really need, which then again must
be deleted from /root/.m2

Would there be general interest in adding the Fuseki2 docker image to
Jena, or have I made another novelty thing :) ?

(Here is Virtuoso: https://registry.hub.docker.com/u/stain/virtuoso/ )

If it went in, it would need to add that kind of Maven polishing, of
course, so that it works smoothly in releases. I don't think I would
try to get Maven to actually build the Docker image, as that would put
in quite strong OS requirements.

Presumably - if Jena was to upload such an image officially - it would
then have to be voted over as (part of) a release, even though it
would primarily contain the fuseki2 dist.

Would there be licensing issues over the docker image depending on
Linux and OpenJDK (or Oracle JDK)? Docker folks seem to just not worry
much about licensing :-/

There could be some issues with layering - my image now contains one
layer which adds both pwgen (GPL) and Fuseki (Apache 2.0) - but that's
easy enough to split. (or FreeBSD to the rescue!?)

There's also this docker image by Tazro Inutano Ohta:

https://registry.hub.docker.com/u/inutano/jena/dockerfile/

which simply provides the official Jena distribution under
/apache-jena-2.12.1/

With a bit more work this could allow usage of the tdb commands as-is.

-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718
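
As an illustration of the snapshot lookup discussed above (timestamp
plus incremental count from maven-metadata.xml), here is a minimal
Python sketch. It is not the xpath call used in the Dockerfile. The
function name snapshot_zip_name and the inline SAMPLE document are
hypothetical; the sample only mirrors the values quoted in this thread
and assumes the usual un-namespaced Maven snapshot metadata layout
(artifactId, version, versioning/snapshot/timestamp and buildNumber).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the values quoted in this thread; the real
# file would be fetched from the snapshot repository URL given above.
SAMPLE = """<metadata>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-fuseki-dist</artifactId>
  <version>2.0.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20150128.100051</timestamp>
      <buildNumber>21</buildNumber>
    </snapshot>
  </versioning>
</metadata>"""


def snapshot_zip_name(metadata_xml: str) -> str:
    """Build the timestamped .zip file name from snapshot metadata.

    Assumes the metadata has no XML namespace (an assumption, not checked
    against every repository manager).
    """
    root = ET.fromstring(metadata_xml)
    fields = {
        "artifactId": root.findtext("artifactId"),
        "version": root.findtext("version"),
        "timestamp": root.findtext("versioning/snapshot/timestamp"),
        "buildNumber": root.findtext("versioning/snapshot/buildNumber"),
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError("metadata is missing: " + ", ".join(missing))

    # 2.0.0-SNAPSHOT -> 2.0.0; the SNAPSHOT suffix is replaced by
    # <timestamp>-<buildNumber> in the deployed file name.
    version = fields["version"]
    suffix = "-SNAPSHOT"
    base = version[: -len(suffix)] if version.endswith(suffix) else version

    return "{}-{}-{}-{}.zip".format(
        fields["artifactId"], base, fields["timestamp"], fields["buildNumber"]
    )


if __name__ == "__main__":
    print(snapshot_zip_name(SAMPLE))
```

With the sample above this produces the file name quoted earlier,
jena-fuseki-dist-2.0.0-20150128.100051-21.zip. Failing loudly on a
missing element is a deliberate choice here, so that a changed metadata
layout breaks the image build rather than silently fetching the wrong
file.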
