Re: Fuseki 2 docker image - some questions

Stian Soiland-Reyes Wed, 28 Jan 2015 04:24:22 -0800

On 28 January 2015 at 10:34, Andy Seaborne <[email protected]> wrote:

> tdbloader does not do better when it's an existing, non-empty database.  It
> avoids some transactional scaling issues but otherwise uploading to the
> server live is much the same.


Uploading many files of several GB of data over HTTP sounds fragile to
me.. and anyway, at least in my case, I don't have those files on the
machine with the browser :)

tdbloader2 might be good for the biggest stuff, though, as for me at
least it seems to give a big improvement in performance - but at the
risk of not loading anything at all if something goes wrong with one
of the files, as it does the indexing at the end. (?)


BTW - is it normal that tdbloader performance decreases as more triples
go in? Is it getting more expensive to maintain the indexes or to match
existing identifiers?

INFO  Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
INFO  Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
INFO  Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
INFO  Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
INFO  Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)

This is from before I understood JVM_ARGS so it's probably memory
unbounded (my tdbloader didn't set -Xmx), using about 5 GB or so of
heap.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20203 root      20   0 21.655g 5.066g 3.225g S  30.6 74.7 257:36.85 java

(it's IO-bound - I don't have those fancy SSD raids at home :)
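
To watch the slowdown without eyeballing the log, the Batch and Avg
figures can be pulled out of the "Add:" lines. This is only a rough
sketch: the regular expression assumes exactly the line format in the
excerpt above, and parse_add_line is a made-up helper name, not
something that ships with Jena.

```python
import re

# Pattern for progress lines like the ones in the excerpt above.
# The format is assumed from that excerpt only.
LINE_RE = re.compile(
    r"Add:\s*([\d,]+)\s+triples\s+\(Batch:\s*([\d,]+)\s*/\s*Avg:\s*([\d,]+)\)"
)


def parse_add_line(line):
    """Return (total_triples, batch, avg) as ints, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return tuple(int(group.replace(",", "")) for group in match.groups())


# The five log lines quoted in this message.
LOG = """\
INFO  Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
INFO  Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
INFO  Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
INFO  Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
INFO  Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)
"""

rows = [row for row in map(parse_add_line, LOG.splitlines()) if row is not None]
for total, batch, avg in rows:
    print(f"{total:>12,d} triples  batch={batch:>6,d}  avg={avg:>6,d}")
```

The Batch figure jumps around, but the Avg column falls steadily over
these five samples, which is the trend I am asking about.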


> Don't follow.  config-tdb-dir is that minimal config isn't it?
> The templates take NAME and DIRectory.  That's it.

Right, I can replace those variables with sed, so that should work - I
guess from the dist I could just unzip the template from the
fuseki-server.jar.  (Or should it perhaps better be exposed in the
dist?)
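
The substitution itself is trivial whichever tool does it. A minimal
sketch in Python, assuming the placeholders are written as {NAME} and
{DIR} in the template; the sample text below is a shortened,
hypothetical stand-in and not the real config-tdb-dir file, and
fill_template is just an illustrative name.

```python
def fill_template(template_text, name, directory):
    """Replace the {NAME} and {DIR} placeholders (placeholder syntax is an assumption)."""
    return template_text.replace("{NAME}", name).replace("{DIR}", directory)


# Hypothetical, heavily shortened stand-in for the real template.
SAMPLE_TEMPLATE = '''\
# Dataset {NAME}
fuseki:name "{NAME}" ;
tdb:location "{DIR}" ;
'''

# Example values; the dataset name and path are made up for illustration.
config = fill_template(SAMPLE_TEMPLATE, "ds", "/fuseki/databases/ds")
print(config)
```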


>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>
>
> There is a timestamp and an incremental count (ATM "21")
> jena-fuseki-dist-2.0.0-20150128.100051-21.zip

I already parse that with xpath:

https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/Dockerfile#L52

.. but a more robust way would be to temporarily install mvn and have a
dummy pom.xml whose <version> is updated from the parent. I went with
xpath for now, as the Maven route would need more cleanup: Maven would
download many things that it doesn't really need, which must then be
deleted again from /root/.m2
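
For reference, the same lookup that the xpath step does can be sketched
in Python. This assumes the usual layout of a Maven snapshot
maven-metadata.xml (a <snapshot> element holding <timestamp> and
<buildNumber>); the sample document is trimmed down by hand using the
values quoted above, and snapshot_zip_name is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample using the timestamp and build number quoted above.
# The real metadata file contains more elements than this.
SAMPLE_METADATA = """\
<metadata>
  <groupId>org.apache.jena</groupId>
  <artifactId>jena-fuseki-dist</artifactId>
  <version>2.0.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20150128.100051</timestamp>
      <buildNumber>21</buildNumber>
    </snapshot>
  </versioning>
</metadata>
"""


def snapshot_zip_name(metadata_xml):
    """Build the timestamped .zip file name from snapshot metadata."""
    root = ET.fromstring(metadata_xml)
    artifact = root.findtext("artifactId")
    version = root.findtext("version")
    timestamp = root.findtext("versioning/snapshot/timestamp")
    build = root.findtext("versioning/snapshot/buildNumber")
    # Strip the -SNAPSHOT suffix; it is replaced by timestamp and build number.
    suffix = "-SNAPSHOT"
    base = version[: -len(suffix)] if version.endswith(suffix) else version
    return f"{artifact}-{base}-{timestamp}-{build}.zip"


print(snapshot_zip_name(SAMPLE_METADATA))
```

With the sample above this gives the same file name Andy quoted.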


Would there be general interest in adding the Fuseki2 docker image to
Jena, or have I made another novelty thing :) ? (Here is Virtuoso:
https://registry.hub.docker.com/u/stain/virtuoso/ )


If it went in, it would need that kind of Maven polishing, of
course, so that it works smoothly in releases. I don't think I would
try to get Maven to actually build the Docker image, as that would
impose quite strong OS requirements.

Presumably - if Jena were to upload such an image officially - it would
then have to be voted on as (part of) a release, even though it
would primarily contain the fuseki2 dist.


Would there be licensing issues over the docker image depending on
Linux and OpenJDK (or Oracle JDK)?
Docker folks seem to just not worry much about licensing :-/

There could be some issues with layering - my image now contains one
layer which adds both pwgen (GPL) and Fuseki (Apache 2.0) - but that's
easy enough to split.

(or FreeBSD to the rescue!?)


There's also this docker image by Tazro Inutano Ohta:
  https://registry.hub.docker.com/u/inutano/jena/dockerfile/
which simply provides the official Jena distribution under
  /apache-jena-2.12.1/

With a bit more work this could allow usage of the tdb commands as-is.

-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718
