Dear all (especially Raffaele),

I have now finished implementing a bulk-loading API for quickly dumping
big RDF datasets into a KiWi/Marmotta triplestore. All code is located
in libraries/kiwi/kiwi-loader. There is a command line tool (implemented
mostly by Jakob) called KiWiLoader as well as an API implementation.
Since it is probably more relevant for the rest of you, I'll explain the
API implementation in the following:

Bulk loading is implemented as a Sesame RDFHandler that bypasses the
Repository and SAIL APIs entirely, to avoid their additional overhead
when importing. This means that you can use the bulk loading API with any
Sesame component that accepts an RDFHandler, most importantly the RIO
API. The code in KiWiHandlerTest illustrates its use:

    // choose the dialect-specific handler where available
    KiWiHandler handler;
    if(dialect instanceof PostgreSQLDialect) {
        handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
    } else if(dialect instanceof MySQLDialect) {
        handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
    } else {
        handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
    }

    // bulk import: parse the input stream and push statements to the handler
    RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
    parser.setRDFHandler(handler);
    parser.parse(in, baseUri);

KiWiHandler implementations process statements in a streaming manner and
write them directly into the database. Since, for performance reasons,
this is done without some of the checks implemented in the normal
repository, you should not run this process in parallel with other
processes or threads operating on the same triple store (maybe I will
implement database locking or something similar to prevent this).

As you can also see, there are specialized handler implementations for
PostgreSQL and MySQL. These make use of the bulk-loading constructs
offered by PostgreSQL (COPY IN) and MySQL (LOAD DATA LOCAL INFILE) and
are implemented by generating an in-memory CSV stream that is sent
directly to the database connection. They also disable indexes before
the import and re-enable them once it has finished. To the best of my
knowledge, this is the fastest we can get with the current data model.
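
For illustration, this is roughly the COPY mechanism the PostgreSQL
handler builds on; a minimal sketch using the CopyManager API of the
PostgreSQL JDBC driver, where the table name and columns are made up and
do not reflect the actual KiWi schema:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    // stream an in-memory CSV block straight into a table via COPY ... FROM STDIN
    Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/marmotta", "marmotta", "marmotta");
    String csv = "http://example.org/s,http://example.org/p,http://example.org/o\n";

    CopyManager copy = new CopyManager((BaseConnection) con);
    long rows = copy.copyIn(
            "COPY triples (subject,predicate,object) FROM STDIN WITH CSV",
            new StringReader(csv));
    con.close();

Since COPY bypasses the normal INSERT path, it avoids a large part of the
per-statement overhead, which is why the dialect-specific handlers are
noticeably faster than the generic one.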

I currently have a process running that imports the whole Freebase dump
into a PostgreSQL database backend. The server is powerful but uses an
ordinary hard disk (no SSD). So far (after 26 hours and 600 million
triples) it seems to work reliably, with an average throughput of
6,000-7,000 triples/sec and peaks of up to 30,000 triples/sec. Unlike
other triple stores such as BigData, throughput remains more or less
constant even as the size of the database increases - which shows the
power of the relational database backend. For your information, I have
attached a diagram showing some performance statistics over time.

Take these figures with a grain of salt: such statistics depend heavily
on the way the import data is structured, e.g. whether it has high
locality (triples ordered by subject) and how large the literals are on
average. Freebase typically includes quite large literals (e.g. Wikipedia
abstracts).

I'd be glad if those of you with big datasets (Raffaele?) could play
around a bit with this implementation, especially for MySQL, which I did
not test extensively (just the unit test).

A typical call of KiWiLoader would be:

java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar \
     -S /tmp/loader.png -D postgresql \
     -d jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 \
     -U marmotta -P marmotta \
     -z -f text/turtle -i freebase-rdf-2013-11-10-00-00_fixed.gz

The option "-S" enables statistics sampling and creates diagrams similar
to the one I attached. "-z"/"-j" select gzip/bzip2 compressed input,
"-f" specifies input format, "-i" the input file, "-D" the database
dialect. Rest is self-explanatory ;-)

My next step is to look again at querying (API and especially SPARQL) to
see how we can improve performance once such big datasets have been
loaded ;-)

Greetings,

Sebastian

-- 
Dr. Sebastian Schaffert
Chief Technology Officer
Redlink GmbH
