Dear all,

First of all, I'm happy that Marmotta is now a top-level Apache project; congratulations to everyone who worked so hard over the last months.

For Sebastian: sure, I can test the MySQL implementation. From your tests it appears that the relational database backend handles storage better than the BigData triple store, so the first part of issue MARMOTTA-245 could be closed, at least for the storage part using the databases' native bulk-import methods, once there are positive tests with the supported DBs. I don't see the attached performance diagram; is it available elsewhere?

Best,
Raffaele.
On 22 November 2013 15:38, Sebastian Schaffert <[email protected]> wrote:

> Dear all (especially Raffaele),
>
> I have now finished implementing a bulk-loading API for quickly dumping big RDF datasets into a KiWi/Marmotta triplestore. All code is located in libraries/kiwi/kiwi-loader. There is a command line tool (implemented mostly by Jakob) called KiWiLoader as well as an API implementation. Since it is probably more relevant for the rest of you, I'll explain the API implementation in the following:
>
> Bulk loading is implemented as a Sesame RDFHandler without the whole Repository or SAIL API, to avoid additional overhead when importing. This means that you can use the bulk loading API with any Sesame component that takes an RDFHandler, most importantly the RIO API. The code in the KiWiHandlerTest illustrates its use:
>
> KiWiHandler handler;
> if(dialect instanceof PostgreSQLDialect) {
>     handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
> } else if(dialect instanceof MySQLDialect) {
>     handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
> } else {
>     handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
> }
>
> // bulk import
> RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
> parser.setRDFHandler(handler);
> parser.parse(in, baseUri);
>
> The KiWiHandler implementations process statements in a streaming manner and dump them directly into the database. Since for performance reasons this is done without some of the checks implemented in the normal repository, you should not run this process in parallel with other processes or threads operating on the same triple store (maybe I will implement database locking or similar to avoid this).
>
> What you can also see is that for PostgreSQL and MySQL there are specialized handler implementations. These make use of the special bulk-loading constructs offered by PostgreSQL (COPY IN) and MySQL (LOAD LOCAL INFILE) and are implemented by generating an in-memory CSV stream that is sent directly to the database connection. They also disable indexes before the import and re-enable them when the import is finished. To the best of my knowledge, this is the fastest we can get with the current data model.
>
> I currently have a process running that imports the whole Freebase dump into a PostgreSQL database backend. The server is powerful but uses an ordinary hard disk (no SSD). So far (after 26 hours and 600 million triples) it seems to work reliably, with an average throughput of 6,000-7,000 triples/sec and peaks of up to 30,000 triples/sec. Unlike other triple stores such as BigData, throughput remains more or less constant even as the size of the database increases, which shows the power of the relational database backend. For your information, I have attached a diagram showing some performance statistics over time.
>
> Take these figures with a grain of salt: such statistics depend heavily on the way the import data is structured, e.g. whether it has high locality (triples ordered by subject) and how large the literals are on average. Freebase typically includes quite large literals (e.g. Wikipedia abstracts).
>
> I'd be glad if those of you with big datasets (Raffaele?) could play around a bit with this implementation, especially for MySQL, which I did not test extensively (just the unit test).
> A typical call of KiWiLoader would be:
>
> java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar -S /tmp/loader.png -D postgresql -d jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 -U marmotta -P marmotta -z -f text/turtle -i freebase-rdf-2013-11-10-00-00_fixed.gz
>
> The option "-S" enables statistics sampling and creates diagrams similar to the one I attached. "-z"/"-j" select gzip/bzip2 compressed input, "-f" specifies the input format, "-i" the input file, and "-D" the database dialect. The rest is self-explanatory ;-)
>
> My next step is to look again at querying (the API and especially SPARQL) to see how we can improve performance once such big datasets have been loaded ;-)
>
> Greetings,
>
> Sebastian
>
> --
> Dr. Sebastian Schaffert
> Chief Technology Officer
> Redlink GmbH
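
For anyone who wants to try this against their own dumps, here is a minimal, self-contained sketch of the API usage shown in the quoted mail, extended with gzip decompression of the input file (mirroring the loader's "-z" option). Only the handler/parser wiring is taken from Sebastian's snippet; the class and method names, the import paths, and the way the KiWiStore and dialect are obtained are my assumptions, and any additional setup/teardown the handlers may need is not shown.

// Sketch only: class/method names and import paths are assumptions, not taken from the mail.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.marmotta.kiwi.loader.KiWiLoaderConfiguration;
import org.apache.marmotta.kiwi.loader.generic.KiWiHandler;
import org.apache.marmotta.kiwi.loader.mysql.KiWiMySQLHandler;
import org.apache.marmotta.kiwi.loader.pgsql.KiWiPostgresHandler;
import org.apache.marmotta.kiwi.persistence.KiWiDialect;
import org.apache.marmotta.kiwi.persistence.mysql.MySQLDialect;
import org.apache.marmotta.kiwi.persistence.pgsql.PostgreSQLDialect;
import org.apache.marmotta.kiwi.sail.KiWiStore;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;

public class BulkImportSketch {

    /**
     * Streams a gzip-compressed Turtle dump directly into the given KiWiStore,
     * picking the dialect-specific handler where one is available.
     */
    public static void bulkImport(KiWiStore store, KiWiDialect dialect,
                                  String gzippedTurtleFile, String baseUri) throws Exception {
        KiWiLoaderConfiguration config = new KiWiLoaderConfiguration();

        // use the specialized handler for PostgreSQL/MySQL, fall back to the generic one
        KiWiHandler handler;
        if (dialect instanceof PostgreSQLDialect) {
            handler = new KiWiPostgresHandler(store, config);
        } else if (dialect instanceof MySQLDialect) {
            handler = new KiWiMySQLHandler(store, config);
        } else {
            handler = new KiWiHandler(store, config);
        }

        // RIO parses the dump as a stream and hands each statement to the handler
        try (InputStream in = new GZIPInputStream(new FileInputStream(gzippedTurtleFile))) {
            RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
            parser.setRDFHandler(handler);
            parser.parse(in, baseUri);
        }
    }
}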
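
To make the PostgreSQL path described in the quoted mail a bit more concrete: the specialized handler's trick is to turn incoming statements into an in-memory CSV stream and push it through the driver's COPY API instead of issuing row-by-row INSERTs. The toy example below illustrates only that mechanism, not the actual KiWiPostgresHandler; the connection URL, the credentials, and the demo_nodes table with its two columns are made-up placeholders.

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyInSketch {

    public static void main(String[] args) throws Exception {
        // placeholder connection details
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/marmotta", "marmotta", "marmotta")) {

            // a tiny in-memory CSV batch; the real loader builds such batches
            // from the incoming RDF statements for the actual KiWi tables
            String csv = "1,http://example.org/subject\n"
                       + "2,http://example.org/predicate\n";

            // COPY ... FROM STDIN reads the CSV straight from the stream,
            // avoiding one INSERT round-trip per row;
            // demo_nodes is a made-up placeholder table, not a real KiWi table
            CopyManager copy = ((PGConnection) con).getCopyAPI();
            long rows = copy.copyIn(
                    "COPY demo_nodes (id, uri) FROM STDIN WITH CSV",
                    new StringReader(csv));

            System.out.println("copied " + rows + " rows");
        }
    }
}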
