Hi all,

after some fixes and tests, I have noticed a problem with MARMOTTA-352 <https://issues.apache.org/jira/browse/MARMOTTA-352>, attached to MARMOTTA-245. In my tests with MySQL the procedure starts well, and after less than a minute it performs the first commit with the following statistics:

"imported 100,0 K triples; statistics: 2.036/sec, 2.314/sec (last min), 2.314/sec (last hour)"

I am using jamendo.rdf.gz for my tests (http://dbtune.org/jamendo/). After that the import slows down, most likely because of a very poor node cache hit rate; I suspect there is a problem with EhCache.

Attached is the statistics diagram.

Cheers,
Raffaele.
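For anyone who wants to double-check the cache behaviour, the snippet below is a minimal sketch of how the hit/miss counters could be dumped from inside the importing JVM. It assumes the Ehcache 2.6-style statistics API (Cache.getStatistics() returning net.sf.ehcache.Statistics) and that the KiWi caches are registered with the default CacheManager; both are assumptions on my side, not something I verified in the Marmotta code.

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Statistics;

    public class CacheHitReport {
        public static void main(String[] args) {
            // assumption: the KiWi caches live in the default CacheManager;
            // statistics collection may need to be enabled in the cache configuration
            CacheManager manager = CacheManager.getInstance();
            for (String name : manager.getCacheNames()) {
                Cache cache = manager.getCache(name);
                Statistics stats = cache.getStatistics();   // Ehcache 2.6-style API
                long hits = stats.getCacheHits();
                long misses = stats.getCacheMisses();
                double ratio = (hits + misses) == 0 ? 0.0 : (double) hits / (hits + misses);
                System.out.printf("%s: %d hits, %d misses, hit ratio %.2f%n",
                        name, hits, misses, ratio);
            }
        }
    }

If the node cache hit ratio really collapses after the first commit, that would be consistent with the slowdown visible in the diagram.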
On 22 November 2013 16:49, Raffaele Palmieri <[email protected]> wrote:

> Dear all,
> first of all I'm happy that Marmotta is now a top-level Apache project;
> congratulations to all of you who worked very hard over the last months.
> For Sebastian: I can certainly test the MySQL implementation. From your
> tests I gather that the database backend is better for storage than the
> BigData triple store, so the first part of issue MARMOTTA-245 could be
> closed, clearly after positive tests with the supported DBs, at least for
> the storage part, using DB-native methods for bulk import.
> I don't see any attached diagram about performance; is it elsewhere?
> Best,
> Raffaele.
>
>
> On 22 November 2013 15:38, Sebastian Schaffert <[email protected]> wrote:
>
>> Dear all (especially Raffaele),
>>
>> I have now finished implementing a bulk-loading API for quickly dumping
>> big RDF datasets into a KiWi/Marmotta triple store. All code is located
>> in libraries/kiwi/kiwi-loader. There is a command-line tool (implemented
>> mostly by Jakob) called KiWiLoader as well as an API implementation.
>> Since it is probably more relevant for the rest of you, I'll explain the
>> API implementation in the following:
>>
>> Bulk loading is implemented as a Sesame RDFHandler without the whole
>> Repository or SAIL API, to avoid additional overhead when importing. This
>> means that you can use the bulk-loading API with any Sesame component
>> that takes an RDFHandler, most importantly the RIO API. The code in
>> KiWiHandlerTest illustrates its use:
>>
>> KiWiHandler handler;
>> if(dialect instanceof PostgreSQLDialect) {
>>     handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
>> } else if(dialect instanceof MySQLDialect) {
>>     handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
>> } else {
>>     handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
>> }
>>
>> // bulk import
>> RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
>> parser.setRDFHandler(handler);
>> parser.parse(in, baseUri);
>>
>> The KiWiHandler implementations process statements in a streaming manner
>> and dump them directly into the database. Since, for performance reasons,
>> this is done without some of the checks implemented in the normal
>> repository, you should not run this process in parallel with other
>> processes or threads operating on the same triple store (maybe I will
>> implement database locking or so to avoid it).
>>
>> You can also see that there are specialized handler implementations for
>> PostgreSQL and MySQL. These make use of the bulk-loading constructs
>> offered by PostgreSQL (COPY IN) and MySQL (LOAD LOCAL INFILE) and are
>> implemented by generating an in-memory CSV stream that is sent directly
>> to the database connection. They also disable indexes before the import
>> and re-enable them when the import is finished. To the best of my
>> knowledge, this is the fastest we can get with the current data model.
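For illustration, the sketch below shows the general COPY-based mechanism described above, using the PostgreSQL JDBC driver's CopyManager. It is purely illustrative: the table name and columns are invented for the example and this is not the actual KiWiPostgresHandler code.

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyInSketch {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/test", "marmotta", "marmotta");
            try {
                // build a small CSV batch in memory, one row per triple
                // (hypothetical demo_triples table, not the KiWi schema)
                String csv = "http://example.org/s,http://example.org/p,\"a literal\"\n";

                CopyManager copy = ((PGConnection) con).getCopyAPI();
                long rows = copy.copyIn(
                        "COPY demo_triples (subject, predicate, object) FROM STDIN WITH CSV",
                        new StringReader(csv));
                System.out.println("copied " + rows + " rows");
            } finally {
                con.close();
            }
        }
    }

The point is that COPY consumes the CSV stream directly over the existing connection, so there is no per-row INSERT overhead and no temporary file on disk.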
>> I currently have a process running that imports the whole Freebase dump
>> into a PostgreSQL database backend. The server is powerful but uses an
>> ordinary hard disk (no SSD). Up to now it seems to work reliably, with
>> (after 26 hours and 600 million triples) an average throughput of
>> 6.000-7.000 triples/sec and peaks of up to 30.000 triples/sec. Unlike
>> other triple stores like BigData, throughput remains more or less
>> constant even when the size of the database increases - which shows the
>> power of the relational database backend. For your information, I have
>> attached a diagram showing some performance statistics over time.
>>
>> Take these figures with a grain of salt: such statistics depend heavily
>> on the way the import data is structured, e.g. whether it has high
>> locality (triples ordered by subject) and how large the literals are on
>> average. Freebase typically includes quite large literals (e.g. Wikipedia
>> abstracts).
>>
>> I'd be glad if those of you with big datasets (Raffaele?) could play
>> around a bit with this implementation, especially for MySQL, which I did
>> not test extensively (just the unit test).
>>
>> A typical call of KiWiLoader would be:
>>
>> java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar \
>>     -S /tmp/loader.png -D postgresql \
>>     -d jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 \
>>     -U marmotta -P marmotta -z -f text/turtle \
>>     -i freebase-rdf-2013-11-10-00-00_fixed.gz
>>
>> The option "-S" enables statistics sampling and creates diagrams similar
>> to the one I attached. "-z"/"-j" select gzip/bzip2-compressed input,
>> "-f" specifies the input format, "-i" the input file, and "-D" the
>> database dialect. The rest is self-explanatory ;-)
>>
>> My next step is to look again at querying (the API and especially SPARQL)
>> to see how we can improve performance once such big datasets have been
>> loaded ;-)
>>
>> Greetings,
>>
>> Sebastian
>>
>> --
>> Dr. Sebastian Schaffert
>> Chief Technology Officer
>> Redlink GmbH
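Regarding Sebastian's request to exercise the MySQL path: driving the bulk loader from code on a gzip-compressed dump would look roughly like the sketch below. It follows the KiWiHandlerTest snippet quoted above and assumes a KiWiStore ("store") initialised elsewhere; my jamendo.rdf.gz test file is RDF/XML, hence the parser format.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.RDFParser;
    import org.openrdf.rio.Rio;

    // plus KiWiHandler, KiWiMySQLHandler and KiWiLoaderConfiguration
    // from the kiwi-loader module

    KiWiHandler handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());

    // wrap the gzip-compressed RDF/XML dump and stream it through RIO
    InputStream in = new GZIPInputStream(new FileInputStream("jamendo.rdf.gz"));
    try {
        RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
        parser.setRDFHandler(handler);
        parser.parse(in, "http://dbtune.org/jamendo/");
    } finally {
        in.close();
    }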
