Hi all,

after some fixes and tests, I have noticed a problem with MARMOTTA-352 <https://issues.apache.org/jira/browse/MARMOTTA-352>, attached to MARMOTTA-245. In my tests with MySQL the procedure starts well, and after less than a minute it performs the first commit with the following statistics:

"imported 100,0 K triples; statistics: 2.036/sec, 2.314/sec (last min), 2.314/sec (last hour)"

I am using jamendo.rdf.gz for my tests (http://dbtune.org/jamendo/). After that the import slows down, most likely because of a very poor node cache hit rate; I suspect there is a problem with EhCache.

Attached is the statistics diagram.

Cheers,
Raffaele.
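For anyone who wants to double-check the cache behaviour, the snippet below is a minimal sketch of how the hit/miss counters could be dumped from inside the importing JVM. It assumes the Ehcache 2.6-style statistics API (Cache.getStatistics() returning net.sf.ehcache.Statistics) and that the KiWi caches are registered with the default CacheManager; both are assumptions on my side, not something I verified in the Marmotta code.

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Statistics;

    public class CacheHitReport {
        public static void main(String[] args) {
            // assumption: the KiWi caches live in the default CacheManager;
            // statistics collection may need to be enabled in the cache configuration
            CacheManager manager = CacheManager.getInstance();
            for (String name : manager.getCacheNames()) {
                Cache cache = manager.getCache(name);
                Statistics stats = cache.getStatistics();   // Ehcache 2.6-style API
                long hits = stats.getCacheHits();
                long misses = stats.getCacheMisses();
                double ratio = (hits + misses) == 0 ? 0.0 : (double) hits / (hits + misses);
                System.out.printf("%s: %d hits, %d misses, hit ratio %.2f%n",
                        name, hits, misses, ratio);
            }
        }
    }

If the node cache hit ratio really collapses after the first commit, that would be consistent with the slowdown visible in the diagram.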
On 22 November 2013 16:49, Raffaele Palmieri <[email protected]> wrote:

> Dear all,
> first of all I'm happy that Marmotta is now a top-level Apache project;
> congratulations to all of you who worked very hard over the last months.
> For Sebastian: I can certainly test the MySQL implementation. From your
> tests I gather that the database backend is better for storage than the
> BigData triple store, so the first part of issue MARMOTTA-245 could be
> closed, clearly after positive tests with the supported DBs, at least for
> the storage part, using DB-native methods for bulk import.
> I don't see any attached diagram about performance; is it elsewhere?
> Best,
> Raffaele.
>
>
> On 22 November 2013 15:38, Sebastian Schaffert <[email protected]> wrote:
>
>> Dear all (especially Raffaele),
>>
>> I have now finished implementing a bulk-loading API for quickly dumping
>> big RDF datasets into a KiWi/Marmotta triple store. All code is located
>> in libraries/kiwi/kiwi-loader. There is a command-line tool (implemented
>> mostly by Jakob) called KiWiLoader as well as an API implementation.
>> Since it is probably more relevant for the rest of you, I'll explain the
>> API implementation in the following:
>>
>> Bulk loading is implemented as a Sesame RDFHandler without the whole
>> Repository or SAIL API, to avoid additional overhead when importing. This
>> means that you can use the bulk-loading API with any Sesame component
>> that takes an RDFHandler, most importantly the RIO API. The code in
>> KiWiHandlerTest illustrates its use:
>>
>> KiWiHandler handler;
>> if(dialect instanceof PostgreSQLDialect) {
>>     handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
>> } else if(dialect instanceof MySQLDialect) {
>>     handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
>> } else {
>>     handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
>> }
>>
>> // bulk import
>> RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
>> parser.setRDFHandler(handler);
>> parser.parse(in, baseUri);
>>
>> The KiWiHandler implementations process statements in a streaming manner
>> and dump them directly into the database. Since, for performance reasons,
>> this is done without some of the checks implemented in the normal
>> repository, you should not run this process in parallel with other
>> processes or threads operating on the same triple store (maybe I will
>> implement database locking or so to avoid it).
>>
>> You can also see that there are specialized handler implementations for
>> PostgreSQL and MySQL. These make use of the bulk-loading constructs
>> offered by PostgreSQL (COPY IN) and MySQL (LOAD LOCAL INFILE) and are
>> implemented by generating an in-memory CSV stream that is sent directly
>> to the database connection. They also disable indexes before the import
>> and re-enable them when the import is finished. To the best of my
>> knowledge, this is the fastest we can get with the current data model.
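For illustration, the sketch below shows the general COPY-based mechanism described above, using the PostgreSQL JDBC driver's CopyManager. It is purely illustrative: the table name and columns are invented for the example and this is not the actual KiWiPostgresHandler code.

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyInSketch {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/test", "marmotta", "marmotta");
            try {
                // build a small CSV batch in memory, one row per triple
                // (hypothetical demo_triples table, not the KiWi schema)
                String csv = "http://example.org/s,http://example.org/p,\"a literal\"\n";

                CopyManager copy = ((PGConnection) con).getCopyAPI();
                long rows = copy.copyIn(
                        "COPY demo_triples (subject, predicate, object) FROM STDIN WITH CSV",
                        new StringReader(csv));
                System.out.println("copied " + rows + " rows");
            } finally {
                con.close();
            }
        }
    }

The point is that COPY consumes the CSV stream directly over the existing connection, so there is no per-row INSERT overhead and no temporary file on disk.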
>> I currently have a process running that imports the whole Freebase dump
>> into a PostgreSQL database backend. The server is powerful but uses an
>> ordinary hard disk (no SSD). Up to now it seems to work reliably, with
>> (after 26 hours and 600 million triples) an average throughput of
>> 6.000-7.000 triples/sec and peaks of up to 30.000 triples/sec. Unlike
>> other triple stores like BigData, throughput remains more or less
>> constant even when the size of the database increases - which shows the
>> power of the relational database backend. For your information, I have
>> attached a diagram showing some performance statistics over time.
>>
>> Take these figures with a grain of salt: such statistics depend heavily
>> on the way the import data is structured, e.g. whether it has high
>> locality (triples ordered by subject) and how large the literals are on
>> average. Freebase typically includes quite large literals (e.g. Wikipedia
>> abstracts).
>>
>> I'd be glad if those of you with big datasets (Raffaele?) could play
>> around a bit with this implementation, especially for MySQL, which I did
>> not test extensively (just the unit test).
>>
>> A typical call of KiWiLoader would be:
>>
>> java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar \
>>     -S /tmp/loader.png -D postgresql \
>>     -d jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 \
>>     -U marmotta -P marmotta -z -f text/turtle \
>>     -i freebase-rdf-2013-11-10-00-00_fixed.gz
>>
>> The option "-S" enables statistics sampling and creates diagrams similar
>> to the one I attached. "-z"/"-j" select gzip/bzip2-compressed input,
>> "-f" specifies the input format, "-i" the input file, and "-D" the
>> database dialect. The rest is self-explanatory ;-)
>>
>> My next step is to look again at querying (the API and especially SPARQL)
>> to see how we can improve performance once such big datasets have been
>> loaded ;-)
>>
>> Greetings,
>>
>> Sebastian
>>
>> --
>> Dr. Sebastian Schaffert
>> Chief Technology Officer
>> Redlink GmbH
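Regarding Sebastian's request to exercise the MySQL path: driving the bulk loader from code on a gzip-compressed dump would look roughly like the sketch below. It follows the KiWiHandlerTest snippet quoted above and assumes a KiWiStore ("store") initialised elsewhere; my jamendo.rdf.gz test file is RDF/XML, hence the parser format.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.RDFParser;
    import org.openrdf.rio.Rio;

    // plus KiWiHandler, KiWiMySQLHandler and KiWiLoaderConfiguration
    // from the kiwi-loader module

    KiWiHandler handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());

    // wrap the gzip-compressed RDF/XML dump and stream it through RIO
    InputStream in = new GZIPInputStream(new FileInputStream("jamendo.rdf.gz"));
    try {
        RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
        parser.setRDFHandler(handler);
        parser.parse(in, "http://dbtune.org/jamendo/");
    } finally {
        in.close();
    }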
