Dear all,

First of all, I'm happy that Marmotta is now a top-level Apache project; congratulations to everyone who worked so hard over the last months.

For Sebastian: sure, I can test the MySQL implementation. From your tests it appears that the relational database backend handles storage better than the BigData triple store, so the first part of issue MARMOTTA-245 could be closed, at least for the storage part using the databases' native bulk-import methods, once there are positive tests with the supported DBs. I don't see the attached performance diagram; is it available elsewhere?

Best,
Raffaele.
On 22 November 2013 15:38, Sebastian Schaffert <[email protected]> wrote:

> Dear all (especially Raffaele),
>
> I have now finished implementing a bulk-loading API for quickly dumping big RDF datasets into a KiWi/Marmotta triplestore. All code is located in libraries/kiwi/kiwi-loader. There is a command line tool (implemented mostly by Jakob) called KiWiLoader as well as an API implementation. Since it is probably more relevant for the rest of you, I'll explain the API implementation in the following:
>
> Bulk loading is implemented as a Sesame RDFHandler without the whole Repository or SAIL API, to avoid additional overhead when importing. This means that you can use the bulk loading API with any Sesame component that takes an RDFHandler, most importantly the RIO API. The code in the KiWiHandlerTest illustrates its use:
>
> KiWiHandler handler;
> if(dialect instanceof PostgreSQLDialect) {
>     handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
> } else if(dialect instanceof MySQLDialect) {
>     handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
> } else {
>     handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
> }
>
> // bulk import
> RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
> parser.setRDFHandler(handler);
> parser.parse(in, baseUri);
>
> The KiWiHandler implementations process statements in a streaming manner and dump them directly into the database. Since for performance reasons this is done without some of the checks implemented in the normal repository, you should not run this process in parallel with other processes or threads operating on the same triple store (maybe I will implement database locking or similar to avoid this).
>
> What you can also see is that for PostgreSQL and MySQL there are specialized handler implementations. These make use of the special bulk-loading constructs offered by PostgreSQL (COPY IN) and MySQL (LOAD LOCAL INFILE) and are implemented by generating an in-memory CSV stream that is sent directly to the database connection. They also disable indexes before the import and re-enable them when the import is finished. To the best of my knowledge, this is the fastest we can get with the current data model.
>
> I currently have a process running that imports the whole Freebase dump into a PostgreSQL database backend. The server is powerful but uses an ordinary hard disk (no SSD). So far (after 26 hours and 600 million triples) it seems to work reliably, with an average throughput of 6,000-7,000 triples/sec and peaks of up to 30,000 triples/sec. Unlike other triple stores such as BigData, throughput remains more or less constant even as the size of the database increases, which shows the power of the relational database backend. For your information, I have attached a diagram showing some performance statistics over time.
>
> Take these figures with a grain of salt: such statistics depend heavily on the way the import data is structured, e.g. whether it has high locality (triples ordered by subject) and how large the literals are on average. Freebase typically includes quite large literals (e.g. Wikipedia abstracts).
>
> I'd be glad if those of you with big datasets (Raffaele?) could play around a bit with this implementation, especially for MySQL, which I did not test extensively (just the unit test).
> A typical call of KiWiLoader would be:
>
> java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar -S /tmp/loader.png -D postgresql -d jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 -U marmotta -P marmotta -z -f text/turtle -i freebase-rdf-2013-11-10-00-00_fixed.gz
>
> The option "-S" enables statistics sampling and creates diagrams similar to the one I attached. "-z"/"-j" select gzip/bzip2 compressed input, "-f" specifies the input format, "-i" the input file, and "-D" the database dialect. The rest is self-explanatory ;-)
>
> My next step is to look again at querying (the API and especially SPARQL) to see how we can improve performance once such big datasets have been loaded ;-)
>
> Greetings,
>
> Sebastian
>
> --
> Dr. Sebastian Schaffert
> Chief Technology Officer
> Redlink GmbH
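
For anyone who wants to try this against their own dumps, here is a minimal, self-contained sketch of the API usage shown in the quoted mail, extended with gzip decompression of the input file (mirroring the loader's "-z" option). Only the handler/parser wiring is taken from Sebastian's snippet; the class and method names, the import paths, and the way the KiWiStore and dialect are obtained are my assumptions, and any additional setup/teardown the handlers may need is not shown.

// Sketch only: class/method names and import paths are assumptions, not taken from the mail.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.marmotta.kiwi.loader.KiWiLoaderConfiguration;
import org.apache.marmotta.kiwi.loader.generic.KiWiHandler;
import org.apache.marmotta.kiwi.loader.mysql.KiWiMySQLHandler;
import org.apache.marmotta.kiwi.loader.pgsql.KiWiPostgresHandler;
import org.apache.marmotta.kiwi.persistence.KiWiDialect;
import org.apache.marmotta.kiwi.persistence.mysql.MySQLDialect;
import org.apache.marmotta.kiwi.persistence.pgsql.PostgreSQLDialect;
import org.apache.marmotta.kiwi.sail.KiWiStore;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;

public class BulkImportSketch {

    /**
     * Streams a gzip-compressed Turtle dump directly into the given KiWiStore,
     * picking the dialect-specific handler where one is available.
     */
    public static void bulkImport(KiWiStore store, KiWiDialect dialect,
                                  String gzippedTurtleFile, String baseUri) throws Exception {
        KiWiLoaderConfiguration config = new KiWiLoaderConfiguration();

        // use the specialized handler for PostgreSQL/MySQL, fall back to the generic one
        KiWiHandler handler;
        if (dialect instanceof PostgreSQLDialect) {
            handler = new KiWiPostgresHandler(store, config);
        } else if (dialect instanceof MySQLDialect) {
            handler = new KiWiMySQLHandler(store, config);
        } else {
            handler = new KiWiHandler(store, config);
        }

        // RIO parses the dump as a stream and hands each statement to the handler
        try (InputStream in = new GZIPInputStream(new FileInputStream(gzippedTurtleFile))) {
            RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
            parser.setRDFHandler(handler);
            parser.parse(in, baseUri);
        }
    }
}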
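
To make the PostgreSQL path described in the quoted mail a bit more concrete: the specialized handler's trick is to turn incoming statements into an in-memory CSV stream and push it through the driver's COPY API instead of issuing row-by-row INSERTs. The toy example below illustrates only that mechanism, not the actual KiWiPostgresHandler; the connection URL, the credentials, and the demo_nodes table with its two columns are made-up placeholders.

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyInSketch {

    public static void main(String[] args) throws Exception {
        // placeholder connection details
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/marmotta", "marmotta", "marmotta")) {

            // a tiny in-memory CSV batch; the real loader builds such batches
            // from the incoming RDF statements for the actual KiWi tables
            String csv = "1,http://example.org/subject\n"
                       + "2,http://example.org/predicate\n";

            // COPY ... FROM STDIN reads the CSV straight from the stream,
            // avoiding one INSERT round-trip per row;
            // demo_nodes is a made-up placeholder table, not a real KiWi table
            CopyManager copy = ((PGConnection) con).getCopyAPI();
            long rows = copy.copyIn(
                    "COPY demo_nodes (id, uri) FROM STDIN WITH CSV",
                    new StringReader(csv));

            System.out.println("copied " + rows + " rows");
        }
    }
}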
