Hi Raffaele,

thanks for your ideas. I have been spending a lot of time thinking about how to improve the performance of bulk imports. There are currently several reasons why a bulk import is slow:

1) Marmotta uses (database) transactions to ensure correct behaviour and consistent data in highly parallel environments; transactions, however, introduce a big performance penalty, especially when they get long (because the database needs to keep a journal and merge it at the end)

2) Marmotta needs to check, before creating a node or triple, whether this node or triple already exists, because you don't want duplicates

3) Marmotta needs to issue a separate SQL command for every inserted triple (because of 2)
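To illustrate what 3) means on the JDBC level: once the node ids are already known, batching the triple inserts would roughly look like the sketch below (the table and column names are made up for illustration, not our actual schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertSketch {
        public static void main(String[] args) throws Exception {
            // hypothetical connection; any JDBC database would do for this sketch
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:marmotta");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)")) {
                conn.setAutoCommit(false);
                // node ids assumed to be resolved already - this is exactly the hard part
                long[][] triples = { {1, 2, 3}, {4, 2, 5}, {6, 2, 7} };
                for (long[] t : triples) {
                    ps.setLong(1, t[0]);
                    ps.setLong(2, t[1]);
                    ps.setLong(3, t[2]);
                    ps.addBatch();       // queue the statement instead of executing it
                }
                ps.executeBatch();       // send the whole batch to the database at once
                conn.commit();
            }
        }
    }

The whole batch is sent to the database in one go, which is why it only pays off once the per-triple existence checks are out of the way.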
3) could be addressed as you suggest, but even the plain Java JDBC API offers "batch commands" that would improve performance, i.e. if you manage to run the same statement many times in a row, it will be executed much more efficiently. Unfortunately, I was not able to do this because I don't have a good solution for 2). 3) depends on 2), because for every inserted triple I need to check whether the nodes already exist, so there will be select statements before the insert statements.

2) is a really tricky issue, because the check is needed to ensure data integrity. I have been thinking about different options here. Keep in mind that two tables are affected (triples and nodes) and both need to be handled in a different way:

- if you know that the *triples* do not yet exist (e.g. an empty database, or the user assures that they do not exist), you can avoid the check for triple existence, but the node check is still needed because several triples might refer to the same node
- if the dataset is reasonably small, you can implement the node check using an in-memory hashtable, which would be very fast; unfortunately you don't know this in advance, and once a node exists the Marmotta caching backend takes care of it anyway as long as Marmotta has memory, so the expensive part is checking for non-existence rather than for existence
- you could also use a persistent hash map (like MapDB) to keep track of the node ids, but I doubt it would give you much benefit over the database lookup once the dataset is big

Even if you implement this solution, you would need a two-pass import to achieve bulk-load behaviour in the database, because two tables are affected, i.e. in the first pass you would import only the nodes, and in the second pass only the triples. Another possibility is to relax the data integrity constraints a bit (e.g. allowing the same node to exist with different IDs), but I cannot foresee the consequences of such a choice - it goes against the data model.

1) is easy to solve by putting Marmotta into some kind of "maintenance mode", i.e. when bulk importing there is an exclusive lock on the database for the import process. Another (similar) solution is to provide a separate command-line tool for importing directly into the database while Marmotta is not running at all.

The solution I was going to implement as a result of this thinking is as follows:

- a separate command-line tool that accesses the database directly
- when importing, all nodes and triples are first created only in memory and stored in standard Java data structures (or in a temporary log on the file system)
- when the import is finished, all nodes are bulk-inserted first, and the Java objects get their IDs
- second, all triples are bulk-imported with the proper node ids (see the sketch at the end of this mail)

If you want to try out different solutions, I'd be happy if this problem can be solved ;-)

Greetings,

Sebastian

2013/5/21 Raffaele Palmieri <raffaele.palmi...@gmail.com>

> Hi to all,
> I would like to propose a small change to the architecture of the Importer Service.
> Currently, for every triple there are single SQL commands invoked from
> SailConnectionBase that persist the triple information in the DB. That's probably
> one of the major causes of the slow import operation.
> I have thought of a way to optimize that operation, building for example a csv,
> tsv, or *sv file that most RDBMS are able to import in an
> optimized way.
> For example, MySQL has the Load Data Infile command, PostgreSQL has the Copy
> command, H2 has Insert into ... Select from Csvread.
> I am checking whether this modification is feasible; it will surely need a
> specialization of the SQL dialect depending on the RDBMS used.
> What do you think about it? Would it have too much impact?
> Regards,
> Raffaele.
>
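P.S.: to make the RDBMS-specific bulk load concrete, here is a rough sketch of the PostgreSQL variant, using the CopyManager API from the PostgreSQL JDBC driver (the staging table, its columns and the CSV content are made up for illustration). This would correspond to the second pass of the plan above, once the node ids are resolved and written to a CSV/TSV file:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class PgCopyImportSketch {
        public static void main(String[] args) throws Exception {
            // hypothetical connection settings
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/marmotta", "marmotta", "secret")) {
                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
                // in practice this reader would stream the generated CSV/TSV file
                StringReader csv = new StringReader("1,2,3\n4,2,5\n6,2,7\n");
                long rows = copy.copyIn(
                    "COPY triples_staging (subject, predicate, object) FROM STDIN WITH (FORMAT csv)",
                    csv);
                System.out.println("imported " + rows + " rows");
            }
        }
    }

MySQL's LOAD DATA INFILE and H2's CSVREAD would need their own dialect-specific variants, as you say.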