Hi Raffaele,

thanks for your ideas. I have been spending a lot of time thinking about how
to improve the performance of bulk imports. There are currently several
reasons why a bulk import is slow:
1) Marmotta uses (database) transactions to ensure correct behaviour and
consistent data in highly parallel environments; transactions, however,
introduce a big performance impact, especially when they get long (because
the database needs to keep a journal and merge it at the end)
2) before creating a node or triple, Marmotta needs to check whether that node
or triple already exists, because you don't want to have duplicates
3) Marmotta needs to issue a separate SQL command for every inserted triple
(because of 2)

3) could be addressed as you say, but even the plain Java JDBC API offers "batch
commands" that would improve performance, i.e. if you manage to run the
same statement many times in a sequence, performance improves greatly.
Unfortunately, I was not able to do this because I don't have a good solution
for 2). 3) depends on 2), because for every inserted triple I need to check
whether its nodes already exist, so there will be select statements before the
insert statements, which breaks the batching.
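Just to illustrate what I mean by batch commands, here is a rough sketch of a
batched triple insert over JDBC. The table and column names are invented and
do not match the real Marmotta schema; it assumes the node IDs are already
resolved, which is exactly the hard part:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sketch: batch-insert triples whose node IDs are already known.
void insertTriplesBatch(Connection con, List<long[]> triples) throws SQLException {
    PreparedStatement stmt = con.prepareStatement(
        "INSERT INTO triples (subject, predicate, object, context) VALUES (?,?,?,?)");
    try {
        for (long[] t : triples) {
            stmt.setLong(1, t[0]);
            stmt.setLong(2, t[1]);
            stmt.setLong(3, t[2]);
            stmt.setLong(4, t[3]);
            stmt.addBatch();
        }
        stmt.executeBatch();  // the driver sends the whole batch in one go
    } finally {
        stmt.close();
    }
}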

2) is a really tricky issue, because the check is needed to ensure data
integrity. I have been thinking about different options here. Keep in mind
that two tables are affected (triples and nodes) and both need to be
handled in a different way:
- if you know that the *triples* do not yet exist (e.g. empty database, or
the user assures that they do not exist), you can avoid the check for triple
existence, but the node check is still needed because several triples might
refer to the same node
- if the dataset is reasonably small, you can implement the node check
using an in-memory hashtable, which would be very fast (see the sketch after
this list); unfortunately you don't know this in advance, and once a node
exists the Marmotta caching backend takes care of it anyway as long as
Marmotta has memory, so the expensive part is checking for non-existence
rather than for existence
- you could also implement a persistent hash map (like MapDB) to keep track
of the node IDs, but I doubt it would give you much benefit over the
database lookup once the dataset is big
Even if you implement this solution, you would need a two-pass import to
achieve bulk-load behaviour in the database, because two tables are
affected, i.e. in the first pass you would import only the nodes, and in
the second pass only the triples.
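To make the hashtable idea concrete, here is a rough sketch of an in-memory
node lookup; class and method names are invented, and in reality the ID would
come from a database sequence rather than a counter. It only works while the
whole node set fits in memory:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: map the node value to its ID once it has been created,
// so only genuinely new nodes ever hit the database.
class NodeIdCache {
    private final Map<String, Long> ids = new HashMap<String, Long>();
    private long nextId = 1;  // placeholder; a real implementation would use a DB sequence

    long lookupOrCreate(String nodeValue) {
        Long id = ids.get(nodeValue);
        if (id == null) {
            id = nextId++;        // remember the new node for a later bulk insert
            ids.put(nodeValue, id);
        }
        return id;
    }
}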

Another possibility is to relax the data integrity constraints a bit (e.g.
allowing the same node to exist with different IDs), but I cannot foresee
the consequences of such a choice, as it goes against the data model.


1) is easy to solve by putting Marmotta into some kind of "maintenance mode",
i.e. while bulk importing, the import process holds an exclusive lock on the
database. Another (similar) solution is to provide a separate command-line
tool for importing into the database while Marmotta is not running at all.
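At the database level the "maintenance mode" could look roughly like this; note
that this sketch assumes PostgreSQL's LOCK TABLE syntax (other databases differ)
and the table names are again invented:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: run the import while holding an exclusive lock; the
// lock is released when the surrounding transaction commits or rolls back.
void importWithExclusiveLock(Connection con) throws SQLException {
    con.setAutoCommit(false);
    Statement stmt = con.createStatement();
    try {
        // other connections block on these tables until the import commits
        stmt.execute("LOCK TABLE nodes, triples IN ACCESS EXCLUSIVE MODE");
        // ... run the actual bulk import here ...
        con.commit();
    } catch (SQLException e) {
        con.rollback();
        throw e;
    } finally {
        stmt.close();
    }
}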


The solution I was going to implement as a result of this thinking is as
follows (a rough sketch follows after the list):
- a separate command-line tool that accesses the database directly
- when importing, all nodes and triples are first created only in memory
and stored in standard Java data structures (or in a temporary log on the
file system)
- when the import is finished, all nodes are bulk-inserted first and the
Java objects get their IDs assigned
- second, all triples are bulk-inserted with the proper node IDs
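Here is a very rough outline of the two passes; all table, column and method
names are invented, a real tool would read the data from an RDF parser, and
the generated keys are fetched per row because batched key retrieval is
driver-dependent:

import java.sql.*;
import java.util.*;

// Hypothetical sketch of the two-pass bulk import.
Map<String, Long> insertNodes(Connection con, Set<String> nodeValues) throws SQLException {
    Map<String, Long> ids = new HashMap<String, Long>();
    PreparedStatement stmt = con.prepareStatement(
            "INSERT INTO nodes (svalue) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
    for (String value : nodeValues) {
        stmt.setString(1, value);
        stmt.executeUpdate();
        ResultSet keys = stmt.getGeneratedKeys();
        if (keys.next()) {
            ids.put(value, keys.getLong(1));  // remember the ID for pass 2
        }
        keys.close();
    }
    stmt.close();
    return ids;
}

void insertTriples(Connection con, List<String[]> spo, Map<String, Long> ids) throws SQLException {
    PreparedStatement stmt = con.prepareStatement(
            "INSERT INTO triples (subject, predicate, object) VALUES (?,?,?)");
    for (String[] t : spo) {
        stmt.setLong(1, ids.get(t[0]));
        stmt.setLong(2, ids.get(t[1]));
        stmt.setLong(3, ids.get(t[2]));
        stmt.addBatch();
    }
    stmt.executeBatch();  // pass 2 can use proper JDBC batching
    stmt.close();
}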


If you want to try out different solutions, I'd be happy to see this problem
solved ;-)


Greetings,

Sebastian


2013/5/21 Raffaele Palmieri <raffaele.palmi...@gmail.com>

> Hi to all,
> I would like to propose a little change to the architecture of the Importer
> Service. Currently, for every triple a single SQL command is invoked from
> SailConnectionBase to persist the triple information in the DB. That's probably
> one of the major causes of the slowness of the import operation.
> I thought of a way to optimize that operation, for example building a csv,
> tsv, or *sv file that most RDBMSs are able to import in an
> optimized way.
> For example, MySQL has the Load Data Infile command, Postgresql has the Copy
> command, H2 has Insert into ... Select from Csvread.
> I am checking whether this modification is feasible; it will surely need a
> specialization of the SQL dialect depending on the RDBMS used.
> What do you think about it? Would it have too much impact?
> Regards,
> Raffaele.
>
