Hi Raffaele,

the idea was anyway to allow different backends besides KiWi, because each has its advantages and disadvantages (KiWi's advantages are the versioning and the reasoner). The issue is documented under https://issues.apache.org/jira/browse/MARMOTTA-85, and the individual backends have subsequent numbers; see e.g. https://issues.apache.org/jira/browse/MARMOTTA-89 for the SDB backend implementation.

Changing backends is currently not possible, but it is foreseen in the architecture, and it would take me about one day of work to change the platform so that other backends can be used. The main change will be in the SesameServiceImpl, which sets up the underlying triple store. The initialisation method for this service stacks together different sails depending on the configuration and is already very modular. The only thing that is currently hardcoded there is the initialisation of a new KiWiStore, but in principle it could be any Sesame Sail.
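To illustrate the idea (this is only a sketch, not the actual SesameServiceImpl code; MemoryStore merely stands in for whatever backend-specific sail would be configured):

import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.NotifyingSail;
import org.openrdf.sail.memory.MemoryStore;

public class BackendSelectionSketch {
    public static void main(String[] args) throws Exception {
        // The base sail is the only backend-specific part; today this is a hardcoded
        // KiWiStore, but it could be any Sesame Sail (e.g. a wrapper around Jena SDB/TDB).
        NotifyingSail base = new MemoryStore();

        // All further sails (interceptors, reasoner, versioning, ...) are stacked on top
        // of the base sail, which is what makes the initialisation modular.
        SailRepository repository = new SailRepository(base);
        repository.initialize();

        // ... use the repository ...

        repository.shutDown();
    }
}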
But there are some consequences and dependencies, e.g. the marmotta-versioning and marmotta-reasoner modules cannot be used if the backend is not KiWi, and I need to find a clean way to model these dependencies (Maven is unfortunately probably not enough, because several backends could be on the classpath while only one backend is selected - on the other hand we could simply create different backend configurations in Maven that each include only one backend - we will see).

If you want to try SDB or TDB, the first step would be to implement a clean wrapper that allows accessing Jena through the Sesame SAIL API. Peter Ansell has already worked on such adapters: https://github.com/ansell/JenaSesame - maybe this would be a good starting point. I will in parallel try to work on modularizing the backends. Not sure when I will be able to finish this, because other things are currently on my priority list...

Greetings,

Sebastian


2013/5/23 Raffaele Palmieri <raffaele.palmi...@gmail.com>

> Hi Sebastian, below are some considerations that induce me to think that Jena SDB (or TDB) could be a better solution, but I understand that this would have a big impact on the codebase, so I would proceed cautiously.
>
> On 23 May 2013 12:20, Sebastian Schaffert <sebastian.schaff...@gmail.com> wrote:
>
> > Hi Raffaele,
> >
> > 2013/5/22 Raffaele Palmieri <raffaele.palmi...@gmail.com>
> >
> > > On 22 May 2013 15:04, Andy Seaborne <a...@apache.org> wrote:
> > >
> > > > What is the current loading rate?
> > >
> > > Tried a test with a graph of 661 nodes and 957 triples: it took about 18 sec. Looking at the triples, the average rate is 18.8 ms per triple; tested on Tomcat with a maximum size of 1.5 GB.
> >
> > This is a bit too small for a real test, because you will have a high influence of side effects (like cache initialisation). I have done some performance comparisons with importing about 10% of GeoNames (about 15 million triples, 1.5 million resources). The test uses a specialised parallel importer that was configured to run 8 importing threads in parallel. Here are some figures on different hardware:
> > - VMWare, 4 CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds, 8 in parallel). In the case of VMWare, the CPU is waiting most of the time for I/O, so apparently the harddisk is slow. Could also be related to an older Linux kernel or the host the instance is running on (it might not have 4 physical CPUs assigned to the instance).
> > - QEmu, 4 CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5 seconds, 8 in parallel). The change to SSD does not deliver the expected performance gain; the limit is mostly the CPU power (load always between 90-100%). However, the variance in the average time per 100 resources is lower, so the results are more stable over time.
> > - Workstation, 8 CPU, 24GB RAM, SSD: 0:40h (avg per 100 resources: 1-2 seconds, 8 in parallel). Running on physical hardware obviously shows the highest performance. All 8 CPUs between 85-95% load.
> >
> > In this setup, my observation was that about 80% of the CPU time is actually spent in Postgres, and most of the database time in SELECTs (not INSERTs) because of checking whether a node or triple already exists. So the highest performance gain will be achieved by reducing the load on the database. There is already a quite good caching system in place (using EHCache); unfortunately the caching cannot solve the issue of checking for non-existence (a cache can only help when checking for existence). This is why especially the initial import is comparably slow.
>
> You are right that my test of about 1000 triples was limited, especially with fewer resources than yours; but with the same graph and the same resources Jena SDB offers better performance. However, I agree with you that we will need more benchmarks. In favor of Jena, both SDB and TDB already have command line tools to access the database directly.
>
> > Conceptually, when inserting a triple, the workflow currently looks as follows:
> >
> > 1. for each node of subject, predicate, object, context:
> > 1.1. check for existence of the node
> > 1.1.a the node exists in the cache: return its database ID
> > 1.1.b the node does not exist in the cache: look in the database whether it exists there (SELECT) and return its ID, or null
> > 1.2. if the database ID is null:
> > 1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the sequence simulation table (MySQL: SELECT) to get the next database ID and assign it to the node
> > 1.2.2 store the node in the database (INSERT) and add it to the cache
> > 2. check for existence of the triple:
> > 2.a the triple exists in the cache: return its database ID
> > 2.b the triple does not exist in the cache: look in the database whether it exists there (SELECT) and return its ID, or null
> > 3. if the triple ID is null:
> > 3.1 query the sequence or the sequence simulation table (MySQL) to get the next database ID for triples and assign it to the triple
> > 3.2 store the triple in the database (INSERT) and add it to the cache
> >
> > So, in the worst case (i.e. all nodes are new and the triple is new, so nothing can be answered from the cache) you will have:
> > - 4 INSERT commands (three nodes, 1 triple); these are comparably cheap
> > - 4 SELECT commands for existence checking (three nodes, 1 triple); these are comparably expensive
> > - 4 SELECT from sequence commands in the case of PostgreSQL or H2 (very cheap), or 4 SELECT from table commands in the case of MySQL (comparably cheap, but not as good as a real sequence)
> > What is even worse is that the INSERT and SELECT commands will be interwoven, i.e. there will be alternating SELECTs and INSERTs, which databases do not really like.
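(Just to illustrate the worst case above in code - this is only a sketch, and the table, column and sequence names are made up rather than the actual KiWi schema; the point is the alternating SELECT/INSERT pattern per node, which then repeats analogously for the triple itself:)

import java.sql.*;

class WorstCaseSketch {
    // One existence check (SELECT), one sequence lookup (SELECT) and one INSERT per new
    // node - executed for subject, predicate, object and context before the triple itself.
    static long lookupOrCreateNode(Connection con, String uri) throws SQLException {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT id FROM nodes WHERE uri = ?")) {        // comparably expensive existence check
            check.setString(1, uri);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) return rs.getLong(1);            // node already in the database
            }
        }
        long id;
        try (Statement seq = con.createStatement();
             ResultSet rs = seq.executeQuery("SELECT nextval('seq_nodes')")) {  // very cheap
            rs.next();
            id = rs.getLong(1);
        }
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO nodes (id, uri) VALUES (?, ?)")) { // cheap, but interleaved with the SELECTs
            insert.setLong(1, id);
            insert.setString(2, uri);
            insert.executeUpdate();
        }
        return id;
    }
}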
> This workflow is for the duplication check; from the documentation I see that the Jena SDB loader already performs duplicate suppression.
>
> > To optimize the performance, the best options are therefore:
> > - avoiding alternating SELECTs and INSERTs as much as possible (e.g. at least batching the node insertions for each triple)
> > - avoiding the comparably expensive existence checks (e.g. another way of caching/looking up that supports checking for non-existence)
> >
> > If bulk import is then still slow, it might make sense to look into the database-specific bulk loading commands you suggested.
> >
> > If I find some time, I might be able to look into the first optimization (i.e. avoiding too many alternating SELECT and INSERT commands). Maybe a certain improvement can already be achieved by optimizing this per triple.
> >
> > If you want to try out more sophisticated improvements or completely alternative ways of bulk loading, I would be very happy to see it. Just make sure the database schema and integrity constraints are kept as they are, and the rest will work. The main constraint is that nodes are unique (i.e. each URI or Literal has exactly one database row) and non-deleted triples are unique (i.e. each non-deleted triple has exactly one database ID).
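(For the first of those optimizations, plain JDBC batching already goes a long way - again only a sketch with made-up table and sequence names: collect the nodes of a chunk that turned out to be new, then write them in a single batch instead of alternating SELECTs and INSERTs per node:)

import java.sql.*;
import java.util.List;

class BatchInsertSketch {
    // Insert all new nodes of a chunk in one JDBC batch; the existence checks for the
    // chunk are assumed to have been done beforehand, so SELECTs and INSERTs no longer alternate.
    static void insertNewNodes(Connection con, List<String> newUris) throws SQLException {
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO nodes (id, uri) VALUES (nextval('seq_nodes'), ?)")) {
            for (String uri : newUris) {
                insert.setString(1, uri);
                insert.addBatch();            // collect instead of executing immediately
            }
            insert.executeBatch();            // execute the whole chunk as one batch
        }
    }
}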
> > > > The Jena SDB bulk loader may have some ideas you can draw on. It bulk loads a chunk (typically 10K) of triples at a time and uses DB temporary tables as working storage. The database layout is a triples+nodes database layout. SDB manipulates those tables in the DB to find new nodes to add to the node table and new triples to add to the triples table as single SQL operations. The original designer may be around on dev@jena.a.o
> > >
> > > This design looks interesting and it seems to be a similar approach to my idea; it could be investigated. In that case, can we think about using Jena SDB in Marmotta?
> >
> > This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame Sail, and actually there is already an issue for this in Jira. However, when doing this you will lose support for the Marmotta/KiWi Reasoner and Versioning.
>
> That would be a good thing, as it avoids too much refactoring. For the KiWi Reasoner and Versioning, a move to the Jena RDF API could be needed.
>
> > My suggestion would instead be to look at how Jena SDB implements the bulk import and try a similar solution.
>
> We could implement the same approach from scratch (with queues, chunks and threads) and combine it with JDBC batch processing to obtain a better result, but wouldn't it make more sense to directly use the already implemented solution?
>
> > But if we start with the optimizations I have already suggested, there might be a huge gain already. It just has not been in our focus right now, because the scenarios we are working on do not require bulk-loading huge amounts of data. Data consistency and parallel access were more important to us. But it would be a nice feature to be able to run a local copy of GeoNames or DBPedia using Marmotta ;-)
>
> Yes, it would be nice :)
>
> > Greetings,
> > Sebastian
>
> Greetings,
> Raffaele.