Hi,

For anyone following the dspace-changelog email list, you may have noticed
one or two commits bearing my name. So, a little explanation of what has
been happening as we head into the release candidate / testathon.

The majority of commits in October were related to static analysis of the
code. There are three main (free) tools that handle static analysis -
Findbugs (probably the best at finding genuine issues to resolve), PMD
(finds more stuff, but a fair bit of noise and/or things that aren't that
critical), and Checkstyle (rarely anything serious, but nice to have
consistent style across the code). All of these have plugins for the various
IDEs, although if you happen to be using IntelliJ IDEA (and there are few
reasons not to - there is a version now that is free for everyone to use),
there is a 'QAPlug' plugin that incorporates all three of those tools, plus
Hammurapi (another analysis tool). QAPlug makes it easy to run all of these
tools over your code (although, be warned in the case of DSpace - it takes a
fair amount of time and memory).

QAPlug resulted in something like 18,000 issues reported in DSpace. Most of
those are fairly trivial - unused imports, missing braces, naming
conventions. I've fixed some of those, but not all - it seems a bit of a
thankless task, but it's worth getting the level of noise down, so that you
can see the other issues that actually are problems (and hopefully care
enough to do something about them).

And yes - there were quite a few other issues. Some genuine bugs (like a
misspelling of 'THUMBNAIL'), and some very suspicious bits of multi-threaded
code. It also picked up quite a few efficiency issues - needless
generation of String objects, excessive generation of String objects through
concatenation, and inefficient loop handling of maps. Whilst I have corrected these
occurrences, and made the code more efficient in that regard, I did err on
the side of correctness - so there have also been changes that made things
maybe less efficient, but hopefully less likely to break. Hopefully, the
testathon will bear this out! Overall though, the balance of those changes
should result in improved efficiency.

But that's barely scratching the surface of performance and scalability.
Early on, I did change the browse 'prune' query to use the more efficient
form. But, over the last week or so, I've put a lot of time into load
testing and profiling DSpace. For note, I've been using YourKit as my
profiler of choice - and they have recently released version 9.5, which is
especially nice (and committers, we have free licenses - please use them :).

I'll start though, by introducing a new class -
org.dspace.testing.PubMedToImport. This uses a SAX parser to read a
PubMed XML resultset and produce DSpace import packages. It's far from a
complete (or even absolutely correct) transformation of the PubMed data.
But it does spit out items with mostly correct titles, authors, MeSH subject
headings, and abstracts - as far as metadata only records go, that's enough
for a fairly decent test of most of our subsystems. And being SAX based, it
can work on arbitrarily large files. For example, if you go to
http://www.ncbi.nlm.nih.gov/pubmed and search for 'immune', it will match
380,000 records. You can use the 'send to' function (top right) to generate
a file, in the XML format. This should generate a file of around 3.5 GB
(although there is a tendency for the download to not complete correctly).
Regardless of the size, or being complete/well formed, the SAX parser can
deal with the bulk of it - running PubMedToImport over the 3.5GB file will
quickly generate a large sample dataset to import (as mentioned earlier, of
around 380,000 items).
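
PubMedToImport itself isn't reproduced here, but the streaming approach it takes can be sketched roughly like this - a minimal handler (class and method names are my own invention; only the PubMed element name ArticleTitle is real) that pulls titles out of an arbitrarily large, possibly truncated file without ever holding more than one record in memory:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PubMedSketch {
    // Stream the file element-by-element; only the current title is ever
    // buffered, so memory use stays flat however large the input file is.
    public static List<String> titles(InputStream in) {
        final List<String> found = new ArrayList<String>();
        try {
            DefaultHandler handler = new DefaultHandler() {
                private StringBuilder text;
                @Override
                public void startElement(String uri, String local,
                        String qName, Attributes atts) {
                    if ("ArticleTitle".equals(qName)) text = new StringBuilder();
                }
                @Override
                public void characters(char[] ch, int start, int len) {
                    if (text != null) text.append(ch, start, len);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    if ("ArticleTitle".equals(qName)) {
                        found.add(text.toString());
                        text = null; // drop the buffer immediately
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            // tolerate a truncated download: keep whatever parsed cleanly
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticleSet><PubmedArticle><Article>"
                + "<ArticleTitle>Immune response</ArticleTitle>"
                + "</Article></PubmedArticle></PubmedArticleSet>";
        System.out.println(titles(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

The catch-and-keep behaviour is what lets it chew through a download that never completed correctly.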

OK, now we have some data, let's take it to the profiler. First thing to go
was a pruneExcess on every metadata index - as I half suspected thinking it
through, this was a hangover from the very first implementation of this
browse code, that kept near identical item tables for each metadata index.
Prior to release, this was changed to just one bi_item table with multiple
sorting columns, and so each of those pruneExcess calls was actually operating
on the bi_item table - needlessly re-performing exactly the same operation.

Good start, but we can do better. The reason the pruning takes a long time
is due to the large table join in the metadata indexes between the distinct
table (the unique metadata values), and the map table (mapping the distinct
term to the items that contain it). Although the performance is OK, and it
remains reasonably efficient so long as it can do a hash join (and we can
extend that by increasing the work_mem in Postgres), we can take advantage
of the fact that we know what item we are indexing, and therefore can make
the queries only deal with rows that may be directly affected. When updating
the metadata indexes, we now return a structure saying which distinct ids
have been added, which have been retained, and which removed. From this, we
can just attempt to prune the distinct values that have been removed from
the mappings. If we are deleting the item, then we delete all the mappings,
but return all the distinct values that were unmapped - and then attempt to
prune them.
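
As a rough illustration of the bookkeeping (the real code works in SQL against the distinct and map tables of each metadata index; this in-memory stand-in, with names of my own choosing, just models the added/retained/removed split):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for one metadata index:
// distinct-value id -> the items currently mapped to it.
public class PruneSketch {
    final Map<Integer, Set<Integer>> mappings = new HashMap<Integer, Set<Integer>>();
    int rowsExamined = 0;

    // Re-index one item. Returns the distinct ids it no longer maps to -
    // the only values that can possibly have become unreferenced.
    Set<Integer> updateItem(int itemId, Set<Integer> newDistinctIds) {
        Set<Integer> removed = new HashSet<Integer>();
        for (Map.Entry<Integer, Set<Integer>> e : mappings.entrySet()) {
            if (e.getValue().remove(itemId) && !newDistinctIds.contains(e.getKey())) {
                removed.add(e.getKey()); // candidate for pruning
            }
        }
        for (Integer d : newDistinctIds) {
            Set<Integer> items = mappings.get(d);
            if (items == null) mappings.put(d, items = new HashSet<Integer>());
            items.add(itemId);
        }
        return removed;
    }

    // Prune only the handful of candidates, not the whole distinct table.
    void prune(Set<Integer> candidates) {
        for (Integer d : candidates) {
            rowsExamined++;
            Set<Integer> items = mappings.get(d);
            if (items != null && items.isEmpty()) mappings.remove(d);
        }
    }
}
```

The point is in rowsExamined: pruning now touches rows proportional to what one item changed, not to the size of the whole distinct/map join.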

We can still do better though. If we are importing items, we're adding a lot
of rows to the database - not just browse records, but metadata, items,
workspaces (temporarily). In all cases apart from metadata, the code would
create the row in the database (DatabaseManager.create()), form a TableRow
structure, and then - after those values had changed - write it back to the
database. But DatabaseManager supported creating a TableRow without an
underlying record (DatabaseManager.row()), and inserting your updated
version directly (DatabaseManager.insert()). Whilst I couldn't fix all of
those instances (for example, the way an Item is created), there were a lot
of places where the create()/update() could be replaced by row()/insert() -
immediately saving many, many UPDATE operations.
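
The difference between the two patterns, with a stand-in DatabaseManager that does nothing but count database round-trips (the method names mirror the real ones; the bodies obviously don't):

```java
import java.util.HashMap;
import java.util.Map;

// ops counts the SQL statements issued; the real class talks to Postgres,
// this sketch only models the round-trips.
public class InsertPatternSketch {
    static int ops = 0;

    static Map<String, Object> create(String table) { ops++; return new HashMap<String, Object>(); } // INSERT an empty row
    static void update(Map<String, Object> row)     { ops++; }                                       // UPDATE with the real values
    static Map<String, Object> row(String table)    { return new HashMap<String, Object>(); }        // no database call at all
    static void insert(Map<String, Object> row)     { ops++; }                                       // one INSERT with the values

    public static void main(String[] args) {
        Map<String, Object> a = create("bitstream");
        a.put("name", "x");
        update(a);                       // old pattern: 2 statements
        int oldPattern = ops;

        Map<String, Object> b = row("bitstream");
        b.put("name", "y");
        insert(b);                       // new pattern: 1 statement
        System.out.println(oldPattern + " vs " + (ops - oldPattern));
    }
}
```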

That brings us down to an average of 4msecs to create a new record in the
database. Yeah, that sounds fast - but we are doing it a lot in importing
data. Can we make it faster / more efficient? We sure can! Due to the
handling of the primary keys, the create() / insert() actually perform two
operations - one to select a next id, and one to insert a record into the
database (so when we create() / update(), that's actually 3 database
operations). Postgres supports a 'RETURNING' clause on inserts, allowing us
to INSERT with the getnextid() as a value, and return it directly (Oracle
can also support a similar construct, but it's more complicated, so I left
the two phase operation for that). Well, that's just halved the time taken
for any create() / insert() operation. That soon adds up!
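
A hypothetical helper showing the shape of that single statement - getnextid() is the sequence function mentioned above; the helper itself, and the table/column names used to exercise it, are mine:

```java
// Builds a Postgres INSERT where the primary key is generated inside the
// statement and handed straight back via RETURNING, collapsing the old
// "SELECT next id, then INSERT" pair into one round-trip.
public class ReturningSketch {
    static String insertReturning(String table, String idColumn, String... columns) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
                .append(" (").append(idColumn);
        for (String c : columns) sql.append(", ").append(c);
        sql.append(") VALUES (getnextid('").append(table).append("')");
        for (String c : columns) sql.append(", ?");
        return sql.append(") RETURNING ").append(idColumn).toString();
    }

    public static void main(String[] args) {
        System.out.println(insertReturning("item", "item_id", "in_archive"));
    }
}
```

Executed through JDBC, the generated id comes back in the statement's result set - no second query.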

Another place we were losing a lot of time was in canonicalizing column
names in the table row - doing a toUpperCase() or toLowerCase() on a string
every single time. That's expensive in both time and object creation. And in
most cases, we were doing a check on whether the column existed (and pretty
much everywhere we pass in columns in lower case anyway). So, taking advantage
of that, I combined the check of the column name with the canonicalization -
it canonicalizes the map when you create a TableRow, but whenever you
attempt to use a column it first tries to see if the column name matches
before canonicalizing it. That saves a massive number of string casing
operations.
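
A sketch of that fast path / slow path split (the real TableRow differs; the names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Keys are canonicalized (lower-cased) once, when the row is built; a
// lookup tries the name exactly as given first, and only pays for a
// toLowerCase() on a miss.
public class CanonicalSketch {
    private final Map<String, Object> data = new HashMap<String, Object>();
    int casingOps = 0; // how many times we actually had to re-case a name

    void put(String column, Object value) {
        data.put(column.toLowerCase(), value); // canonicalize once, at creation
    }

    Object get(String column) {
        if (data.containsKey(column)) return data.get(column); // fast path
        casingOps++;                                           // slow path
        return data.get(column.toLowerCase());
    }
}
```

Since callers almost always pass lower-case names already, the slow path is rarely taken, and the per-access string churn disappears.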

Similarly, the TableRowIterator is now handed the ColumnInfo list that
matches the ResultSet that it is iterating over. We don't need to go parsing
the ResultSet metadata every time we call .next()!

The net result is that for a 50,000 item repository, the most expensive
query you'll run is.... SELECT 1. The test that is executed on borrowing
connections from the pool - because the pool doesn't use PreparedStatements.
Importing a few thousand records under the profiler at that level showed
only a few hundred milliseconds being used in total for all the other
SELECTs and INSERTs.

In fact, it was more expensive to update the Lucene index, where each item
import has to open the writer, commit and close the writer. So I tweaked
that. I've created an IndexingTask class inside search, that holds what type
of operation we want to do (ie. UPDATE or DELETE), along with the identifier
(handle Term), and the Document (if necessary) - splitting the creation of
that task (and the parsing of the item into a Lucene Document) away from the
execution of it (deleting or updating the index record). This allowed for
the easy creation of a batch processing mode - where each task is added to a
queue (actually a Map - for reasons that will become clear later), and if
the queue exceeds the limit (default of 20 documents), it is flushed - the
indexWriter is only opened / committed once per flush, making a 20-fold (or
more / less, depending on your configuration) reduction in the number of
commits.
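
A sketch of that batching logic - IndexingTask here is a stub, not the real class in org.dspace.search; the two things being illustrated are the Map-keyed queue and the once-per-flush commit:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// The queue is a Map keyed on the item's handle, so a later task for the
// same item replaces the earlier one; the writer is committed once per flush.
public class BatchIndexSketch {
    static final int BATCH_FLUSH_LIMIT = 20; // the default mentioned above

    static class IndexingTask {
        final String op;       // "UPDATE" or "DELETE"
        final String document; // stands in for the Lucene Document
        IndexingTask(String op, String document) { this.op = op; this.document = document; }
    }

    final Map<String, IndexingTask> queue = new LinkedHashMap<String, IndexingTask>();
    int commits = 0;

    void enqueue(String handle, IndexingTask task) {
        queue.put(handle, task); // repeated edits to one item consolidate here
        if (queue.size() >= BATCH_FLUSH_LIMIT) flush();
    }

    void flush() {
        if (queue.isEmpty()) return;
        // open the IndexWriter, apply every queued task, commit, close
        queue.clear();
        commits++;
    }
}
```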

It is up to the code to set DSIndexer into the batch processing mode -
otherwise it will still commit each indexed item individually - but all the
major batch tasks (import, reindexing, filter-media) have been updated to do
so. But having this split between creating and executing tasks makes it easy
to add a delayed execution mode for the web applications too. So I did. It's
disabled by default, but if you set the length of the delay (in milliseconds)
in the config, then instead of immediately writing each item, it will add the
item to the queue and start a new Thread (which sleeps for the specified
period). When the queue is full or the Thread wakes up (whichever comes
first), it will write out all of the queued indexing tasks.

This may not be attractive for the standard web submissions. But bear in
mind how updates are written if you are editing an item (particularly if you
are adding multiple fields to a record). And remember that the queue is
actually a Map - so if you make many changes to a single item during the
delay period - it will actually consolidate this into a single indexing
task. And it could prove beneficial for SWORD - reducing the frequency in
which it attempts to write the Lucene index when it is getting a flood of
packages submitted by a client.

[Oh, and I also fixed an OutOfMemory exception in the item importer - it
wasn't clearing the Context cache, so you quickly built up a whole load of
imported Items in memory until the point it broke. Now it will import
'forever'].

But, the bottom line, how does this perform? Well, on a standard MacBook
Pro, 4GB RAM - and with a default, local installation of Postgres - I
imported 379,990 records (the PubMed 'immune' search) into my [completely
vanilla DSpace trunk] repository - a repository that already contained
199,492 items. At the end of the import run, the repository would contain
579,482 items. So how long did it take?

10 hours.

Or more precisely, 36154061 milliseconds. That meant it was averaging over
10 items every second - or 95 milliseconds per item. An average doesn't
quite tell the whole story though, as each individual item was actually
processed a bit faster than that, but the process has to pause periodically
to a) flush the Lucene indexing tasks, and b) merge the Lucene records
(appears to happen every 5 commits - technically, it doesn't have to pause
to do this, as it operates on a background thread, but the import is so damn
fast it has a queue of tasks to write out before the merge is complete, and
has to wait for the IndexWriter to be able to lock the directory).

But note that apart from the Lucene indexing queue, none of the changes were
targeted specifically at batch processing operations. So that sub-100msec
time to create a new Item applies equally to creating records in the web
interface. Updating records in the web interface. Submitting items via
SWORD.

I'll leave you with one thought/notable issue that exists in the web
application if you try to use a repository with hundreds of thousands of
items. Although Postgres will (if it has enough memory) use index scans to
walk through the browse table, paging to the end of the index is quite slow.
In the immediate term, it's quicker to invert the order of the sort, so you
are paging near the start of the order. But you can speed it up further by
making 'jump to' a point filter to records that begin with those characters,
instead of OFFSETing to that point in the index (so clicking on 'z' would
only show items where the sort value is LIKE 'z%'). That kind of filtering
is really fast and efficient, even with hundreds of thousands of records. So
if you need to tweak the front end performance before anyone else gets a
chance - you know where to start.
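
The two strategies side by side, as a sketch (bi_item is the real table name; the sort column name, and the helper itself, are illustrative):

```java
// OFFSET forces Postgres to walk the index all the way to the requested
// page; a prefix filter on the indexed sort column touches only the rows
// that match, however big the table is.
public class BrowseQuerySketch {
    static String jumpByOffset(long offset, int pageSize) {
        return "SELECT * FROM bi_item ORDER BY sort_value OFFSET " + offset
                + " LIMIT " + pageSize;
    }

    static String filterByPrefix(String prefix, int pageSize) {
        return "SELECT * FROM bi_item WHERE sort_value LIKE '" + prefix
                + "%' ORDER BY sort_value LIMIT " + pageSize;
    }
}
```

(In real code the prefix would go in as a bind parameter, not by string concatenation - this just shows the shape of the two queries.)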

G
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
