Graham, congratulations on this fantastic work! I started applying FindBugs advice myself but never finished, because it would have made my code diverge completely from the trunk. Having your corrections in the trunk is really nice!
I suppose you corrected the nasty issue with occasionally unclosed SQL transactions that FindBugs flags? Fixing that removed many of the deadlocks I was seeing in operation. But, as you say, this will need an in-depth reassessment of the whole DSpace codebase.

I have been importing PubMed records for years now. With "current" DSpace, this takes about 1 second per item, so a tenfold improvement is really impressive (and useful). I will also investigate your many other suggestions.

THANKS!
Christophe

In reply to:
--------------
Date: Mon, 8 Nov 2010 17:59:11 +0000
From: Graham Triggs <[email protected]>
Subject: [Dspace-devel] DSpace 1.7 Scalability, Performance and Correctness
To: dspace-devel <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

For anyone following the dspace-changelog email list, you may have noticed one or two commits bearing my name. So, a little explanation of what has been happening as we head into the release candidate / testathon.

The majority of commits in October were related to static analysis of the code. There are three main (free) tools that handle static analysis: FindBugs (probably the best at finding genuine issues to resolve), PMD (finds more, but with a fair bit of noise and/or things that aren't that critical), and Checkstyle (rarely anything serious, but nice for keeping a consistent style across the code). All of these have plugins for the various IDEs, although if you happen to be using IntelliJ IDEA (and there are few reasons not to - there is now a version that is free for everyone to use), there is a 'QAPlug' plugin that incorporates all three of those tools, plus Hammurapi (another analysis tool). QAPlug makes it easy to run all of these tools over your code (although, be warned, in the case of DSpace it takes a fair amount of time and memory).

QAPlug reported something like 18,000 issues in DSpace. Most of those are fairly trivial: unused imports, missing braces, naming conventions.
I've fixed some of those, but not all - it seems a bit of a thankless task, but it's worth getting the level of noise down so that you can see the other issues that actually are problems (and hopefully care enough to do something about them).

And yes - there were quite a few other issues. Some genuine bugs (like a misspelling of 'THUMBNAIL'), and some very suspicious bits of multi-threaded correctness. It also picked up quite a few efficiency issues: needless generation of String objects, excessive generation of String objects through concatenation, inefficient loop handling of maps. Whilst I have corrected these occurrences and made the code more efficient in that regard, I did err on the side of correctness - so there have also been changes that make things maybe less efficient, but hopefully less likely to break. Hopefully, the testathon will bear this out! Overall though, the balance of those changes should result in improved efficiency.

But that's barely scratching the surface of performance and scalability. Early on, I did change the browse 'prune' query to use the more efficient form. But over the last week or so, I've put a lot of time into load testing and profiling DSpace. For note, I've been using YourKit as my profiler of choice - they have recently released version 9.5, which is especially nice (and committers, we have free licenses - please use them :).

I'll start, though, by introducing a new class: org.dspace.testing.PubMedToImport. This uses a SAX parser to read a PubMed XML resultset and produce DSpace import packages. It's far from being a complete (or even absolutely correct) transformation of the PubMed data, but it does spit out items with mostly correct titles, authors, MeSH subject headings, and abstracts - and as far as metadata-only records go, that's enough for a fairly decent test of most of our subsystems. And being SAX based, it can work on arbitrarily large files.
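As a rough illustration of why SAX handles arbitrarily large input, here is a minimal, hypothetical handler in the same spirit - the class name, method name, and element names below are my own sketch, not the actual PubMedToImport code. It streams a PubMed-style resultset and collects article titles without ever building a DOM in memory:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a SAX-based PubMed reader: events arrive element by
// element, so memory use stays flat no matter how large the input file is.
public class PubMedSaxSketch {
    public static List<String> extractTitles(String xml) throws Exception {
        final List<String> titles = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;
            private final StringBuilder buf = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("ArticleTitle".equals(qName)) { inTitle = true; buf.setLength(0); }
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (inTitle) buf.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("ArticleTitle".equals(qName)) { inTitle = false; titles.add(buf.toString()); }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return titles;
    }
}
```

In the real class, the endElement callbacks would write out DSpace import packages as each record completes, rather than accumulating results in a list.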
For example, if you go to http://www.ncbi.nlm.nih.gov/pubmed and search for 'immune', it will match around 380,000 records. You can use the 'send to' function (top right) to generate a file in the XML format. This should produce a file of around 3.5 GB (although there is a tendency for the download not to complete correctly). Regardless of the size, or of the file being complete / well formed, the SAX parser can deal with the bulk of it - running PubMedToImport over the 3.5 GB file will quickly generate a large sample dataset to import (as mentioned earlier, around 380,000 items).

OK, now we have some data, let's take it to the profiler.

First thing to go was a pruneExcess on every metadata index - as I half suspected thinking it through, this was a hangover from the very first implementation of this browse code, which kept near-identical item tables for each metadata index. Prior to release, this was changed to just one bi_item table with multiple sorting columns, so each of those pruneExcess calls was actually operating on the bi_item table - needlessly re-performing exactly the same operation.

Good start, but we can do better. The reason the pruning takes a long time is the large table join in the metadata indexes between the distinct table (the unique metadata values) and the map table (mapping each distinct term to the items that contain it). Although the performance is OK, and it remains reasonably efficient so long as it can do a hash join (and we can extend that by increasing work_mem in Postgres), we can take advantage of the fact that we know which item we are indexing, and therefore make the queries deal only with rows that may be directly affected. When updating the metadata indexes, we now return a structure saying which distinct ids have been added, which have been retained, and which have been removed. From this, we can attempt to prune just the distinct values that have been removed from the mappings.
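The delta structure just described might be sketched like this - a toy model under the assumption that distinct values are tracked by integer id; the class and field names are my illustration, not the actual browse-indexing code:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compare the distinct-value ids an item mapped to before
// and after a metadata update, so pruning only needs to consider ids that were
// actually removed, instead of joining the whole distinct and map tables.
public class MetadataDelta {
    public final Set<Integer> added = new HashSet<Integer>();
    public final Set<Integer> retained = new HashSet<Integer>();
    public final Set<Integer> removed = new HashSet<Integer>();

    public static MetadataDelta of(Set<Integer> before, Set<Integer> after) {
        MetadataDelta d = new MetadataDelta();
        for (Integer id : after) {
            if (before.contains(id)) { d.retained.add(id); } else { d.added.add(id); }
        }
        for (Integer id : before) {
            if (!after.contains(id)) { d.removed.add(id); }  // only these need pruning
        }
        return d;
    }
}
```

With a structure like this, the prune query can be restricted to `d.removed` - typically a handful of ids per item - rather than scanning every distinct value in the index.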
If we are deleting the item, then we delete all the mappings, but return all the distinct values that were unmapped - and then attempt to prune those.

We can still do better though. If we are importing items, we're adding a lot of rows to the database - not just browse records, but metadata, items, workspaces (temporarily). In all cases apart from metadata, the code would create the row in the database (DatabaseManager.create()), populate a TableRow structure, and then, once those values had changed, write it back to the database. But DatabaseManager supports creating a TableRow without an underlying record (DatabaseManager.row()) and inserting your updated version directly (DatabaseManager.insert()). Whilst I couldn't fix all of those instances (for example, the way an Item is created), there were a lot of places where the create()/update() pair could be replaced by row()/insert() - immediately saving many, many UPDATE operations.

That brings us down to an average of 4 msecs to create a new record in the database. Yeah, that sounds fast - but we are doing it a lot when importing data. Can we make it faster / more efficient? We sure can! Due to the handling of the primary keys, create() / insert() actually perform two operations - one to select a next id, and one to insert a record into the database (so a create() / update() pair is actually three database operations). Postgres supports a 'RETURNING' clause on inserts, allowing us to INSERT with getnextid() as a value and return it directly (Oracle can support a similar construct, but it's more complicated, so I left the two-phase operation in place there). Well, that's just halved the time taken for any create() / insert() operation. That soon adds up!

Another place we were losing a lot of time was in canonicalizing column names in the table row - every single access doing a toUpperCase() or toLowerCase() on a string. Expensive both time- and object-wise.
And in most cases, we were also doing a check on whether the column existed (and pretty much everywhere, we pass in column names in lower case anyway). So, taking advantage of that, I combined the check of the column name with the canonicalization - the map is canonicalized when you create a TableRow, and whenever you attempt to use a column, it first tries to see if the column name matches as given before canonicalizing it. That saves a massive number of string-casing operations. Similarly, the TableRowIterator is now handed the ColumnInfo list that matches the ResultSet it is iterating over - we don't need to go parsing the ResultSet metadata every time we call next()!

The net result is that for a 50,000 item repository, the most expensive query you'll run is... SELECT 1. That's the test executed on borrowing connections from the pool - costly only because the pool doesn't use PreparedStatements. Importing a few thousand records under the profiler at that level showed only a few hundred milliseconds being used in total for all the other SELECTs and INSERTs.

In fact, it was more expensive to update the Lucene index, where each item import has to open the writer, commit, and close the writer. So I tweaked that. I've created an IndexingTask class inside search, which holds the type of operation we want to do (i.e. UPDATE or DELETE), along with the identifier (handle Term) and the Document (if necessary) - splitting the creation of that task (and the parsing of the item into a Lucene Document) away from the execution of it (deleting or updating the index record). This allowed for the easy creation of a batch processing mode - each task is added to a queue (actually a Map - for reasons that will become clear later), and if the queue exceeds the limit (default of 20 documents), it is flushed. The indexWriter is only opened / committed once per flush, giving a 20-fold (or more / less, depending on your configuration) reduction in the number of commits.
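The queue-and-flush behaviour just described might be sketched like this - a toy model in which the class and member names are my own, not the actual DSIndexer API, and the real flush would open the Lucene IndexWriter, apply every queued task, and commit once:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the batch indexing queue: tasks are keyed by item
// handle, so repeated updates to the same item collapse into one task, and the
// (expensive) writer open/commit happens once per flush instead of per item.
public class BatchIndexQueue {
    public enum Op { UPDATE, DELETE }

    private final Map<String, Op> queue = new LinkedHashMap<String, Op>();
    private final int limit;
    public int flushes = 0;  // counts how many times the writer would be opened

    public BatchIndexQueue(int limit) { this.limit = limit; }

    public void add(String handle, Op op) {
        queue.put(handle, op);  // a later task for the same handle replaces the earlier one
        if (queue.size() >= limit) { flush(); }
    }

    public void flush() {
        if (queue.isEmpty()) { return; }
        // real code: open IndexWriter, execute every queued task, commit, close
        queue.clear();
        flushes++;
    }

    public int pending() { return queue.size(); }
}
```

Keying the queue on the item's handle is what makes the Map choice pay off later: many edits to one item within a flush window cost only a single index write.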
It is up to the code to put DSIndexer into batch processing mode - otherwise it will still commit each indexed item individually - but all the major batch tasks (import, reindexing, filter-media) have been updated to do so.

Having this split between creating and executing tasks also makes it easy to add a delayed execution mode for the web applications. So I did. It's disabled by default, but if you set the length of the delay (in milliseconds) in the config, then instead of immediately writing each item, it will add the task to the queue and start a new Thread (which sleeps for the specified period). When the queue is full or the Thread wakes up (whichever comes first), it writes out all of the queued indexing tasks. This may not be attractive for standard web submissions. But bear in mind how updates are written when you are editing an item (particularly if you are adding multiple fields to a record) - and remember that the queue is actually a Map, so if you make many changes to a single item during the delay period, it will consolidate them into a single indexing task. It could also prove beneficial for SWORD - reducing the frequency with which it attempts to write the Lucene index when a client is submitting a flood of packages.

[Oh, and I also fixed an OutOfMemory exception in the item importer - it wasn't clearing the Context cache, so you quickly built up a whole load of imported Items in memory until the point it broke. Now it will import 'forever'.]

But, the bottom line: how does this perform? Well, on a standard MacBook Pro with 4 GB RAM - and with a default, local installation of Postgres - I imported 379,990 records (the PubMed 'immune' search) into my [completely vanilla DSpace trunk] repository. A repository that already contained 199,492 items - at the end of the import run, the repository was going to contain 579,392 items.

So how long did it take? 10 hours. Or, more precisely, 36154061 milliseconds.
That means it was averaging over 10 items every second - or about 95 milliseconds per item. An average doesn't quite tell the whole story, though: each individual item was actually processed a bit faster than that, but the process has to pause periodically to a) flush the Lucene indexing tasks, and b) merge the Lucene records (which appears to happen every 5 commits - technically, it doesn't have to pause for this, as the merge operates on a background thread, but the import is so damn fast that it has a queue of tasks to write out before the merge is complete, and has to wait for the IndexWriter to be able to lock the directory).

But note that, apart from the Lucene indexing queue, none of the changes were targeted specifically at batch processing operations. So that sub-100 msec time to create a new Item applies equally to creating records in the web interface. Updating records in the web interface. Submitting items via SWORD.

I'll leave you with one thought / notable issue that exists in the web application if you try to use a repository with hundreds of thousands of items. Although Postgres will (if it has enough memory) use index scans to walk through the browse table, paging to the end of the index is quite slow. In the immediate term, it's quicker to invert the order of the sort, so you are paging near the start of the order. But you can speed it up further by, instead of using 'jump to' a point, filtering to records that begin with those characters (so clicking on 'z' would only show items where the sort value is LIKE 'z%', rather than OFFSETting to that point in the index). That kind of filtering is really fast and efficient, even with hundreds of thousands of records. So if you need to tweak the front-end performance before anyone else gets a chance - you know where to start.

G
