On Thu, Jun 06, 2013 at 10:48:32AM -0400, Joe Thornton wrote:
> Thanks very much to everyone who replied to my plea for help.
>
> Elliot -- Thanks for pointing me to the right list.
>
> Jason: I'm just kicking the tires. I was sure we'd need a much bigger
> server if we decide to use Evergreen in production, but I thought 16GB of
> RAM and 200GB of disk would be okay for testing. And at this point it's
> just 160k bib records. I haven't even gotten to items or patrons yet.
> Thanks for sharing your git project. I'm not a git user but I'll give it a
> try if I don't find a more familiar solution.
>
> Dan: Thanks -- I'll probably give your suggestion a try, but 550,000
> records 1,000 at a time will certainly take a while.

I think it will take a lot less time than importing them all in one shot.
Just generate an SQL file that imports the records in blocks of 1,000,
each block in its own transaction. The problem with one huge transaction
is that the database can get pretty bloated; smaller transactions might
give autovacuum a chance to run in parallel with the import.

Also, have you tweaked the database configuration at all for things like
maintenance_work_mem and work_mem?
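
Something like this might do it -- a rough, untested sketch, not the stock
Evergreen loader. It assumes the generated bibs.sql has one INSERT
statement per line, and the file names, memory values, and the evergreen
user/database are all placeholders:

    #!/bin/sh
    # Untested sketch: wrap every 1,000 statements of bibs.sql in their
    # own transaction, and raise the memory settings for this session
    # only, so postgresql.conf stays untouched.
    {
        echo "SET work_mem TO '256MB';"
        echo "SET maintenance_work_mem TO '1GB';"
        awk 'NR % 1000 == 1 { print "BEGIN;" }
             { print }
             NR % 1000 == 0 { print "COMMIT;" }
             END { if (NR % 1000) print "COMMIT;" }' bibs.sql
    } > bibs-chunked.sql

    psql -U evergreen -d evergreen -f bibs-chunked.sql

Each block commits on its own, so no single transaction bloats the
database and autovacuum can keep up. Server-wide settings (shared_buffers
and friends) are still worth a look in postgresql.conf, though.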

> Three things:
>
> -- The process I started three days ago to import 160,000 records using
> the method on the Evergreen site is still running.

Yeah. That's ridiculous and not representative of a normal Evergreen
experience. We imported our ~2 million records more than 4 years ago in
less than a weekend. Things have changed since then: ingest now happens
entirely in-database instead of outside of it, which might introduce some
more overhead; but PostgreSQL has also gotten faster, and there's certainly
much better hardware available now...

> -- Maybe an unfair comparison, but we use VuFind as an alternative
> interface to Horizon, and a full import of all 550k records takes about 45
> minutes.

To some extent that's Solr vs. PostgreSQL, and in some ways it's not so
much PostgreSQL's fault as it is that we do an awful lot to shred each
record out into potentially hundreds of rows across many tables, all of
which are indexed. For example, you can search by specific field/subfield
in Evergreen, which adds 25-125 rows per record (an average of about 40
rows per record in our sample data). VuFind doesn't have to meet the same
requirements as Evergreen does for searching records; but yeah, Solr is
also just plain fast.

> -- It's surprising to me that there isn't a faster method. We're looking
> seriously at Evergreen as a replacement for Horizon, but this would be a
> problem. I'll try Dan's and then Jason's methods (again, thank you very
> much) and hope that they're significantly faster. If I had the time and
> ability (unfortunately I have neither) I'd take a shot at it myself.

Parallelizing the import, as Jason suggested, certainly can help. I haven't
tried this for quite a while, but a really simplistic way of parallelizing
the import would be to take that script with its many 1,000-record chunks,
split it into n-1 separate scripts (where "n" is the number of cores on
your database server -- again, credit due to Jason), and then run each
script from a separate psql session. You might run into deadlocks this way,
but it's worth a shot.
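
For what it's worth, here's a rough, untested sketch of that n-1 split on
a hypothetical 4-core server (so three sessions). It reuses the
bibs-chunked.sql file from the earlier sketch, and the part-*.sql names
and the evergreen user/database are again just placeholders:

    #!/bin/sh
    # Untested sketch: deal the 1,000-record transactions from
    # bibs-chunked.sql out round-robin into three scripts, then run
    # each script in its own backgrounded psql session.
    for i in 0 1 2; do
        # give each script the same session-level memory setting
        echo "SET maintenance_work_mem TO '1GB';" > part-$i.sql
    done

    # The SET lines at the top of bibs-chunked.sql sit before the first
    # BEGIN; and are deliberately skipped (f is unset until then).
    awk -v n=3 '
        /^BEGIN;/ { f = sprintf("part-%d.sql", c++ % n) }  # next transaction -> next script
        f         { print >> f }                           # copy it through its COMMIT;
    ' bibs-chunked.sql

    for f in part-*.sql; do
        psql -U evergreen -d evergreen -f "$f" &
    done
    wait    # return only once all three sessions have finished

Since each 1,000-record block is its own transaction, a deadlock only
rolls back that one block, and you can re-run just the block that failed.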