I'm trying to develop an application that parses scientific journal publications from a few sources and collects them in a single database. The problem is that there are just too damn many of them, which creates a catch-22: either I need literally years to build the database, or I can build it in a few days with no structure, constraints, or indexing, and then run out of memory when I attempt to convert the database to a useful state.
For clarity and simplicity, take as an example a subset of the Scopus publications, 10,000 articles from each of the years 1996-2018 (230,000 articles total), with table definitions that drop the majority of the metadata fields. If I apply constraints to the tables during initialization, the schema looks like this:

    CREATE TABLE document (
        docid varchar NOT NULL,
        title varchar,
        abstract varchar NULL,
        PRIMARY KEY (docid)
    );
    CREATE TABLE reference (
        docid varchar NOT NULL,
        refid varchar NOT NULL,
        PRIMARY KEY (docid, refid),
        FOREIGN KEY (docid) REFERENCES document(docid),
        FOREIGN KEY (refid) REFERENCES document(docid)
    );

I then use INSERT ... ON DUPLICATE KEY UPDATE to add a row to the document table for each parsed article, and INSERT IGNORE to add a row for each citation of another article. The problem with this approach is that, since the tables are indexed, the insert time appears to grow quadratically with the size of the table. As a result I can handle 230,000 articles in about 2 hours, but the full 46.7 million will take at least 300 days.

To defer the application of constraints until parsing is fully complete, the schema becomes simply:

    CREATE TABLE document (
        docid varchar NOT NULL,
        title varchar,
        abstract varchar NULL
    );
    CREATE TABLE tempdoc (
        docid varchar NOT NULL,
        title varchar
    );
    CREATE TABLE reference (
        docid varchar NOT NULL,
        refid varchar NOT NULL
    );
    CREATE TABLE tempref (
        docid varchar NOT NULL,
        refid varchar NOT NULL
    );

Note the inclusion of the temporary tables, which are likely to contain duplicate keys. I perform all full-article INSERTs on the document table, because those articles are guaranteed to be unique in the data, and all cited-article INSERTs on the tempdoc table.
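For what it's worth, the deferred-constraint idea above is usually framed as "load first, index later": bulk-insert into bare tables inside one transaction, then pay the indexing cost once over the finished table. Here's a minimal sketch of that pattern using Python's bundled sqlite3 as a portable stand-in for H2 (an assumption on my part; in H2 itself the last step would be CREATE INDEX or ALTER TABLE ... ADD CONSTRAINT after the load):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Bare table: no primary key, no indexes, so every insert is append-only.
cur.execute(
    "CREATE TABLE document (docid TEXT NOT NULL, title TEXT, abstract TEXT)"
)

# Bulk load with executemany inside a single transaction.
rows = [(f"doc{i}", f"Title {i}", None) for i in range(10_000)]
cur.executemany("INSERT INTO document VALUES (?, ?, ?)", rows)
conn.commit()

# Only now pay the indexing cost, once, over the whole table.
# (This also verifies uniqueness: it fails if duplicate docids slipped in.)
cur.execute("CREATE UNIQUE INDEX pk_document ON document(docid)")
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM document").fetchone()[0])  # 10000
```

The key point is that the unique index is built in one pass at the end instead of being maintained incrementally across millions of inserts.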
After parsing, I use the following queries to retrieve all rows of the tempdoc table that aren't already in document, and then all rows of the tempref table that are unique and valid:

    SELECT docid, MAX(title)
    FROM tempdoc
    WHERE docid NOT IN (SELECT docid FROM document)
    GROUP BY docid;

    SELECT DISTINCT docid, refid
    FROM tempref
    WHERE refid IN (SELECT docid FROM document);

The trouble with this is that the INSERTs into document following the first query multiply its size by a factor of about 20, so by the time I get to the SELECT from tempref, the inner SELECT has become too large. With a subset of just 230,000 articles, the inclusion of the referenced articles adds another 4.52 million rows. If I bump it up to a larger subset of 2.3 million articles, the cited articles add around 43 million additional rows, and my machine's 64 GB of RAM is no match for the task. The full data set of 46.7 million articles will create at least 900 million additional rows, and I don't think there's enough RAM in the world to handle that.

So my question is: is there a middle ground where I can build this database in a reasonable amount of time and without suplexing my RAM? I've already tried converting the String article identifiers to long, which costs performance and doesn't solve the out-of-memory error. I've also tried combining the SELECT and INSERT queries into a single operation to let the database handle all of it on the back end, which is marginally faster but doesn't solve the out-of-memory error. I most recently tried simply SELECTing the entire tempref table and performing a single SELECT COUNT query per row to check for existence in the reference table - by my math, that solution is expected to finish processing the full 46.7 million articles sometime around 2700 AD.

--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
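One direction worth testing against the queries above: keep the whole dedup/validation step inside the engine as INSERT ... SELECT statements, and express the membership checks with NOT EXISTS / EXISTS rather than NOT IN on a multi-million-row subquery, since correlated EXISTS probes an index row by row instead of materializing the full id set. A sketch of that shape, again using sqlite3 in place of H2 and mirroring the post's table names (whether H2's planner benefits equally is an assumption to verify):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE document (docid TEXT NOT NULL, title TEXT, abstract TEXT);
CREATE TABLE tempdoc  (docid TEXT NOT NULL, title TEXT);
CREATE TABLE tempref  (docid TEXT NOT NULL, refid TEXT NOT NULL);
""")

# Parsed articles go straight into document; cited articles (with
# duplicates) land in tempdoc, and raw citation pairs in tempref.
cur.execute("INSERT INTO document VALUES ('A', 'Article A', NULL)")
cur.executemany("INSERT INTO tempdoc VALUES (?, ?)",
                [("B", "Article B"), ("B", "Article B"), ("C", "Article C")])
cur.executemany("INSERT INTO tempref VALUES (?, ?)",
                [("A", "B"), ("A", "B"), ("A", "C"), ("A", "Z")])

# Index the probe column first so each EXISTS check is a lookup, not a scan.
cur.execute("CREATE UNIQUE INDEX idx_document ON document(docid)")

# Fold cited-only articles into document entirely server-side:
# no id list is ever pulled into application memory.
cur.execute("""
INSERT INTO document (docid, title, abstract)
SELECT t.docid, MAX(t.title), NULL
FROM tempdoc t
WHERE NOT EXISTS (SELECT 1 FROM document d WHERE d.docid = t.docid)
GROUP BY t.docid
""")

# Deduplicated, validated references as one set-based statement.
cur.execute("""
CREATE TABLE reference AS
SELECT DISTINCT r.docid, r.refid
FROM tempref r
WHERE EXISTS (SELECT 1 FROM document d WHERE d.docid = r.refid)
""")

print(cur.execute("SELECT COUNT(*) FROM document").fetchone()[0])   # 3
print(cur.execute("SELECT COUNT(*) FROM reference").fetchone()[0])  # 2
```

Here the dangling citation (A, Z) is dropped because Z never appears in document, and the duplicate (A, B) pair collapses under DISTINCT - all without a round-trip per row.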