I'm trying to develop an application that parses scientific journal
publications from a few sources and collects them in a single database.
The problem is that there are just too damn many of them, which creates a
catch-22: either I need literally years to build the database with the
constraints in place, or I can build it in a few days with no structure,
constraints, or indexing, and then run out of memory when I try to convert
it into a useful state.
For clarity and simplicity, we'll take the example of parsing a subset of
the Scopus publications, 10,000 articles each from the years 1996-2018
(230,000 total articles), and use table definitions that drop the majority
of the metadata fields.
If I apply constraints to the tables during initialization, the schema
looks like this:
CREATE TABLE document (
    docid    varchar NOT NULL,
    title    varchar,
    abstract varchar NULL,
    PRIMARY KEY (docid)
);

CREATE TABLE reference (
    docid varchar NOT NULL,
    refid varchar NOT NULL,
    PRIMARY KEY (docid, refid),
    FOREIGN KEY (docid) REFERENCES document(docid),
    FOREIGN KEY (refid) REFERENCES document(docid)
);
I then use INSERT ... ON DUPLICATE KEY UPDATE to add a row to the document
table for each fully parsed article, and INSERT IGNORE to add a row for
each article it cites.
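For concreteness, those two statements have roughly this shape (simplified;
the real ones are parameterized by the parser):

-- Fully parsed article: insert the record, or update it if the docid exists.
INSERT INTO document (docid, title, abstract)
    VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE title = ?, abstract = ?;

-- Cited article: insert a bare row only if the docid isn't present yet.
INSERT IGNORE INTO document (docid, title)
    VALUES (?, ?);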
The problem with this approach is that, since the tables are indexed, the
insert time appears to grow quadratically with the size of the table. As a
result I can handle 230,000 articles in about 2 hours, but at that rate the
full 46.7 million will take at least 300 days.
In order to defer the application of constraints until after I fully
complete parsing, the schema becomes simply:
CREATE TABLE document (
    docid    varchar NOT NULL,
    title    varchar,
    abstract varchar NULL
);

CREATE TABLE tempdoc (
    docid varchar NOT NULL,
    title varchar
);

CREATE TABLE reference (
    docid varchar NOT NULL,
    refid varchar NOT NULL
);

CREATE TABLE tempref (
    docid varchar NOT NULL,
    refid varchar NOT NULL
);
Note the inclusion of the temporary tables, which are likely to contain
duplicate keys. All full-article INSERTs go to the document table, because
those are guaranteed to be unique in the data; all cited-article INSERTs go
to the tempdoc table, and the citation pairs themselves go to tempref.
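During parsing, every insert is then a plain append with no key checks;
assuming the layout above, they look roughly like:

-- Fully parsed article goes straight into document:
INSERT INTO document (docid, title, abstract) VALUES (?, ?, ?);

-- Cited articles and citation pairs go into the temp tables, duplicates and all:
INSERT INTO tempdoc (docid, title) VALUES (?, ?);
INSERT INTO tempref (docid, refid) VALUES (?, ?);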
After parsing, I use the following queries to retrieve all rows of the
tempdoc table that aren't already in document, and then all rows of the
tempref table that are unique and valid:
SELECT docid, MAX(title)
    FROM tempdoc
    WHERE docid NOT IN (SELECT docid FROM document)
    GROUP BY docid;

SELECT DISTINCT docid, refid
    FROM tempref
    WHERE refid IN (SELECT docid FROM document);
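The rows returned by the first query are then INSERTed into document, and
those from the second into reference, with plain batched inserts from the
application side, roughly:

-- Deduplicated cited documents returned by the first query:
INSERT INTO document (docid, title) VALUES (?, ?);

-- Validated, deduplicated citation pairs returned by the second query:
INSERT INTO reference (docid, refid) VALUES (?, ?);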
The trouble with this is that the INSERTs into document following the first
query multiply its size by a factor of about 20, so by the time I get to
the SELECT from tempref, the inner SELECT has become too large. With a
subset of just 230,000 articles, the referenced articles add another 4.52
million rows. If I bump it up to a larger subset of 2.3 million articles,
the cited articles add around 43 million additional rows, and my machine's
64 GB of RAM is no match for the task.
The full data set of 46.7 million articles will create at least 900 million
additional rows, and I don't think there's enough RAM in the world to
handle that.
So my question is: is there a middle ground where I can build this database
in a reasonable amount of time and without suplexing my RAM? I've already
tried converting the String article identifiers to long, which costs
performance and doesn't solve the out-of-memory error. I've also tried
combining the SELECT and INSERT queries into a single operation (sketched
below) to let the database handle it all on the back end, which is
marginally faster but doesn't solve the out-of-memory error. Most recently
I tried simply SELECTing the entire tempref table and issuing a single
SELECT COUNT query per row to check for existence in the reference table -
by my math that approach should finish processing the full 46.7 million
articles sometime around 2700 AD.
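For reference, the combined form I mean is essentially an INSERT ... SELECT
(a sketch of the idea rather than my exact statements):

INSERT INTO document (docid, title)
    SELECT docid, MAX(title)
        FROM tempdoc
        WHERE docid NOT IN (SELECT docid FROM document)
        GROUP BY docid;

INSERT INTO reference (docid, refid)
    SELECT DISTINCT docid, refid
        FROM tempref
        WHERE refid IN (SELECT docid FROM document);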