I'm trying to develop an application that parses scientific journal 
publications from a few sources and collects them in a single database.  
The problem is that there are just too damn many of them, which creates a 
catch-22: either I keep the schema fully constrained and need literally 
years to build the database, or I build it in a few days with no structure, 
constraints, or indexing and then run out of memory when I try to convert 
it into a useful state.

For clarity and simplicity, we'll take the example of parsing a subset of 
the Scopus publications, 10,000 articles each from the years 1996-2018 
(230,000 total articles), and use table definitions that drop the majority 
of the metadata fields.

If I apply constraints to the tables during initialization, the schema 
looks like this:
CREATE TABLE document (docid varchar NOT NULL,
                       title varchar,
                       abstract varchar NULL,
                       PRIMARY KEY (docid) );

CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL,
                        PRIMARY KEY (docid, refid),
                        FOREIGN KEY (docid) REFERENCES document(docid),
                        FOREIGN KEY (refid) REFERENCES document(docid) );

I then use INSERT ... ON DUPLICATE KEY UPDATE to add a row to the document 
table for each parsed article, and INSERT IGNORE to add a row for each 
article it cites.
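
For concreteness, the per-article statements look roughly like the sketch 
below.  The docid values and titles are placeholders, and the MySQL-style 
upsert syntax assumes the database is running in a compatibility mode that 
accepts it, so treat this as an illustration of the approach rather than 
the exact code:

-- Upsert the parsed article itself.
INSERT INTO document (docid, title, abstract)
       VALUES ('doc-1', 'Parsed title', 'Parsed abstract')
       ON DUPLICATE KEY UPDATE title = 'Parsed title', abstract = 'Parsed abstract';

-- Add a stub row for each cited article, skipping ids that already exist.
INSERT IGNORE INTO document (docid, title) VALUES ('doc-2', 'Cited title');

-- Record the citation pair (here I assume the pair insert also ignores duplicates).
INSERT IGNORE INTO reference (docid, refid) VALUES ('doc-1', 'doc-2');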

The problem with this approach is that, since the tables are indexed, the 
insert time grows quadratically with the size of the table.  As a result, I 
can handle 230,000 articles in about 2 hours, but the full 46.7 million 
would take at least 300 days.

To defer the constraints until parsing is fully complete, the schema 
becomes simply:
CREATE TABLE document (docid varchar NOT NULL,
                       title varchar,
                       abstract varchar NULL );

CREATE TABLE tempdoc (docid varchar NOT NULL,
                      title varchar );

CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL );


CREATE TABLE tempref (docid varchar NOT NULL,
                      refid varchar NOT NULL );

Note the inclusion of the temporary tables, which are likely to contain 
duplicate keys.  I perform all full-article INSERTs on the document table, 
because full articles are guaranteed to be unique in the data, and all 
cited-article INSERTs on the tempdoc table.
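
With the constraints deferred, the parse-time statements are just plain 
INSERTs with no key checks at all; a minimal sketch with placeholder values:

-- The full parsed article goes straight into document.
INSERT INTO document (docid, title, abstract) VALUES ('doc-1', 'Parsed title', 'Parsed abstract');

-- Each cited article goes into tempdoc, duplicates and all.
INSERT INTO tempdoc (docid, title) VALUES ('doc-2', 'Cited title');

-- Each citation pair goes into tempref, duplicates and all.
INSERT INTO tempref (docid, refid) VALUES ('doc-1', 'doc-2');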

After parsing, I use the following queries to retrieve all rows of the 
tempdoc table that aren't already in document, and then all rows of the 
tempref table that are unique and valid:
SELECT docid, MAX(title)
FROM tempdoc
WHERE docid NOT IN (SELECT docid FROM document)
GROUP BY docid;

SELECT DISTINCT docid, refid
FROM tempref
WHERE refid IN (SELECT docid FROM document);
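
The rows those queries return are then inserted into document and 
reference.  Folding the SELECT directly into the INSERT, which is the 
combined variant I mention below, looks roughly like this sketch:

-- Promote cited articles that aren't already documents.
INSERT INTO document (docid, title)
       SELECT docid, MAX(title)
       FROM tempdoc
       WHERE docid NOT IN (SELECT docid FROM document)
       GROUP BY docid;

-- Keep only distinct citation pairs whose target document now exists.
INSERT INTO reference (docid, refid)
       SELECT DISTINCT docid, refid
       FROM tempref
       WHERE refid IN (SELECT docid FROM document);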

The trouble with this is that the INSERTs into document following the first 
query multiply the table's size by a factor of about 20, so by the time I 
get to the SELECT from tempref the inner SELECT on document has become too 
large.  With a subset of just 230,000 articles, including the referenced 
articles adds another 4.52 million rows.  If I bump it up to a larger 
subset of 2.3 million articles, the cited articles add around 43 million 
additional rows, and my machine's 64 GB of RAM is no match for the task.  
The full data set of 46.7 million articles will create at least 900 million 
additional rows, and I don't think there's enough RAM in the world to 
handle that.

So my question is: is there a middle ground where I can build this database 
in a reasonable amount of time and without suplexing my RAM?  I've already 
tried converting the String article identifiers to long, which costs 
performance and doesn't solve the out-of-memory error.  I've also tried 
combining the SELECT and INSERT queries into a single operation so the 
database handles everything on the back end, which is marginally faster but 
doesn't solve the out-of-memory error either.  Most recently I tried simply 
SELECTing the entire tempref table and issuing a single SELECT COUNT query 
per row to check for existence in the reference table; by my math that 
solution should finish processing the full 46.7 million articles sometime 
around 2700 AD.
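
To spell out that last attempt: each pair pulled from tempref gets its own 
round-trip existence check, roughly like this (the literals stand in for 
the current row's values):

-- One query per citation pair; the row is inserted only if the count comes back 0.
SELECT COUNT(*) FROM reference WHERE docid = 'doc-1' AND refid = 'doc-2';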
