On Monday, October 7, 2019 at 3:55:17 PM UTC-4, Noel Grandin wrote:
>
> Also, looking at your schema, and your saying that you are indexing the
> tables, you may want to consider doing
> CREATE TABLE reference (docid varchar NOT NULL,
> refid varchar NOT NULL );
> CREATE TABLE reference_reverse (refid varchar NOT NULL,
> docid varchar NOT NULL );
> and only having a primary key index on each one.
>
> This will make certain lookups faster, since the required data will be
> grouped closer together.
>
>
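If I'm reading that right, the win is that each lookup direction only needs to walk its own composite primary key, so the matching rows sit next to each other. Something like this is what I'd expect to use (a quick sketch with the table names from your example; the variable names are just placeholders):
// Forward direction: what does this document reference?
// Assuming the primary key is the composite (docid, refid).
PreparedStatement refsOf = gConn.prepareStatement(
    "SELECT refid FROM reference WHERE docid = ?");
// Reverse direction: what references this document?
// Same idea against reference_reverse and its (refid, docid) key.
PreparedStatement referrersOf = gConn.prepareStatement(
    "SELECT docid FROM reference_reverse WHERE refid = ?");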
I'll be sure to experiment with that when I can actually get a database
built. In the meantime, I'm still running into the memory overhead limit
issue. Here's my current code:
Statement docType = gConn.createStatement();
docType.execute("ALTER TABLE doctype ADD PRIMARY KEY (abbrev);");
docType.close();

Statement refDoc = gConn.createStatement();
// Staging table: one row per document ID, plus a flag marking whether
// that document has already been added to the Lucene index.
refDoc.execute(
    "CREATE TABLE indexDoc ("
    + " docid varchar,"
    + " title varchar,"
    + " content varchar,"
    + " sourceTitle varchar,"
    + " startPage varchar,"
    + " endPage varchar,"
    + " year varchar,"
    + " indexed bit NOT NULL DEFAULT 0 );");
// Documents already in the document table are marked as indexed.
refDoc.executeUpdate(
    "INSERT INTO indexDoc (docid, indexed) ( SELECT docid, 1 FROM document );");
// Referenced documents are collapsed to one row per refid and left unindexed.
refDoc.executeUpdate(
    "INSERT INTO indexDoc (docid, title, content, sourceTitle, startPage, endPage, year) ("
    + " SELECT refid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
    + " FROM refdoc"
    + " GROUP BY refid );");
// Feed everything that hasn't been indexed yet to Lucene.
ResultSet refDocs = refDoc.executeQuery(
    "SELECT docid, MAX(title), MAX(content)"
    + " FROM indexDoc"
    + " GROUP BY docid"
    + " HAVING MAX(indexed) = 0;");
while (refDocs.next())
{
    Document doc = new Document();
    doc.add(new Field("ID", refDocs.getString(1), FieldFactory.field(false)));
    doc.add(new Field("title", refDocs.getString(2), FieldFactory.field(false)));
    doc.add(new Field("content", refDocs.getString(3), FieldFactory.field(true)));
    writer.addDocument(doc);
}
refDocs.close();
// This is the statement that hits the memory limit.
refDoc.executeUpdate(
    "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) ("
    + " SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
    + " FROM indexDoc"
    + " GROUP BY docid"
    + " HAVING MAX(indexed) = 0 );");
refDoc.execute("DROP TABLE indexDoc;");
refDoc.execute("DROP TABLE refdoc;");
refDoc.close();

Statement doc = gConn.createStatement();
doc.execute("ALTER TABLE document ADD PRIMARY KEY (docid);");
doc.execute("ALTER TABLE document ADD FOREIGN KEY (docType) REFERENCES doctype(abbrev);");
doc.close();

Statement ref = gConn.createStatement();
ref.execute("CREATE TABLE reference AS ( SELECT DISTINCT docid, refid FROM tempref );");
ref.execute("DROP TABLE tempref;");
ref.execute("ALTER TABLE reference MODIFY COLUMN docid varchar NOT NULL;");
ref.execute("ALTER TABLE reference MODIFY COLUMN refid varchar NOT NULL;");
ref.execute("ALTER TABLE reference ADD PRIMARY KEY (docid, refid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (docid) REFERENCES document(docid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (refid) REFERENCES document(docid);");
ref.close();
As you can see, I'm also building a Lucene index alongside the database. To keep articles from being added to the index more than once, I use the database to build a list of all articles by their unique IDs and to track which ones have already been indexed. For some reason, this query works just fine:
SELECT docid, MAX(title), MAX(content)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0;
but this query fails due to a memory limit:
INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) (
SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0 );
My suspicion is that nesting a SELECT inside another statement like this is what causes these errors, and that I will instead need to retrieve a ResultSet and use a PreparedStatement to batch-INSERT into the tables in chunks of ~10,000 rows or so.
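If it comes to that, here is roughly what I have in mind (just a sketch I haven't run yet, reusing gConn and the indexDoc staging table from the code above; the batch size and variable names are placeholders):
// Read the not-yet-indexed rows and insert them into document in chunks,
// so no single statement has to materialize the whole grouped result.
final int BATCH_SIZE = 10_000; // rough chunk size, still to be tuned

try (Statement sel = gConn.createStatement();
     ResultSet rs = sel.executeQuery(
         "SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
         + " FROM indexDoc"
         + " GROUP BY docid"
         + " HAVING MAX(indexed) = 0;");
     PreparedStatement ins = gConn.prepareStatement(
         "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year)"
         + " VALUES (?, ?, ?, ?, ?, ?)"))
{
    int pending = 0;
    while (rs.next())
    {
        // Columns come back in the same order the INSERT expects them.
        for (int i = 1; i <= 6; i++)
            ins.setString(i, rs.getString(i));
        ins.addBatch();
        if (++pending == BATCH_SIZE)
        {
            ins.executeBatch(); // flush this chunk
            pending = 0;
        }
    }
    if (pending > 0)
        ins.executeBatch(); // flush the remainder
}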