On Monday, October 7, 2019 at 3:55:17 PM UTC-4, Noel Grandin wrote:
>
> Also, looking at your schema, and your saying that you are indexing the 
> tables, you may want to consider doing
> CREATE TABLE reference (docid varchar NOT NULL,
>                         refid varchar NOT NULL );
> CREATE TABLE reference_reverse (refid varchar NOT NULL,
>                         docid varchar NOT NULL );
> and only having a primary key index on each one.
>
> This will make certain lookups faster, since the required data will be 
> grouped closer together.
>
>
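If I understand the suggestion, the idea is that the primary key determines how the rows are grouped on disk, so each table gets a composite key ordered to match its lookup direction. Something like this, I assume (the key column order is my guess):

CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL,
                        PRIMARY KEY (docid, refid) );
CREATE TABLE reference_reverse (refid varchar NOT NULL,
                        docid varchar NOT NULL,
                        PRIMARY KEY (refid, docid) );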
I'll be sure to experiment with that when I can actually get a database 
built.  In the meantime, I'm still running into the memory overhead limit 
issue.  Here's my current code:
Statement docType = gConn.createStatement();
docType.execute("ALTER TABLE doctype ADD PRIMARY KEY (abbrev);");
docType.close();

// Staging table that merges the referenced documents and tracks which
// ones have already been added to the Lucene index.
Statement refDoc = gConn.createStatement();
refDoc.execute(
        "CREATE TABLE indexDoc ("
                + " docid varchar,"
                + " title varchar,"
                + " content varchar,"
                + " sourceTitle varchar,"
                + " startPage varchar,"
                + " endPage varchar,"
                + " year varchar,"
                + " indexed bit NOT NULL DEFAULT 0 );");

// Documents already in the database are marked as indexed.
refDoc.executeUpdate("INSERT INTO indexDoc (docid, indexed)"
        + " ( SELECT docid, 1 FROM document );");

// Collapse the referenced documents down to one row per refid.
refDoc.executeUpdate(
        "INSERT INTO indexDoc (docid, title, content, sourceTitle, startPage, endPage, year) ("
                + " SELECT refid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM refdoc"
                + " GROUP BY refid );");

// Feed everything that hasn't been indexed yet to Lucene.
ResultSet refDocs = refDoc.executeQuery("SELECT docid, MAX(title), MAX(content)"
        + " FROM indexDoc"
        + " GROUP BY docid"
        + " HAVING MAX(indexed) = 0;");

while (refDocs.next())
{
    Document doc = new Document();
    doc.add(new Field("ID", refDocs.getString(1), FieldFactory.field(false)));
    doc.add(new Field("title", refDocs.getString(2), FieldFactory.field(false)));
    doc.add(new Field("content", refDocs.getString(3), FieldFactory.field(true)));
    writer.addDocument(doc);
}
refDocs.close();

// Copy the newly discovered documents into the document table.
refDoc.executeUpdate(
        "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) ("
                + " SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM indexDoc"
                + " GROUP BY docid"
                + " HAVING MAX(indexed) = 0 );");

refDoc.execute("DROP TABLE indexDoc;");
refDoc.execute("DROP TABLE refdoc;");

refDoc.close();

Statement doc = gConn.createStatement();
doc.execute("ALTER TABLE document ADD PRIMARY KEY (docid);");
doc.execute("ALTER TABLE document ADD FOREIGN KEY (docType)"
        + " REFERENCES doctype(abbrev);");
doc.close();

// Deduplicate the raw reference pairs, then constrain the final table.
Statement ref = gConn.createStatement();
ref.execute("CREATE TABLE reference AS"
        + " ( SELECT DISTINCT docid, refid FROM tempref );");

ref.execute("DROP TABLE tempref;");

ref.execute("ALTER TABLE reference MODIFY COLUMN docid varchar NOT NULL;");
ref.execute("ALTER TABLE reference MODIFY COLUMN refid varchar NOT NULL;");
ref.execute("ALTER TABLE reference ADD PRIMARY KEY (docid, refid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (docid)"
        + " REFERENCES document(docid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (refid)"
        + " REFERENCES document(docid);");
ref.close();

As you can see, I'm also building a Lucene index alongside the database. To prevent articles from being added to the index more than once, I have to use the database to build a list of all articles by their unique IDs and keep track of which ones have already been indexed.  For some reason, this query works just fine:
SELECT docid, MAX(title), MAX(content)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0;

but this query fails due to a memory limit:
INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) (
SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), 
MAX(year)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0 );

My suspicion is that nesting a SELECT inside any other statement is going to cause these kinds of errors, and that I will instead need to retrieve a ResultSet and use a PreparedStatement to batch-INSERT into the tables in chunks of ~10,000 rows or so.
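
Something like this is what I have in mind. This is only a sketch, reusing the connection, table, and column names from above; the 10,000 batch size is arbitrary, and I'm assuming the INSERT ... SELECT is what's hitting the memory limit:

// Read the grouped rows with a plain Statement, then re-insert them
// through a PreparedStatement in bounded chunks (java.sql types;
// gConn is the open connection from above).
PreparedStatement insert = gConn.prepareStatement(
        "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year)"
                + " VALUES (?, ?, ?, ?, ?, ?);");
Statement select = gConn.createStatement();
ResultSet rows = select.executeQuery(
        "SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM indexDoc"
                + " GROUP BY docid"
                + " HAVING MAX(indexed) = 0;");
int pending = 0;
while (rows.next())
{
    for (int i = 1; i <= 6; i++)
        insert.setString(i, rows.getString(i));
    insert.addBatch();
    if (++pending == 10000)
    {
        insert.executeBatch();  // flush one chunk, keeping memory bounded
        pending = 0;
    }
}
if (pending > 0)
    insert.executeBatch();      // flush the remainder
rows.close();
select.close();
insert.close();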
