On Monday, October 7, 2019 at 3:55:17 PM UTC-4, Noel Grandin wrote:
>
> Also, looking at your schema, and your saying that you are indexing the
> tables, you may want to consider doing
> CREATE TABLE reference (docid varchar NOT NULL,
> refid varchar NOT NULL );
> CREATE TABLE reference_reverse (refid varchar NOT NULL,
> docid varchar NOT NULL );
> and only having a primary key index on each one.
>
> This will make certain lookups faster, since the required data will be
> grouped closer together.
>
>
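If I'm reading that right, the win is that each lookup direction only needs to walk its own composite primary key, so the matching rows sit next to each other. Something like this is what I'd expect to use (a quick sketch with the table names from your example; the variable names are just placeholders):
// Forward direction: what does this document reference?
// Assuming the primary key is the composite (docid, refid).
PreparedStatement refsOf = gConn.prepareStatement(
    "SELECT refid FROM reference WHERE docid = ?");
// Reverse direction: what references this document?
// Same idea against reference_reverse and its (refid, docid) key.
PreparedStatement referrersOf = gConn.prepareStatement(
    "SELECT docid FROM reference_reverse WHERE refid = ?");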
I'll be sure to experiment with that when I can actually get a database
built. In the meantime, I'm still running into the memory overhead limit
issue. Here's my current code:
Statement docType = gConn.createStatement();
docType.execute("ALTER TABLE doctype ADD PRIMARY KEY (abbrev);");
docType.close();

Statement refDoc = gConn.createStatement();
// Staging table: one row per document ID, plus a flag marking whether
// that document has already been added to the Lucene index.
refDoc.execute(
    "CREATE TABLE indexDoc ("
    + " docid varchar,"
    + " title varchar,"
    + " content varchar,"
    + " sourceTitle varchar,"
    + " startPage varchar,"
    + " endPage varchar,"
    + " year varchar,"
    + " indexed bit NOT NULL DEFAULT 0 );");
// Documents already in the document table are marked as indexed.
refDoc.executeUpdate(
    "INSERT INTO indexDoc (docid, indexed) ( SELECT docid, 1 FROM document );");
// Referenced documents are collapsed to one row per refid and left unindexed.
refDoc.executeUpdate(
    "INSERT INTO indexDoc (docid, title, content, sourceTitle, startPage, endPage, year) ("
    + " SELECT refid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
    + " FROM refdoc"
    + " GROUP BY refid );");
// Feed everything that hasn't been indexed yet to Lucene.
ResultSet refDocs = refDoc.executeQuery(
    "SELECT docid, MAX(title), MAX(content)"
    + " FROM indexDoc"
    + " GROUP BY docid"
    + " HAVING MAX(indexed) = 0;");
while (refDocs.next())
{
    Document doc = new Document();
    doc.add(new Field("ID", refDocs.getString(1), FieldFactory.field(false)));
    doc.add(new Field("title", refDocs.getString(2), FieldFactory.field(false)));
    doc.add(new Field("content", refDocs.getString(3), FieldFactory.field(true)));
    writer.addDocument(doc);
}
refDocs.close();
// This is the statement that hits the memory limit.
refDoc.executeUpdate(
    "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) ("
    + " SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
    + " FROM indexDoc"
    + " GROUP BY docid"
    + " HAVING MAX(indexed) = 0 );");
refDoc.execute("DROP TABLE indexDoc;");
refDoc.execute("DROP TABLE refdoc;");
refDoc.close();

Statement doc = gConn.createStatement();
doc.execute("ALTER TABLE document ADD PRIMARY KEY (docid);");
doc.execute("ALTER TABLE document ADD FOREIGN KEY (docType) REFERENCES doctype(abbrev);");
doc.close();

Statement ref = gConn.createStatement();
ref.execute("CREATE TABLE reference AS ( SELECT DISTINCT docid, refid FROM tempref );");
ref.execute("DROP TABLE tempref;");
ref.execute("ALTER TABLE reference MODIFY COLUMN docid varchar NOT NULL;");
ref.execute("ALTER TABLE reference MODIFY COLUMN refid varchar NOT NULL;");
ref.execute("ALTER TABLE reference ADD PRIMARY KEY (docid, refid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (docid) REFERENCES document(docid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (refid) REFERENCES document(docid);");
ref.close();
As you can see, I'm also building a Lucene index alongside the database. To keep articles from being added to the index more than once, I use the database to build a list of all articles by their unique IDs and to track which ones have already been indexed. For some reason, this query works just fine:
SELECT docid, MAX(title), MAX(content)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0;
but this query fails due to a memory limit:
INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) (
SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0 );
My suspicion is that nesting a SELECT inside another statement like this is what causes these errors, and that I will instead need to retrieve a ResultSet and use a PreparedStatement to batch-INSERT into the tables in chunks of ~10,000 rows or so.
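If it comes to that, here is roughly what I have in mind (just a sketch I haven't run yet, reusing gConn and the indexDoc staging table from the code above; the batch size and variable names are placeholders):
// Read the not-yet-indexed rows and insert them into document in chunks,
// so no single statement has to materialize the whole grouped result.
final int BATCH_SIZE = 10_000; // rough chunk size, still to be tuned

try (Statement sel = gConn.createStatement();
     ResultSet rs = sel.executeQuery(
         "SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
         + " FROM indexDoc"
         + " GROUP BY docid"
         + " HAVING MAX(indexed) = 0;");
     PreparedStatement ins = gConn.prepareStatement(
         "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year)"
         + " VALUES (?, ?, ?, ?, ?, ?)"))
{
    int pending = 0;
    while (rs.next())
    {
        // Columns come back in the same order the INSERT expects them.
        for (int i = 1; i <= 6; i++)
            ins.setString(i, rs.getString(i));
        ins.addBatch();
        if (++pending == BATCH_SIZE)
        {
            ins.executeBatch(); // flush this chunk
            pending = 0;
        }
    }
    if (pending > 0)
        ins.executeBatch(); // flush the remainder
}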