Here is a more detailed version.
I work on the curation of a corpus of some 65,000 Early Modern texts with 1.5
billion words. They exist as TEI-XML files and each word is wrapped in a <w>
element. Here are the first and last two words in the corpus
<w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>
<w lemma="begin" pos="vvz" reg="beginneth"
xml:id="a73abc-001-b-0020">begynneth</w>
...
<w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>
<pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>
The corpus has many corrupt spellings and errors in the linguistic annotation.
Most of them are low-frequency phenomena and occur in no more than 64
documents. In nearly all cases you have enough evidence to correct a word or
its annotation if you can see the word in the middle of a text string that
includes
1.
the spelling
2.
the lemma
3.
the part of speech tag
4.
a standard spelling (e.g. 'loue' for 'love')
5.
up to seven previous words
6.
up to seven next words
7.
the spelling and POS tag of the previous
8.
the lemma and POS tag of the next word
9.
the Xpath of the current work
My goal is to involve users of the corpus in identifying and correcting corrupt
readings. I call this a "philological shopping cart" since the offering of a
correction can be thought of as a sale. Instead of buying something, with the
machine registering the who, what, when, and where of the purchase, I offer an
emendation, with the machine registering the who, what, when, and where of my
emendation.
My hunch is that it would not be particularly difficult to build such a
philological shopping cart and that in terms of scale it would not be a big
thing.
I am trying to mirror that "shopping cart" on my Mac. There are about 60
million word occurrences that occur in no more than 64 texts. The basic table
has the columns described above, and half a dozen other columns for data entry
and various counts. There are some helper tables. The most important of them
is a simple case-insensitive list of spellings with their document frequencies.
This is very useful for finding suspect spellings with queries like "show me
all spellings in a low frequency range that contain 'tb' and look for words
where replacing 'tb' with 'th' will find a word with a higher document
frequency. That picks up spellings like 'tbe', 'tbat', 'autboritie', etc.
I've worked with KWIC tables of this kind for several years. I have Aqua Data
Studio as a front end for Postgres, currently version 17, running on a
five-year old Mac with an Intel processor and 32 GB of memory. I know a lot
less about the innards of a SQL database than I should.
My largest kwic table has about 15 million rows with dozen columns for each
row. Except for the left and right context, the columns consists of single
words or numbers. The left and right context columns rarely add up to more than
35 characters each.. I have used plain indexes for some columns, with commands
like "Create index on kwics16(keyword)", where 'kwics16' is the table name. My
typical routine takes a single-user interactive form: ask a query, wait for
the results (typically seconds, sometimes a minute or more), and do something
with the results. I know next to nothing about the size of the database or
tables, and it's not something Ihave needed to worry about. There are
occasional memory bottle necks, because Aqua Data Studio isn't particularly
good at release memory once it's no longer used. Closing and reopening the
client fixes that.
It takes an hour or so to upload a table of this kind into the database.
Several tables of that size exist on my database and don't cause any trouble. I
don't know at what point I would be running into constraints of an aging Mac
with 32 GB of memory and a 2 TB hard drive.
I could comfortably live with what I'm doing now, dividing the data into three
or four frequency ranges.
Given this information, should I try and create a single table or am I likely
to run into serious constraints if I move beyond my current maximum table size
of 15 million records.
Perhaps there is no clear answer, and I should just experiment. But if any
reader with more knowledge of Postgres thinks that in my environment I would be
skating on thin ice if I move beyond current limits, I'd be grateful to be told
so.
Martin Mueller
Professor emeritus of English and Classics
Northwestern University
From: Adrian Klaver <[email protected]>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <[email protected]>;
[email protected] <[email protected]>
Subject: Re: scaling up from t1n to 60 million records
On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large
> spreadsheet on steroids that analyzes rare or defective spellings in a
> corpus of 65,000 texts and1.5 billion words. I typically extract data
> from the corpus with python scripts, turn them into tables and load them
> into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables with up to ten
> million rows. If the result set is large, I suspect that most of time
> machine's time is spent displaying result sets. I have used indexing
> sparingly. While it helps, the time savings often don't matter much.
This is going to need more information:
1) Postgres version.
2) The table schema including indexes.
3) An example of the query.
4) Where you are measuring the time.
5) The client you are displaying the results in.
>
>
> I am thinking about scaling up to table with about 60 million rows. Are
> there things to do or watch out for? Or should I proceed on the
> assumption that that 60 million records are within scope and that the
> added timecost is roughly linear?
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>
--
Adrian Klaver
[email protected]