Re: scaling up from t1n to 60 million records

Martin Mueller Tue, 19 May 2026 11:53:31 -0700

Here is  a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern  texts with 1.5  
billion words.  They exist as TEI-XML files and each word is wrapped in a <w> 
element.  Here are the first and last two words in the corpus

                  <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>

      <w lemma="begin" pos="vvz" reg="beginneth" 
xml:id="a73abc-001-b-0020">begynneth</w>

...

      <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>

          <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>

The corpus has many corrupt spellings and errors in the linguistic annotation. 
Most of them are low-frequency phenomena and occur in no more than 64 
documents. In nearly all cases you have enough evidence to correct a word or 
its annotation if you can see the word in the middle of a text string that  
includes

  1.
the spelling
  2.
the lemma
  3.
the part of speech tag
  4.
a standard spelling (e.g. 'loue' for 'love')
  5.
up to seven previous words
  6.
up to seven next words
  7.
the spelling and POS tag of the previous
  8.
the lemma and POS tag of the next word
  9.
the Xpath of the current work

My goal is to involve users of the corpus in identifying and correcting corrupt 
readings. I call this a "philological shopping cart" since the offering of a 
correction can be thought of as a sale. Instead of buying something, with the 
machine registering the who, what, when, and where of the purchase, I offer an 
emendation, with the machine registering the who, what, when, and where of my 
emendation.

My hunch is that it would not be particularly difficult to build such a 
philological shopping cart and that in terms of scale it would not be a big 
thing.

I am trying to mirror that "shopping cart" on my Mac.  There are about 60 
million word occurrences that occur in no more than 64 texts. The basic table 
has the columns described above, and half a dozen other columns for data entry 
and various counts.  There are some helper tables. The most important of them 
is a simple case-insensitive list of spellings with their document frequencies. 
 This is very useful for finding suspect spellings with queries like "show me 
all spellings in a low frequency range that contain 'tb' and look for words 
where replacing 'tb' with 'th' will find a word with a higher document 
frequency. That picks up spellings like 'tbe', 'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years.  I have Aqua Data 
Studio as a front end for Postgres, currently version 17, running on a 
five-year old Mac with an Intel processor and 32 GB of memory.  I know a lot 
less about the innards of a SQL database than I should.

My largest kwic table has about 15 million rows with dozen columns for each 
row. Except for the left and right context, the columns consists of single 
words or numbers. The left and right context columns rarely add up to more than 
35 characters each..  I have used plain indexes for some columns, with commands 
like "Create index on kwics16(keyword)", where 'kwics16' is the table name. My 
typical routine takes  a single-user interactive form: ask a query, wait for 
the results (typically seconds, sometimes a minute or more), and do something 
with the results.  I know next to nothing about the size of the database or 
tables, and it's not something Ihave needed to worry about. There are 
occasional memory bottle necks, because Aqua Data Studio isn't particularly 
good at release memory once it's no longer used. Closing and reopening the 
client fixes that.

It takes an hour or so to upload a table of this kind into the database.  
Several tables of that size exist on my database and don't cause any trouble. I 
don't know at what point I would be running into constraints of an aging Mac 
with 32 GB of memory and a 2 TB hard drive.

I could comfortably live with what I'm doing now, dividing the data into three 
or four frequency ranges.

Given this information, should I try and create a single table or am I likely 
to run into serious constraints if I move beyond my current maximum table size 
of 15 million records.

Perhaps there is no clear answer, and I should just experiment.  But if any 
reader with more knowledge of Postgres thinks that in my environment I would be 
skating on thin ice if I move beyond current limits, I'd be grateful to be told 
so.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
From: Adrian Klaver <[email protected]>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <[email protected]>; 
[email protected] <[email protected]>
Subject: Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large
> spreadsheet on steroids that analyzes rare or defective spellings in a
> corpus of 65,000 texts and1.5 billion words.  I typically extract  data
> from the corpus with python scripts, turn them into tables and load them
> into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables  with up to ten
> million rows.  If the result set is large, I suspect that most of time
> machine's time is spent displaying result sets. I have used indexing
> sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

>
>
> I am thinking about scaling up to table with about 60 million rows.  Are
> there things to do or watch out for? Or should I proceed on the
> assumption that that 60 million records are within scope and that the
> added timecost is roughly linear?
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>

--
Adrian Klaver
[email protected]

Re: scaling up from t1n to 60 million records

Reply via email to