Indices are your friend! (Except when loading data.) Add them on any relevant column. https://www.postgresql.org/docs/16/sql-createindex.html
The "(expression)" clause might be useful in your situation, since it can exclude some words from an index, exclude empty cells, index upper-case versions of the word, etc. On Tue, May 19, 2026 at 2:53 PM Martin Mueller < [email protected]> wrote: > Here is a more detailed version. > > I work on the curation of a corpus of some 65,000 Early Modern texts with > 1.5 billion words. They exist as TEI-XML files and each word is wrapped > in a <w> element. Here are the first and last two words in the corpus > > <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w> > > > <w lemma="begin" pos="vvz" reg="beginneth" > xml:id="a73abc-001-b-0020">begynneth</w> > > ... > > <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w> > > <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc> > > > The corpus has many corrupt spellings and errors in the linguistic > annotation. Most of them are low-frequency phenomena and occur in no more > than 64 documents. In nearly all cases you have enough evidence to correct > a word or its annotation if you can see the word in the middle of a text > string that includes > > > 1. the spelling > 2. the lemma > 3. the part of speech tag > 4. a standard spelling (e.g. 'loue' for 'love') > 5. up to seven previous words > 6. up to seven next words > 7. the spelling and POS tag of the previous > 8. the lemma and POS tag of the next word > 9. the Xpath of the current work > > > My goal is to involve users of the corpus in identifying and correcting > corrupt readings. I call this a "philological shopping cart" since the > offering of a correction can be thought of as a sale. Instead of buying > something, with the machine registering the who, what, when, and where of > the purchase, I offer an emendation, with the machine registering the who, > what, when, and where of my emendation. > > My hunch is that it would not be particularly difficult to build such a > philological shopping cart and that in terms of scale it would not be a big > thing. > > I am trying to mirror that "shopping cart" on my Mac. There are about 60 > million word occurrences that occur in no more than 64 texts. The basic > table has the columns described above, and half a dozen other columns for > data entry and various counts. There are some helper tables. The most > important of them is a simple case-insensitive list of spellings with their > document frequencies. This is very useful for finding suspect spellings > with queries like "show me all spellings in a low frequency range that > contain 'tb' and look for words where replacing 'tb' with 'th' will find a > word with a higher document frequency. That picks up spellings like 'tbe', > 'tbat', 'autboritie', etc. > > I've worked with KWIC tables of this kind for several years. I have Aqua > Data Studio as a front end for Postgres, currently version 17, running on a > five-year old Mac with an Intel processor and 32 GB of memory. I know a > lot less about the innards of a SQL database than I should. > > My largest kwic table has about 15 million rows with dozen columns for > each row. Except for the left and right context, the columns consists of > single words or numbers. The left and right context columns rarely add up > to more than 35 characters each.. I have used plain indexes for some > columns, with commands like "Create index on kwics16(keyword)", where > 'kwics16' is the table name. My typical routine takes a single-user > interactive form: ask a query, wait for the results (typically seconds, > sometimes a minute or more), and do something with the results. I know > next to nothing about the size of the database or tables, and it's not > something Ihave needed to worry about. There are occasional memory bottle > necks, because Aqua Data Studio isn't particularly good at release memory > once it's no longer used. Closing and reopening the client fixes that. > > It takes an hour or so to upload a table of this kind into the database. > Several tables of that size exist on my database and don't cause any > trouble. I don't know at what point I would be running into constraints of > an aging Mac with 32 GB of memory and a 2 TB hard drive. > > I could comfortably live with what I'm doing now, dividing the data into > three or four frequency ranges. > > Given this information, should I try and create a single table or am I > likely to run into serious constraints if I move beyond my current maximum > table size of 15 million records. > > Perhaps there is no clear answer, and I should just experiment. But if > any reader with more knowledge of Postgres thinks that in my environment I > would be skating on thin ice if I move beyond current limits, I'd be > grateful to be told so. > > > > > > > > > > > Martin Mueller > Professor emeritus of English and Classics > Northwestern University > *From: *Adrian Klaver <[email protected]> > *Date: *Tuesday, May 19, 2026 at 09:45 > *To: *Martin Mueller <[email protected]>; > [email protected] <[email protected]> > *Subject: *Re: scaling up from t1n to 60 million records > > On 5/19/26 7:27 AM, Martin Mueller wrote: > > I use Postgres with a GUI frontend (Aquafold) as a very large > > spreadsheet on steroids that analyzes rare or defective spellings in a > > corpus of 65,000 texts and1.5 billion words. I typically extract data > > from the corpus with python scripts, turn them into tables and load them > > into the database. > > > > > > On my Mac with 32 GB of memory performance is OK with queries that > > typically within seconds extract data rows from tables with up to ten > > million rows. If the result set is large, I suspect that most of time > > machine's time is spent displaying result sets. I have used indexing > > sparingly. While it helps, the time savings often don't matter much. > > This is going to need more information: > > 1) Postgres version. > > 2) The table schema including indexes. > > 3) An example of the query. > > 4) Where you are measuring the time. > > 5) The client you are displaying the results in. > > > > > > > I am thinking about scaling up to table with about 60 million rows. Are > > there things to do or watch out for? Or should I proceed on the > > assumption that that 60 million records are within scope and that the > > added timecost is roughly linear? > > > > Martin Mueller > > > > Professor emeritus of English and Classics > > > > Northwestern University > > > > > -- > Adrian Klaver > [email protected] > -- Death to <Redacted>, and butter sauce. Don't boil me, I'm still alive. <Redacted> lobster!
