Robin Laing posted on Tue, 29 Jul 2025 19:24:42 -0600 as excerpted:

> I have been using pan for over a decade and the last time I tried pan a
> few days ago (Jul 27), it almost crashed my machine by using all the
> available memory, with usage over 16G and the rest of the swap on a
> machine with 32G ram and 8G swap.  Before that it froze up a couple of
> times downloading headers in the large groups for 100 days.
> 
> Would it be worth removing the full .pan2 directory and starting fresh
> to see if problems clear up?
> 
> Groups I look at have a large number of articles.  Due to obfuscation
> usage, they get a very large number, in the 100's of thousands in a few
> months.  There are original posts as well.

So it has been a while since I did a "big picture" post.  This takes the
long way around to a direct answer to your question, but hopefully it
eventually gets there...

High-level overview: Pan has always stored its "group world-view" in RAM.  
Historically that has at times triggered "large group" scaling issues as 
post volume and server retention competed with current per-app RAM 
capacity.  Various efficiencies have mitigated the problem temporarily, 
but ultimately pan needs to migrate to a "moving window view" in RAM while 
only keeping its overall world-view on-permanent-storage.  Efforts are 
ongoing...

The first problems (that I was around for, at least) were seen in the 00s
with 32-bit machines and a normal per-app RAM capacity of 1 or 2 gig,
depending on kernel memory model (32-bit addressing covers 4 gig, but
generally half of that was kernel-space, leaving only 2 gig max
addressable in userspace).  Back then, pan was tracking uncompressed
messages using list-widgets, server retention tended to be perhaps ten
days at best, a few hundred-thousand messages, and pan would run into
memory-capacity issues at ~200K messages.

Physical machine coping mechanisms included switching to 64-bit machines
to kill the 4-gig limit, or on 32-bit, switching to the less efficient
4G/4G model, with separate kernel and userspace address spaces, to raise
the per-app memory cap to 4G.  (A Linux patch for that was available but
apparently never made it to mainline/Linus.  I switched to amd64 in late
2003 and had "cheap" servers with lower retention -- sometimes only hours
on the biggest groups -- so I never personally ran into that limit, but I
saw the complaints on-list.)

Code-wise, then-maintainer Charles Kerr switched pan to a more
memory-efficient list-widget and "compressed" some usage by storing common
strings, like frequent poster names or fragments of large post subject
lines, only once and referring to them using shorter symbols, only
expanding them for display or actual download.
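
As a rough illustration only (this is not pan's actual code, and the class
and names are made up for the example), the store-once-and-refer-by-symbol
idea looks something like this in Python:

```python
class SymbolTable:
    """Store each distinct string once and hand out small integer ids."""

    def __init__(self):
        self._ids = {}      # string -> id
        self._strings = []  # id -> string

    def intern(self, text):
        """Return the id for text, adding it on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def expand(self, symbol):
        """Turn an id back into the full string, e.g. for display."""
        return self._strings[symbol]


table = SymbolTable()
# Hypothetical headers: (author, subject) pairs with a repeated author.
headers = [
    ("Some Poster <poster@example.invalid>", "big file [1/3]"),
    ("Some Poster <poster@example.invalid>", "big file [2/3]"),
    ("Some Poster <poster@example.invalid>", "big file [3/3]"),
]
# Each header now carries a small id instead of its own copy of the name.
compact = [(table.intern(author), subject) for author, subject in headers]
```

The saving comes from the repeated poster name being held once, however
many headers refer to it.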

At the pan config level, users could work around the problem by expiring 
"headers" faster if they were on high-retention servers, a technique that 
continues to help today.

Both the coding efficiencies and going 64-bit helped, but only 
temporarily.  Even on 64-bit platforms with 8-16 gig of RAM, high-end for 
that time, pan would at first still run into issues at around half a 
million "headers", tho the memory efficiency techniques helped and I 
remember people reporting pan could now handle ~750K headers before it 
started struggling.

Around that time Charles did the pan rewrite from C to C++ as well,
keeping the memory issues in mind.  That and generally increasing memory
capacities helped for a while too, but pan would still run into scaling
issues at around 1-1.2 million headers, tho by this point it wasn't so
much memory capacity per se as general compute inefficiencies in the
entire pan approach.  This remains the case today, though with a good
machine I believe pan can handle perhaps a couple million headers now.

Meanwhile, Charles had the idea of redesigning pan's entire approach, 
switching to a database model where pan basically never tracked the whole 
picture at once as it does currently, but fed a database the header 
information and did database queries to get a viewing window on whatever 
information it was trying to deal with at that time.  The database backend 
was going to be sqlite.
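
To make the "viewing window" idea concrete, here is a minimal sketch using
Python's sqlite3 module.  The table layout, column names and the window()
helper are all assumptions for illustration, not anything from pan's code
or from Charles' plans:

```python
import sqlite3

# In-memory database for the demo; a real client would use a file on disk.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE headers ("
    " id INTEGER PRIMARY KEY,"
    " grp TEXT NOT NULL,"
    " subject TEXT NOT NULL,"
    " posted INTEGER NOT NULL)"  # posting time as a Unix timestamp
)
db.execute("CREATE INDEX headers_by_group ON headers (grp, posted)")

# Feed the database header information as it arrives (made-up data).
rows = [("alt.test", f"post {n}", 1_700_000_000 + n) for n in range(1000)]
db.executemany(
    "INSERT INTO headers (grp, subject, posted) VALUES (?, ?, ?)", rows
)


def window(group, offset, size):
    """Fetch only the slice of headers that is needed right now."""
    cur = db.execute(
        "SELECT subject FROM headers WHERE grp = ?"
        " ORDER BY posted LIMIT ? OFFSET ?",
        (group, size, offset),
    )
    return [subject for (subject,) in cur]


# Only 50 of the 1000 stored headers are ever pulled into RAM here.
visible = window("alt.test", offset=200, size=50)
```

The point is that memory use follows the window size, not the total number
of headers on permanent storage.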

Unfortunately, I don't believe Charles was really comfortable with 
database programming and whatever experiments he might have tried he never 
presented publicly.  I think he got frustrated as he ran into issues he 
didn't have the database programming experience to solve, and "real life" 
intruded.  I remember he tried to find another pan maintainer to take 
over, but news/nntp has always been somewhat niche, and nobody stepped 
forward.  Eventually he moved on to other things and pan development was 
effectively abandoned for a few years, tho distro package maintainers 
tried to keep it at least updated/building/running against still 
maintained libraries.

Petr Kovar did eventually step up as upstream pan maintainer for a period,
but he was actually a gtk/gnome translator, not a programmer, and while he
did his best and we users were glad he did, mostly he took distro and user
patches and applied them upstream, not really doing a whole lot of his own
development.

Then Heinrich Mueller appeared, and pan had an upstream maintainer that 
could and did implement many new pan features once again.  It was Heinrich 
that finally implemented yenc posting, and Heinrich that implemented the 
long discussed rules integration with scoring, so we could for instance 
configure pan to auto-delete (pre-download) messages that scored to 
"ignored" level, auto-mark-read messages that scored "low" (less than zero 
but not to ignored/-9999), display but not auto-download "headers" for 
normal messages, and auto-download (or cache) "watched" (+9999) messages.
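
The score-to-action mapping just described can be sketched as below.  This
is a simplification, not pan's rule engine: the action names are invented,
and treating everything from 0 up to just under watched as "normal" is an
assumption made for the example.

```python
IGNORED = -9999
WATCHED = 9999


def action_for(score):
    """Map an article score to the rule action described above."""
    if score <= IGNORED:
        return "delete"       # ignored: dropped before any download
    if score < 0:
        return "mark-read"    # low: below zero but not ignored
    if score >= WATCHED:
        return "download"     # watched: fetch (or cache) automatically
    return "show-header"      # normal: display, but don't auto-download
```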

(Two-paragraph connections-limit diversion...)

Heinrich also implemented but then on community request reverted the 
ability to GUI-configure more than four connections per server.  The 
problem was that GNKSA had a rule of max four connections per server, and 
while I think most will agree that seems a bit anachronistic today (pay 
servers usually allow double-digit connections as they want you to use up 
your allotment and buy more, while the remaining free and ISP servers must 
strictly limit connections server-side to prevent abuse, so a client-level 
connection limit seems little relevant), Charles took quite a bit of pride 
in pan being 100% GNKSA compliant including many other factors such as 
quote/reply format, plain text not HTML, etc, and pan's user community has 
tended to likewise place quite some emphasis on that.  But it was 
personally my own fear, and the active on-pan-list community seemed to 
agree, that despite the lack of current relevance of that individual GNKSA 
limit, should pan eliminate that limit and lose its 100% GNKSA compliance, 
it'd be a slippery slope and way too easy for pan to lose many of the 
other GNKSA features that have made pan and its community what it is over 
the decades.

So after an informal on-list vote, Heinrich reverted that change, tho in 
terms of connection limits, it's worth noting that the GNKSA wording does 
have a loophole, which pan does exploit.  While in keeping with GNKSA pan 
(the pan GUI) does not allow more than four connections per server to be 
configured, should a pan user text-edit the pan config to say 5 or 20 or 
50 or whatever connections, pan will indeed honor that.  So that's how to
get around the pan and GNKSA limit, should one desire to.  However, it's
worth noting that pan connections are efficient enough that the number of
connections is seldom the bottleneck.  Instead, the bottleneck is usually
bandwidth, either that of the internet connection itself or the speed the
server allows.  Of course on slow machines the machine itself can
occasionally be the bottleneck as well.  But it's very seldom that pan's
GUI config limit of four connections per server actually turns out to be
the real bottleneck, so it isn't much of an issue for most in any case.
Never-the-less, the text-edit config option is there for people who
/think/ they need more.
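
For anyone who would rather script that text edit, a hedged sketch follows.
It ASSUMES the limit lives in servers.xml as a per-server element named
connection-limit; the sample layout below is illustrative only, so check
your own file first and make the change while pan isn't running.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: element names and layout are illustrative, not verified
# against any particular pan version.
SAMPLE = """<server-properties>
  <server id="1">
    <host>news.example.invalid</host>
    <connection-limit>4</connection-limit>
  </server>
</server-properties>"""


def set_limit(xml_text, host, limit, tag="connection-limit"):
    """Return xml_text with the limit changed for the matching host only."""
    root = ET.fromstring(xml_text)
    for server in root.iter("server"):
        if server.findtext("host") == host:
            node = server.find(tag)
            if node is not None:
                node.text = str(limit)
    return ET.tostring(root, encoding="unicode")


edited = set_limit(SAMPLE, "news.example.invalid", 8)
```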

With Heinrich's changes, pan finally could be considered basically feature 
complete, but for two things, one of which I at least don't believe is 
appropriate for pan as a general purpose news client anyway.  This is that 
pan isn't and I don't believe ever will be the best "batch uploader".  
There are other tools for that, and arguably, the workflow for that 
doesn't really match that of the general use news client that pan targets, 
in any case.  But Heinrich did implement general purpose yenc uploading, 
thereby checking off that feature required for a well rounded general-
purpose news client.

The other one is the subject at hand.  But before that, to wrap up a loose 
thread...  While Heinrich did basically "feature-complete" pan, I'm not 
entirely sure what actually happened to him.  Did he simply lose interest 
after that?  Did "real life" happen, say he got a family and didn't really 
have time for pan any more?  Did the always niche-case of news leave his 
life as a factor and he simply didn't have a personal use-case for pan any 
longer so he lost interest?  In any case, while I definitely remember
Charles /trying/ to find a new upstream maintainer to hand off to, and
failing, Heinrich was a different story.  He seemed to appear out of
nowhere, go great guns on pan for a while, feature-completing it like I
said, and then disappear into nowhere again.

Unfortunately that left pan orphaned without an upstream maintainer again
for a while.  Meanwhile, in real life the always-niche news/nntp seemed to
diverge even further from the mainstream, and the distro package
maintainers that had carried pan through its previous orphan state didn't
seem so interested now.  When pan lost strong upstream maintainership this
time, many distros (Debian being a significant exception!) dropped it.
Without (primarily) distro maintainers creating and maintaining the
necessary patches, pan was increasingly in danger of losing the updates
that allowed it to continue to build against current libraries.

But I DID say Debian was an exception, the very fortunate exception in 
this case, as Dominique Dumont, Debian's pan package maintainer, 
ultimately stepped up to be the upstream pan maintainer as well.  Pan was 
in very real danger of stale-code death when he stepped forward, 
completing and stabilizing the port to gtk3 and other current libraries 
just in time as many distros are now dropping or have already dropped gtk2 
support.  In the process he fixed quite a few bugs, and has actually 
started introducing new pan code once again. =:^)

That brings us back to the subject at hand: multi-million-message
scalability, implemented in the form of a database backend.  That is pan's
biggest current challenge and the second of the two remaining
yet-to-implement features.

Charles had talked about it but never implemented it, and it was the one
big feature Heinrich never, to my knowledge, even attempted, but now
Dominique's attempting it.  While I personally run live-git pan, he's
entirely reasonably doing that work in a separate development branch that
I've not tried, so I don't know current status.  Last he posted, however,
he was working on implementing the database backend first for some less
critical stuff, before attempting the real scalability challenge stuff.

He may well post a status update reply here, but meanwhile, what about 
your current problem, using the current pan code?  To finally answer your 
question directly...

Yes, removing ~/.pan2 to clear the problem may be worth it, altho 
personally I'd use a bisect troubleshooting process here, backing up and 
removing only parts of my config and cache to test, seeing where the 
actual problem is.

In particular, tasks.nzb is the current pan task list and could possibly 
be corrupted in a crash.  Removing it should only clear whatever tasks pan 
had queued at its freeze/crash, without killing the existing config, and 
is thus a reasonably safe and limited test.
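
A minimal sketch of that kind of one-file-at-a-time test, in Python.  The
helper names are made up for the example; it renames rather than deletes,
so renaming the file back undoes the test.  Run it only while pan isn't
running.

```python
import os
import time
from pathlib import Path


def pan_home():
    """Pan's data dir: $PAN_HOME if set, else the default ~/.pan2."""
    return Path(os.environ.get("PAN_HOME", Path.home() / ".pan2"))


def set_aside(name, home=None):
    """Rename one file or subdir out of pan's way; return the new path.

    Returns None if there was nothing by that name to move.
    """
    home = Path(home) if home else pan_home()
    src = home / name
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.aside-{stamp}")
    src.rename(dst)
    return dst
```

For the first test that would be set_aside("tasks.nzb"), then start pan
and see whether the problem is still there.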

To try to deal with the scaling issues on multi-million-header groups I'd 
recommend setting as short an expiry as you can reasonably deal with, say 
two or three days if you download daily.  Certainly, on several-hundred-
thousand headers per day groups, I'd try to keep expiry under a week if at 
all possible, as that's going to directly affect pan's memory and 
scalability due to its current "keep the big picture in main memory" 
model.
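
The arithmetic behind that is simple enough to sketch.  The 300,000 per
day figure is an assumed stand-in for "several hundred thousand", not a
measurement of any real group:

```python
def headers_held(per_day, expiry_days):
    """Rough count of headers pan keeps in memory for one group."""
    return per_day * expiry_days


PER_DAY = 300_000  # assumed posting rate for a busy binary group

for days in (3, 7, 30, 100):
    print(f"{days:>3} days -> {headers_held(PER_DAY, days):>10,} headers")
```

At that rate three days is 900K headers and a week is already 2.1 million,
right around the couple-million mark mentioned earlier, while 100 days
would be 30 million.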

Also consider cache size.  Pan's default cache size is already quite small
for big-binary groups and you probably don't want to reduce it further,
but if you've increased it to multiple gigs you might try something
smaller.  Depending on your workflow, however, you may run into the
problem I had: too small a cache, so messages end up deleted out of cache
before you can actually process them.  That's especially likely with a
multi-stage workflow like my normal one, downloading to cache first, then
sorting, and only /then/ saving off selected binaries.  With groups that
size you may have to change such a workflow to be more in line with pan's
apparently designed workflow of saving off binaries directly, without the
intermediate local-cache stage.  Again, with such large/active groups the
cache size may directly affect pan's ability to scale, and too large a
cache could be problematic.

And of course you can try to delete the cache (with or without backing it 
up to restore if the removal doesn't help) if you think it's corrupted, 
but with binary groups pan will soon fill it again in any case, and you 
will of course lose the work of already downloading those messages to 
cache for any you're still working on.

Beyond that, crashes could have corrupted files in the groups subdir,
which is where pan stores the symbol-mapping I mentioned above.  Pan will
recreate these if necessary, but given its big-picture-in-memory strategy,
unless a file's actually corrupted, deleting it isn't going to do pan much
good or save any memory, as pan will just have to recreate it.

The newsgroups.* files track various newsgroup related state, and can be 
recreated if necessary, of course with loss of state such as read-message 
tracking, etc.

preferences.xml is the main preferences file, servers.xml is server config 
and newsrc mapping, and of course the newsrc files track group state too.

As I said I'd prefer to back up my pan data dir and test deleting only 
individual files to troubleshoot where the problem is, and the above 
should help with that, but if you prefer you can of course just blow the 
whole thing away and start over, instead.

Meanwhile, one other config-related hint you may find helpful.  ~/.pan2 is
actually only the default.  If pan finds the PAN_HOME variable set in the
environment it inherits at start, it'll look there instead.  I actually
use this along with a pan wrapper script to set up multiple pan instances,
each with its own config.  Here I run separate binary vs text instances,
for instance, which could be very helpful if you want say much longer text
group expiry while you're doing as short a binary group expiry as possible
to work around pan's scaling issues.  Here, I don't expire my text groups
at all, but obviously that's not practical for most binary groups,
particularly at the high post volume you're dealing with.  (I have a
separate test instance too, so state for groups I'm just browsing
temporarily doesn't end up in my more permanent instance config.)  Set up
multiple separate pan instances, and different expiry for each isn't a
problem! =:^)

This of course assumes you know how to set up such wrappers yourself.  If
you need help with that, ask.
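
For reference, a minimal sketch of what such a wrapper could look like, in
Python.  It is an illustration under stated assumptions, not the script
described above: the ~/pan-instances layout and the function names are
invented, and only the PAN_HOME variable itself comes from pan.

```python
#!/usr/bin/env python3
"""Start pan with a per-instance data dir, e.g. instance "binary"."""
import os
import sys
from pathlib import Path

# ASSUMPTION: one dir per instance under ~/pan-instances; any layout works.
BASE = Path.home() / "pan-instances"


def instance_env(name, base=BASE, environ=None):
    """Return a copy of the environment with PAN_HOME set for one instance."""
    env = dict(os.environ if environ is None else environ)
    env["PAN_HOME"] = str(Path(base) / name)
    return env


def launch(name, extra_args=()):
    """Replace this process with pan, pointed at the instance's dir."""
    env = instance_env(name)
    Path(env["PAN_HOME"]).mkdir(parents=True, exist_ok=True)
    os.execvpe("pan", ["pan", *extra_args], env)


# When saving this as a real wrapper script, uncomment the next line:
# launch(sys.argv[1], sys.argv[2:])
```

Saved as, say, panwrap and made executable, "panwrap binary" and
"panwrap text" would then each start pan against its own data dir, with
its own expiry settings.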

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


