Re: [Pan-users] Filtering

Duncan Fri, 16 Sep 2016 14:45:18 -0700

DLSauers posted on Fri, 16 Sep 2016 14:18:38 +0000 as excerpted:

> In one particular group I haunt the alot of cruft gets crossposted in
> for non related topics...
> 
> I heavily filter this group, but could probably gut down on adding
> filters daily and/or the existing ones if I could just get pan to filter
> out things.
> 
> What I am after is something along...
> 
> Lets say the group is x
> 
> If the post has MORE than group X and contains *.politics.* etc... mark
> it -99999999999999999999999999999999999999999999999999999 or what ever..
> 
> None of the options for scoring rules seems to allow this or work, the
> only way to filter this stuff is set up a lot of rules like
> 
> contains Hillary contains Trump contains gay contains ......
> 
> Just being able to if post is xposted to more than 1 group ie X mark it
> -9999 would nuke a lot of stuff....
> 
> Or is pan not able to setup such advanced scoring filters via the GUI
> and/or otherwise????
> 
> This group is rather problematic, and always has been.. It has the
> biggest fitlering/killfile and well the only filtering and killfil I've
> used on Usenet in 30+ years!
> 
> Any hints on getting more advanced filtering done???


First the general stuff.  You didn't indicate whether you know this
already.  If you're a list regular you might, as I've posted it here many
times over the years, tho otherwise you likely won't, unless you've used
the other clients previously and compared their scorefiles with pan's.

Pan's scorefile format is in general a less advanced implementation of 
SLRN's scorefile format (without the fancy stuff such as includes...):

http://slrn.sourceforge.net/docs/score.txt

... but with the case insensitivity (but not the other changes) of xnews 
(my link for that one is dead, but slrn is primary, so it's not worth 
trying to google or otherwise resurrect the xnews one).

Here's the abridged version of the format description I keep as comments 
in my own scorefile:

% [newsgroup.*] wildcard (not regex) format (~ negates).
% header lines regex. (~ negates).
% Score conditions, single : and, double :: or.
% Expires: immed. below score if present.
% Leading % indicates comment
% Leading whitespace and blank lines ignored.
% Regex and newsgroup matches case insensitive with
% keyword:, sensitive with keyword=.
% Newsgroup change delimits section,
% Score delimits "rule", multiple rules per section allowed.
% Comment after score becomes rule "name".

% Score levels: <=-9999 kill, -9998 to -1 low,
%               0, 1 - 4999 med, 5000 - 9998 high, >=9999 watch
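
As an illustration only (an untested sketch, not an excerpt from a real
scorefile), here is how those elements might fit together.  The group
name, the patterns and the expiry date are all made up, and the
MM/DD/YYYY date format is an assumption taken from the slrn format:

```
%#####################################################################
% Section: applies to every group matching the wildcard.
[example.group.*]

% Rule 1: OR logic (double colon).  Any one matching line scores -100.
Score:: -100 %Example low score
        Subject: make money fast
        From: spammer@example\.invalid

% Rule 2: expiring rule.  Expires: sits directly below the Score line.
Score:: -500 %Example expiring score
        Expires: 12/31/2017
        From: noisy poster

% Rule 3: keyword= instead of keyword: makes the match case sensitive.
Score:: 5000 %Example high score
        Subject= ^ANNOUNCE
```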


** EXCEPT: Unfortunately the last time I investigated, pan's scoring had
a bug, and would **NOT** do logical AND -- the single : was treated as OR
(::) regardless.  Fortunately, most of my scoring (and I guess pretty
much anyone else's) is single-shot OR logic anyway, so that's not as big
a deal as it would be if OR logic were broken instead of AND, but it
/does/ rather kill a direct implementation of your AND test above... if
the bug still exists, which I suppose it does, but I haven't tested
recently.

However, it's /somewhat/ possible to work around that limitation by
judicious use of additive scoring.  As an example, use two rules that
each set -5000, so they combine to -10000 and trigger the kill level.

Tho if you have other rules that add say 100 and a message triggers them
as well, it'll end up at -9900 and not trigger kill.  That's a good thing,
as it makes the scheme far more flexible.  If you want the pair to kill
regardless, make the two -9998 each, so each one /almost/ kills on its
own and any trivial +100s won't undo the kill of both combined.  Or make
them both -4950 if you want an additional trivial -100 to be necessary
before the kill triggers, or...
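
An untested sketch of that additive workaround follows.  The group name
and both patterns are placeholders; the only point is the arithmetic.  A
message matching one rule lands at -5000 (low), while a message matching
both lands at -10000, which is at or below -9999 and so is killed:

```
[example.group.*]
% Condition A alone: -5000, scored low but not killed.
Score:: -5000 %Half-kill A
        Subject: first pattern

% Condition B alone: -5000 as well.
% A and B together: -5000 + -5000 = -10000 <= -9999, so kill.
Score:: -5000 %Half-kill B
        From: second pattern
```

In the slrn format a leading = on the score value sets the score outright
instead of adding to it, so additive rules like these leave it off.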


The other thing that should stick out as pretty important from the above
rules, once you understand that a leading % indicates a comment and then
look at the rules pan writes when you create them via its GUI, is that:

** Most of the lines pan adds to the scorefile are simply extra
explanatory comments -- they don't actually affect the rules at all, and
deleting many of them can massively shrink your scorefile without
affecting the actual scorefile logic.


Finally, if you've been using pan's GUI to create most of your scores and 
haven't edited or have only lightly edited the scorefile itself, and you 
do a LOT of scoring, you should be able to *greatly* optimize things with 
some rather more active manual scorefile formatting and editing.  For 
instance, a short excerpt from the alt.* spam-kill section of my own 
scorefile:

Warning, adult themed example!

%#####################################################################
%#####################################################################
[alt.*]
Score:: =-9999 %Alt kill
        From: Seeking teens
        From: teens seeker
        From: ^LoLiTa <
        From: ^GOBLIN <
        From: sex coed
        From: NudeGirls
        From: voyeur only
        From: amateur
        From: SEXmag
        From: teens
        From: intermixed
        From: rectal

        Subject: adult movies
        Subject: dupped
        Subject: ^\([-0-9/]*\)
        Subject: Use critical pack from Microsoft Corporation
        Subject: R/-\\PE
        Subject: R/-\|PE
        Subject: Horny mom
        Subject: rectal exam
        Subject: body cavity
        Subject: mature women
        Subject: candid voyeur
        

Just imagine how many lines that would take if they were each
individually added as separate rules, complete with multiple comment
lines each, by pan's GUI.  Here, they're easily human-readable, and far
easier and more efficient for pan to parse.

The down side to this level of scorefile editing, of course, is that in
order to maintain it, you pretty much have to either add new entries
manually, or pretty regularly go in and reoptimize all the entries you've
added via the pan GUI since the last time you cleaned up.

The up side is of course that once you have it cleaned up, it's dead easy 
to manually add an additional single-line entry.

Meanwhile, a few hints:

* Set a pan hotkey for the "articles, edit article's watch/ignore score"
function.  From there you can hit the close-and-rescore button to rescore
based on any manual edits you just made to the scorefile.  That's the
easiest way I've come up with to get pan to reapply freshly edited
scores.

* Use %#### or similar comment lines to visually separate sections, as I 
did in the example above.

* Consider whether you want an expiring or permanent score.  Permanent 
scores can be easily added to the nicely edited groups manually, while 
it's tougher to group expiring scores since the expires line will differ, 
so adding these via the pan gui works well enough.

* Consider adding a %### separator line or two at the bottom of your 
permanent scores, so pan can append the expiring scores you add via the 
GUI, and it's easier to go in and clean up later since you know where the 
new ones start.  Talking about which...

* Pan doesn't clean up expired scores on its own.  You'll have to go thru
and weed them out once in a while.  (After doing so a few times, you may
find yourself not adding so many expiring scores, choosing instead to 
either add a permanent one or simply skip it, so you don't have to clean 
up the expired score later.  But if you're like me you'll still add a 
few, for people irritating enough to want to score down temporarily, but 
who you think might still learn some maturity, in say a year or so, so 
you don't want to make it permanent just yet.)

* For expiring scores, I've found it helpful to keep pan's "created by 
Pan on <date>" comments, as that way I not only know when it expires, but 
I know when it was created, and thus have some idea of how irritated I 
was when I created the entry, based on how long I set it to last before 
expiring.

*** Pan can score based on any header, not just the ones the GUI allows 
you to score.  However, headers that aren't in the overviews as sent from 
the server won't apply until the message is actually downloaded to cache, 
making them much less efficient since you won't be able to see the effect 
until the message is already downloaded and in cache.  That's a 
limitation of the protocol (and overviews) that pan can't do anything 
about, but sometimes, having to download a message before it can be 
killed is still better than having to actually read it.

*** The above should let you manually add scores based on either the
Newsgroups header or the Xref header, both of which contain the list of
cross-posted groups.  (That's the header in the message itself, as
opposed to the [*] section head specifier, which matches the newsgroup
you're actually in at the time.)  The Newsgroups header lists all the
groups the message was posted to, regardless of whether that server
carries them or not, while the Xref header lists only the ones carried on
that server, along with the message number for the message in each of
those groups.  However, I'm not sure whether these rules will apply
before or after download, due to the above-mentioned overviews issue.


Those last two hints should allow you to score based on crossposting to
N+ groups, provided you know enough about the crossposted group names in
advance to create a score for them.  Alternatively, scoring on the Xref
header and counting the number of colons (one per group:number pair)
should allow you to score on a message posted to N+ groups regardless of
name, provided the server carries that many of the groups and thus
crossposts the message to them.  But again, without actually testing it
I don't know for sure whether such scores could be applied before
download, with only the overviews information available, or only after
download.  Either way, it should be possible, but one will obviously be
far more convenient than the other.
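
An untested sketch of both approaches follows.  It assumes pan accepts
Newsgroups and Xref as score keywords as described above, and that the
Xref value has the usual form of a server name followed by group:number
pairs, so one colon per group.  The group name x.example is a stand-in
for your group X:

```
[x.example]
% Known crosspost targets: kill anything also posted to a
% *.politics.* group.
Score:: =-9999 %Crossposted to politics
        Newsgroups: \.politics\.

% Any crosspost: two or more colons in Xref means two or more groups
% carried by this server.
Score:: =-9999 %Crossposted to 2+ groups
        Xref: :.*:
```

To require more groups, lengthen the pattern by one .*: per additional
group, for example :.*:.*: for three or more.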

And again, as I said above, I believe the AND logic bug will prevent
combining both an N newsgroups filter and a subject line filter into a
single rule that requires both.  But by using multiple scoring rules and
adjusting the scores applied by each, you should be able to approximate
the same thing.
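
For the original question, that approximation might look like the
untested sketch below.  It splits the two conditions into separate
additive rules of -5000 each.  The group name x.example and the subject
pattern are placeholders, and it assumes pan honors an Xref score keyword
whose value carries one colon per crossposted group:

```
[x.example]
% Half 1: crossposted at all (two or more group:number pairs in Xref).
Score:: -5000 %Crosspost half
        Xref: :.*:

% Half 2: subject matches a hot-button pattern.
% Both halves together: -10000, at or below the -9999 kill level.
Score:: -5000 %Subject half
        Subject: politic
```

A message matching only one half stays at -5000, so it's scored low but
remains visible.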

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-users
