On Sat, Mar 12, 2005 at 10:07:46AM -0500, Jason Gurtz wrote:
On 12-Mar-05 09:29, Hui Zhou wrote:

Reading my own mail (the one that I just sent :)), I realize that simple token treatment definitely won't work well enough to sort my post as interesting (how shameless :)). It may work for categorizing regular notifications and alerts, but for a general chat list, something more needs to be taken into account. Maybe the length of the original post? Or the proportion of quoted text to reply? Or the average sentence length?
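
The three measurements floated above (post length, proportion of quotes to reply, average sentence length) could be computed roughly as follows. This is only a sketch: the function name `body_features`, the dictionary keys, and the assumption that quoted lines begin with ">" are illustrative choices, not anything specified in the thread.

```python
import re


def body_features(body: str) -> dict:
    """Compute simple structural measurements of a plain-text message body."""
    # Ignore blank lines so padding does not skew the ratios.
    lines = [ln for ln in body.splitlines() if ln.strip()]
    # Assumption: quoted material is marked with a leading ">".
    quoted = [ln for ln in lines if ln.lstrip().startswith(">")]
    reply = [ln for ln in lines if not ln.lstrip().startswith(">")]

    reply_text = " ".join(reply)
    # Crude sentence split on ., ! and ?; good enough for a rough average.
    sentences = [s for s in re.split(r"[.!?]+\s*", reply_text) if s.strip()]
    words = reply_text.split()

    return {
        "total_lines": len(lines),
        "quote_ratio": len(quoted) / len(lines) if lines else 0.0,
        "avg_sentence_words": len(words) / len(sentences) if sentences else 0.0,
    }
```

Each value is a plain number, so it can be thresholded, bucketed, or fed to whatever classifier is used later.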

I think the hard part is really to come up with the heuristics that do the sorting. Beyond that, it's just separating those heuristics into classes that each do the sort. I personally find it harder to come up with regexes that generically match non-spam mail because I seem to

If the regexes are easy, or even possible, to come up with, procmail should be sufficient.


think more in terms of what I don't want.  Maybe you can take a similar
approach in a hierarchy from "least want to read" to "most want to read"

You may even want to look at something like MIMEDefang which gives you
access via perl to many different message qualities.

I tend to access the mail stream directly and pull out whatever information seems worth considering.


 Number of
recipients, time it was sent, envelope From:, etc....  That may give you
more options in developing the heuristics and then you can just use it
to add a custom header which procmail will then use for its sorting job.
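
As a rough illustration of the "add a custom header, let procmail sort on it" idea, here is a sketch using Python's standard `email` package rather than MIMEDefang's Perl interface. The header name `X-Sort-Hint`, the labels, and the recipient threshold are all made up for the example.

```python
from email import message_from_string
from email.utils import getaddresses


def tag_message(raw: str) -> str:
    """Return the message with a hypothetical X-Sort-Hint header added."""
    msg = message_from_string(raw)

    # Count every address listed in To: and Cc:.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    # Hypothetical rule: many recipients suggests a bulk mailing.
    label = "bulk" if len(recipients) > 5 else "personal"

    # Remove any existing copy first so the header is never duplicated.
    del msg["X-Sort-Hint"]
    msg["X-Sort-Hint"] = label
    return msg.as_string()
```

A procmail recipe could then match on a condition such as `* ^X-Sort-Hint: bulk` and deliver to the corresponding folder.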

No, I am not talking about heuristic methods. All heuristic methods seem simple-minded and require far more maintenance than initial development. What I am thinking of is treating heuristic characteristics as equivalent to word tokens and applying statistics to them. Hopefully, over a large number of sample mails, the filter can figure out some patterns from the statistics.


Take SpamAssassin: it is a sophisticated, heuristic, rule-based filter. However, to use its full potential, users are asked to tune the weights of far too many individual rules. Frankly, I think no expert can get that right unless the subject is studied scientifically on a case-by-case basis, which means doing statistical analysis.
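
One way to read "treat heuristic characteristics as word tokens and apply statistics" is sketched below: numeric measurements are bucketed into pseudo-tokens and scored alongside ordinary words with a naive-Bayes-style log-odds and add-one smoothing, so the weights come from training mail instead of hand tuning. The class `TokenStats`, the helper `bucket`, the two labels, and the bucket edges are assumptions made for this sketch, not a description of any existing filter.

```python
import math
from collections import Counter


def bucket(name, value, edges):
    """Turn a numeric measurement into a pseudo-token such as 'quote_ratio:1'."""
    # The bucket index is the number of edges the value has reached.
    return f"{name}:{sum(value >= e for e in edges)}"


class TokenStats:
    """Per-label counts of how many messages contained each token."""

    def __init__(self):
        self.counts = {"interesting": Counter(), "boring": Counter()}
        self.totals = {"interesting": 0, "boring": 0}

    def train(self, tokens, label):
        # Count each token once per message (presence, not frequency).
        self.counts[label].update(set(tokens))
        self.totals[label] += 1

    def score(self, tokens):
        """Log-odds of 'interesting' versus 'boring'; positive favors interesting."""
        n_i, n_b = self.totals["interesting"], self.totals["boring"]
        log_odds = math.log((n_i + 1) / (n_b + 1))
        for tok in set(tokens):
            # Add-one smoothing keeps unseen tokens from producing log(0).
            p_i = (self.counts["interesting"][tok] + 1) / (n_i + 2)
            p_b = (self.counts["boring"][tok] + 1) / (n_b + 2)
            log_odds += math.log(p_i / p_b)
        return log_odds
```

For example, a message's token list might be its words plus `bucket("quote_ratio", 0.8, [0.3, 0.7])`, so the structural measurement is learned the same way as any word.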

Sounds like an interesting project anyway.

Good to hear.

--
Hui Zhou
--
http://linuxfromscratch.org/mailman/listinfo/lfs-chat
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page
