Thanks, Thomas, for the info and explanation; that makes sense.

 

One question, though: I'm trying to understand the difference between spammy
and hammy entries in the database, so I ran the following query:

 

assp=# select * from hmmdb where pkey like '%testosterone%';
                                        pkey                                         |  pvalue   | pfrozen
-------------------------------------------------------------------------------------+-----------+---------
 testosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub                                       | 0.9999999 |       0
 @macmedics.com\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong                     | 0.9999999 |       0
 ssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand                                        | 0.9999999 |       0
 @macmedics.com\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand                      | 0.9999999 |       0
 @macmedics.com\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel                      | 0.9999999 |       0
 free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong                                       | 0.9999999 |       0
 @macmedics.com\x1C98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 |       0
 boost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel                                        | 0.9999999 |       0
 98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone                   | 0.9999999 |       0
 @macmedics.com\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub                     | 0.9999999 |       0
(10 rows)

 

These spammy entries look identical in form to the disclaimer entry which you
said would be corrected-notspam.  Sorry, I don't know enough Perl to figure
out how the code deals with this: are these entries simply in a different
section of the database, or does each entry in fact contain enough info to
identify whether it is a spam or ham word?  When I just dump the database,
spam and ham entries appear together, so it appears to be the latter.
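
To make the question concrete, here is a minimal Python sketch of how I read
one of these rows. It assumes that the \x1C printed by psql stands for the
ASCII file-separator character (0x1C) joining the words of a key, and that
pvalue is the weight attached to the whole word sequence. split_key is a
hypothetical helper of mine, not something taken from the ASSP code.

```python
SEP = "\x1c"  # ASCII file separator; psql prints it as the escape \x1C


def split_key(displayed_key):
    """Split a pkey, as printed by psql, into its word sequence."""
    # Turn the printed four-character escape back into the real character.
    raw = displayed_key.replace(r"\x1C", SEP)
    return raw.split(SEP)


# Two rows copied from the query output above.
rows = [
    (r"testosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub", 0.9999999),
    (r"free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong", 0.9999999),
]

for key, pvalue in rows:
    print(split_key(key), pvalue)
```

If that reading is right, nothing in the key itself marks an entry as spam or
ham; only the pvalue column could carry that information.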

 

Thanks again,

 

- Phil

 

 


Re: <https://sourceforge.net/p/assp/mailman/message/36699848/>  [Assp-user]
Disclaimers not being removed?

From: Thomas Eckardt  - 2019-06-22 07:16:34 



> I also noticed the regex had truncated words in some but not all cases so I fixed that
 
ASSP_WordStem.pm is installed and used, so word stemming is done and 
stop-words are removed. Any attempt to "fix" this is wrong!
If the disclaimer is not stemmed in the mail, another language was 
detected for that mail. There is nothing you can (or should) fix.
 
The disclaimer-definition and every mail are processed as follows:
 
- remove all special characters and spaces
- detect the language
- stem all words according to the detected language
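
The three steps above can be illustrated with a toy Python sketch. It is not
the ASSP_WordStem.pm implementation: the stop-word lists, the suffix rules and
all function names are made up for this example, and the cleaning step simply
collapses everything except a-z and 0-9 into single spaces (so it would also
drop umlauts). It only shows why the same word can end up with a different
stem when a different language is detected.

```python
import re

# Hypothetical, tiny stop-word lists; the real module uses its own data.
STOP_WORDS = {
    "en": {"the", "and", "is", "an", "of", "to"},
    "de": {"der", "die", "und", "ist", "ein", "zu"},
}
# Toy suffix rules standing in for a real stemming algorithm.
SUFFIXES = {"en": ("ing", "ed", "s"), "de": ("en", "er", "e")}


def clean(text):
    # Step 1: replace every run of special characters with one space.
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def detect_language(words):
    # Step 2: pick the language whose stop-words occur most often.
    counts = {lang: sum(w in sw for w in words) for lang, sw in STOP_WORDS.items()}
    return max(counts, key=counts.get)  # a tie falls back to the first entry


def stem(word, lang):
    # Step 3: strip the first matching suffix, keeping at least 3 letters.
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process(text):
    words = clean(text).split()
    lang = detect_language(words)
    return lang, [stem(w, lang) for w in words if w not in STOP_WORDS[lang]]


print(process("CompanyName is an iPhone-powered company!"))
print(process("Der Absender und die Firma"))
```

Because the suffix rules depend on the detected language, a disclaimer that
arrives in a mail detected as another language is stemmed differently from
the stored definition.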
 
Another way to make sure the disclaimer is ignored by assp is to compose 
one or more fake mails which contain only disclaimers (possibly 
multiple times).
Put them in the opposite correction folder. 
 
 companyname\x1Cis\x1Can\x1Ciphone\x1Cpowered                   | 0.9999999 |       0
 
(here this would be corrected-notspam)
 
Make sure the MD5 hash of the body is different in all these mails.
Remove the disclaimer-definition.
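
A minimal Python sketch of one way to produce such fake mail bodies and to
check that their MD5 hashes differ. The disclaimer text and the helper name
fake_bodies are placeholders, and the hash here is taken over the raw UTF-8
body; how assp normalises a body before hashing it is not covered and may
differ.

```python
import hashlib

# Placeholder text; use the real disclaimer here.
DISCLAIMER = "CompanyName is an iPhone powered company."


def fake_bodies(disclaimer, count):
    # Repeat the disclaimer 1..count times so that no two bodies are equal.
    return ["\n".join([disclaimer] * n) for n in range(1, count + 1)]


bodies = fake_bodies(DISCLAIMER, 3)
digests = [hashlib.md5(body.encode("utf-8")).hexdigest() for body in bodies]

# Every body must have its own hash.
assert len(set(digests)) == len(digests)

for digest, body in zip(digests, bodies):
    print(digest, len(body.splitlines()), "line(s)")
```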
 
The disclaimer content will then either get a weight between 0.4 and 0.6 
and not be stored in the databases, or get a weight <= 0.4 and be 
detected as good.
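
The weight ranges can be written down as a small Python sketch. The ranges up
to 0.6 come from the text above; treating everything from 0.6 upwards as
spammy, and the handling of the exact boundary values, are assumptions, and
interpret_weight is a hypothetical name rather than an assp function.

```python
def interpret_weight(weight):
    """Map a stored weight (pvalue) to a rough label."""
    if weight <= 0.4:
        return "good"     # detected as good
    if weight < 0.6:
        return "neutral"  # between 0.4 and 0.6: not stored
    return "spammy"       # assumption: the remaining range counts as spam


for w in (0.9999999, 0.5, 0.4, 0.0000001):
    print(w, interpret_weight(w))
```

Under that assumption the 0.9999999 entries shown earlier are all spammy.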
 
Thomas

 


