https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6061





--- Comment #9 from AXB <[email protected]>  2009-02-07 12:45:22 PST ---
> (in reply to comment #5)
> Is the idea to accept anything that begins with "http://" as a URL? I
> would like to have some idea as to how many false positives that leads
> to -- not FPs on spam detection, although that is important too, but for
> this, how many false identifications of strings as URLs and how many
> resulting unnecessary calls to URIRBLs? The reason for the current URI
> parse code (in trunk -- I'm still waiting for that one more review and
> vote to put it in the 3.2 branch) is to only send to the RBL what are
> possibly real links.

- It's not supposed to trigger any queries.
- It's not supposed to be used to mark spam or ham, so FPs are not an issue.
- It IS supposed to check whether what the parser thinks is a TLD exists
in the TLD data or not.

If the URL is example.comm and ".comm" is NOT in the known TLD list, return 0.
If the URL is example.com and ".com" IS in the known TLD list, return 1.

Make the 0 available to a rule.

Nothing else.
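To make the intent concrete, here's a minimal Python sketch of the check described above. The function name and the tiny TLD set are illustrative only -- SA's actual TLD data is much larger and the real implementation lives in SA's Perl URI-parsing code.

```python
# Illustrative stand-in for SA's known-TLD data (the real list is large).
KNOWN_TLDS = {"com", "net", "org"}

def has_known_tld(host):
    """Return 1 if the host's last label is a known TLD, else 0."""
    tld = host.rsplit(".", 1)[-1].lower()
    return 1 if tld in KNOWN_TLDS else 0

print(has_known_tld("example.comm"))  # 0 -- ".comm" is not in the TLD list
print(has_known_tld("example.com"))   # 1 -- ".com" is in the TLD list
```

A rule would then fire on the 0 result; no DNS query is ever made.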

> Which brings up another point. Is health.sharpdecimal as opposed to
> health.sharpdecimal.com in the RBLs anyway? 

The URIBLs depend on SA's or other TLD tables to list a domain.
If it's an unknown TLD, it won't be listed.

health.sharpdecimal won't ever be listed unless someone starts listing
these types.
No sober BL op I know of would do this :-)


> If not, what would be the point of  parsing it as a URL?

- to detect whether the domain's TLD is in the known TLD list
- to create custom URI rules to detect stuff which won't ever be listed
but needs scoring (positive or negative, whatever may apply)
- if it's a new/obscure/frequent URI ending, add a util_rb_2tld entry to
allow SA to parse it as a known TLD
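For reference, such entries go in a .cf configuration file. The lines below are a hypothetical example only (using the bug's "sharpdecimal" ending); check SA's shipped TLD data for the exact syntax in use:

```
# illustrative: register an obscure ending so SA's URI parser accepts it
util_rb_tld   sharpdecimal
# or, for a two-level suffix (like co.uk):
util_rb_2tld  co.sharpdecimal
```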


-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
