https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7215

--- Comment #14 from Mark Martinec <[email protected]> ---
I collected a number of debug log entries produced by idn_to_ascii()
to get a feel for how successful the conversion is.

The IDN handling seems quite useful, as the following samples show.
Most of these are in Russian Cyrillic, some are Slovenian (and I
remember some German samples, but can't find them in my recent
logs):

util: idn_to_ascii: converted to ACE (0):
  /www.грузтранском.рф/ -> /www.xn--80afmnkeilbmhk.xn--p1ai/
  /www.отличное-мнение.рф/ -> /www.xn----itbbaldqlgdbdd6c9d.xn--p1ai/
  /www.правильный-директ.рф/ -> /www.xn----7sbgjfpcfnewtnj6a0kpa.xn--p1ai/
  /грамотное-сео.рф/ -> /xn----7sbijb3bhhbdnti.xn--p1ai/
  /грузтранском.рф/ -> /xn--80afmnkeilbmhk.xn--p1ai/
  /frižider.si/ -> /xn--friider-fxb.si/
  /www.žarnice.si/ -> /www.xn--arnice-2pb.si/
  /žarnice.si/ -> /xn--arnice-2pb.si/
  /www.контролируемый-имидж.рф/ -> /www.xn----htbbggcafgkndfnb5ad5ay6n.xn--p1ai/
  /www.на-отдых-в-сахару.рф/ -> /www.xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /www.работа-на-себя.рф/ -> /www.xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /www.стройметком.рф/ -> /www.xn--e1ahegchekikf.xn--p1ai/
  /делай-деньги.ком.рф/ -> /xn----7sbkbcddzes1a4p.xn--j1aef.xn--p1ai/
  /заказ-грузоперевозок.орг.рф/ -> /xn----7sbajemakccd1aj5cdblpe9c.xn--c1avg.xn--p1ai/
  /играть-за-деньги.рф/ -> /xn-----6kcbmegiogj2d5a3a4mh.xn--p1ai/
  /идеал-мастер.рф/ -> /xn----7sbbnfdp1ak6bjm.xn--p1ai/
  /курсы-шоуменов.рф/ -> /xn----dtbislhedmkue7dyb.xn--p1ai/
  /люди-и-цифры.рф/ -> /xn-----jlcqbbp0c9as2d0a.xn--p1ai/
  /на-отдых-в-сахару.рф/ -> /xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /обучаем-иностранному.рф/ -> /xn----7sbbbvt0adhbachd0aprjp8d.xn--p1ai/
  /плавучая-баня.рф/ -> /xn----7sbabed5dwak7b5b6fe.xn--p1ai/
  /работа-на-себя.рф/ -> /xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /скайвуд-лиственница.рф/ -> /xn----7sbbgbkjwcdjr3aa2cirm2e.xn--p1ai/
  /такси-московское.орг.рф/ -> /xn----7sbhmmlcbpubc4aede.xn--c1avg.xn--p1ai/
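
For anyone who wants to reproduce conversions like these outside of
SpamAssassin, here is a minimal Python sketch. It assumes Python's
built-in "idna" codec (which implements IDNA 2003, RFC 3490) is an
acceptable stand-in for the library behind idn_to_ascii(); the
helper name to_ace is made up for illustration.

```python
def to_ace(hostname):
    """Return the ASCII-compatible encoding (ACE) of a host name.

    Hypothetical helper, not the SpamAssassin routine. Returns None
    when the codec rejects the input (for example an empty or
    over-long label).
    """
    try:
        # The codec splits on dots, applies nameprep to each label and
        # Punycode-encodes any label containing non-ASCII characters.
        return hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return None


if __name__ == "__main__":
    for name in ("грузтранском.рф", "www.žarnice.si", "www.adatours.com"):
        print(name, "->", to_ace(name))
```

Pure-ASCII names pass through unchanged, so the helper can be applied
to every extracted host name without a separate check.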


...but there are also plenty of samples that show the URL extraction
code failing miserably to delimit a URL from the surrounding text
when dealing with UTF-8 encoded (normalized) text:

util: idn_to_ascii: alternative dots normalized:
  /自由。”/ -> /自由.”/
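
A minimal Python sketch of what that normalization step amounts to,
assuming it maps the three extra label separators listed in RFC 3490
(U+3002, U+FF0E, U+FF61) to an ASCII full stop. The name
normalize_dots is made up for illustration and is not the
SpamAssassin routine.

```python
import re

# Label separators that RFC 3490 treats as equivalent to "." :
# ideographic full stop, fullwidth full stop, halfwidth ideographic
# full stop.
ALT_DOTS = re.compile("[\u3002\uff0e\uff61]")


def normalize_dots(host):
    """Replace alternative IDNA label separators with an ASCII dot."""
    return ALT_DOTS.sub(".", host)
```

Note that the trailing quotation mark in the sample survives this
step, which is why the result is still not a usable host name.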

util: idn_to_ascii: conversion to ACE failed (0):
  /他一向令女人神魂颠倒的抚摸,就真的那么令她讨厌吗?/
  /www.EPChinaShow.com&t=China’s Largest and Most Authorized Electric Power Exhibition/
  /经五年没有见到你了,求求你了妈妈,陪陪我,好不好?”/

util: idn_to_ascii: converted to ACE (0):
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.uradni-list.si•/ -> /www.uradni-list.xn--si-g3t/
  /#IUS_INFO_ČISTOPISI/ -> /xn--#ius_info_istopisi-3gc/
  /#STROKOVNI_ČLANKI/ -> /xn--#strokovni_lanki-27b/
  /179英文.files/ -> /xn--179-4p8fh21k.files/
  /kjn.uradni-list.si / -> /kjn.uradni-list.si /
  /t.c…”/ -> /t.xn--c...-jb7a/
  /www.WaterNexus.net│[email protected]/ ->
/[email protected]/
  /www.adatours.com / -> /www.adatours.com /
  /www.aijssnet.com With/ -> /www.aijssnet.com with/
  /www.aloftcupertino.com / -> /www.aloftcupertino.com /
  /www.defensedaily.com that/ -> /www.defensedaily.com that/
  /www.disclaimer-uk.wur.nl / -> /www.disclaimer-uk.wur.nl /
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.hotchips.org   For/ -> /www.hotchips.org for/
  /www.hoti.org EarlyRegistration/ -> /www.hoti.org earlyregistration/
  /www.laboratory-journal.com”/ -> /www.laboratory-journal.xn--com-9o0a/
  /www.ontoresinc.com)provides/ -> /www.ontoresinc.com)provides/
  /www.particulars.eu”/ -> /www.particulars.xn--eu-02t/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.sunseeker.deÂ/ -> /www.sunseeker.xn--de-qia/
  /Õ÷¸åº¯Ó¢ÎÄ1.files/ -> /xn-- o 1-7ea01bd9ezbpw814d5ma.files/
  /”http/ -> /xn--http-fb7a/
  /中文11.files/ -> /xn--11-py2cs33g.files/
  /自由.”/ -> /xn--sny74y.xn--ivg/
  /ó/ -> /xn--kda/

It seems the URL extraction code is an area that calls for much more
love in the near future. Failing to recognize Unicode delimiters is
one obvious defect, but a trickier problem is probably recognizing
word boundaries in Chinese, Japanese, and Korean writing.
It would be valuable to reach out to the contributors of this project:
  http://emaillab.jp/spamassassin/ja-patch/
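
To illustrate the first defect only, here is a rough Python sketch of
trimming a host name candidate at the first character that cannot be
part of a host name, using Unicode general categories. This is an
assumption about one possible approach, not the existing extraction
code; trim_host_candidate is a hypothetical name.

```python
import unicodedata

# Dots accepted as label separators by IDNA 2003 (RFC 3490).
IDNA_DOTS = "\u002e\u3002\uff0e\uff61"


def trim_host_candidate(text):
    """Cut text at the first character that is not a letter, mark,
    digit, hyphen or IDNA dot, then drop leading/trailing dots and
    hyphens."""
    kept = []
    for ch in text:
        if ch == "-" or ch in IDNA_DOTS:
            kept.append(ch)
        elif unicodedata.category(ch)[0] in "LMN":
            # L* letters, M* combining marks, N* numbers
            kept.append(ch)
        else:
            # Whitespace, punctuation, symbols: treat as a delimiter.
            break
    return "".join(kept).strip(IDNA_DOTS + "-")


if __name__ == "__main__":
    for sample in ("www.eme2015.org】", "www.pdma.org)会员",
                   "www.hotchips.org   For", "t.c…”"):
        print(repr(sample), "->", repr(trim_host_candidate(sample)))
```

This handles trailing brackets, quotes, bullets and spaces like those
in the samples above, but it does nothing for the harder case: CJK
text running directly into a host name consists of letters too, so
category checks alone cannot find that boundary. Stray letters such
as the trailing "Â" in the sunseeker.de sample are not caught either.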
