Hi Offray,

> On 27 Jul 2018, at 12:39, Offray Vladimir Luna Cárdenas 
> <[email protected]> wrote:
> 
> Hi,
> 
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> url that contains non Latin characters[1] and then I got an
> ZnInvalidUTF8 error.
> 
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
> 
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

I am on holiday, so I cannot go too deep into this, but AFAIU the URL is wrong 
(or it assumes a specific context with a non-standard encoding).

In a URL's query part, non-ASCII data is first UTF-8 encoded, then percent 
encoded (this is the modern way).
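
As a minimal sketch of those two steps, here they are spelled out in Python, 
used only so the result can be checked outside a Pharo image. The helper name 
percent_encode_utf8 is made up for this illustration, and the sample string 
is built from code points 21888 and 20160.

```python
from urllib.parse import quote

# RFC 3986 unreserved characters, which never need percent-encoding.
UNRESERVED = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: UTF-8 encode first, then percent-encode each byte."""
    out = []
    for byte in text.encode("utf-8"):      # step 1: characters -> UTF-8 bytes
        if byte in UNRESERVED:
            out.append(chr(byte))          # safe ASCII byte, kept as-is
        else:
            out.append(f"%{byte:02X}")     # step 2: byte -> %XX
    return "".join(out)


sample = chr(21888) + chr(20160)           # two non-ASCII characters
manual = percent_encode_utf8(sample)
library = quote(sample, safe="")           # standard library, UTF-8 by default
```

Both manual and library should hold the same percent-encoded string, three 
bytes per character.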

I don't read Chinese, so it is hard to infer much from the original site, but I 
am assuming the search is for '喀什', a city called Kashgar, 
https://en.wikipedia.org/wiki/Kashgar_(disambiguation).

The string in question can be written as (to avoid copy/paste problems):

  String with: 21888 asCharacter with: 20160 asCharacter.

The encoding in a URL has to be:

  ZnPercentEncoder new encode: (String with: 21888 asCharacter with: 20160 asCharacter).

This gives us for example the following URL:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl.

This parses OK, and the URL object holds the correct string in decoded form:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl queryAt: #q.

If you copy/paste that URL in your browser it should resolve to stuff about 
Kashgar.

Obviously the website www.bidchance.com does something else (non-standard?).
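
One possible explanation, and it is only a guess that I did not verify against 
the site itself: the bytes %BF%A6%CA%B2 look like GBK (a legacy Chinese 
encoding) rather than UTF-8. A quick check in Python, used here only because 
its standard library ships a GBK codec; the variable names are illustrative:

```python
from urllib.parse import quote, unquote

site_value = "%BF%A6%CA%B2"                # query value from the original URL

# Strict UTF-8 decoding of these bytes fails, which matches the kind of
# error a UTF-8 based decoder reports.
try:
    unquote(site_value, errors="strict")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the site percent-encodes GBK bytes instead of UTF-8 bytes.
as_gbk = unquote(site_value, encoding="gbk")

# Re-encode the decoded text the standard way (UTF-8, then percent-encode).
standard = quote(as_gbk, safe="")
```

If the assumption holds, as_gbk is the two-character search term and standard 
is the UTF-8 based form that parses as a URL without errors. Whether the site 
accepts that form is a separate question.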

HTH,

Sven

> Thanks,
> 
> Offray
