Hi Offray,

> On 27 Jul 2018, at 12:39, Offray Vladimir Luna Cárdenas <[email protected]> wrote:
>
> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> url that contains non Latin characters[1] and then I got an
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

I am on holiday, so I cannot go too deep into this, but AFAIU the URL is wrong (or it assumes a specific context with a non-standard encoding).

In a URL's query part, non-ASCII data is first UTF-8 encoded, then percent encoded (this is the modern way).

I don't read Chinese, so it is hard to infer much from the original site, but I am assuming the search is for '喀什', a city called Kashgar, https://en.wikipedia.org/wiki/Kashgar_(disambiguation).

The string in question can be written as (to avoid copy/paste problems):

  String with: 21888 asCharacter with: 20160 asCharacter.

The encoding in a URL has to be:

  ZnPercentEncoder new encode: (String with: 21888 asCharacter with: 20160 asCharacter).

This gives us, for example, the following URL:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl.

This parses OK and contains the correctly encoded string (decoded in the URL object):

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl queryAt: #q.

If you copy/paste that URL in your browser, it should resolve to stuff about Kashgar.

Obviously the website www.bidchance.com does something else (non-standard?).

HTH,

Sven

> Thanks,
>
> Offray
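
For reference, here is a minimal playground sketch of the two steps described above (UTF-8 encode, then percent encode), followed by the reverse direction with the bytes from the original URL. It assumes a Pharo image with Zinc loaded; the variable names are only illustrative. I am also assuming that the default ZnPercentEncoder (UTF-8) signals ZnInvalidUTF8 on decode, which matches the error reported when parsing the URL. As far as I can tell, the bytes BF A6 and CA B2 are the GB2312/GBK codes for the same two characters, which would explain the URL; the sketch does not rely on any GBK support in Zinc.

```smalltalk
| kashgar utf8Bytes encoded url roundTrips decodeResult |

"The two characters of the search term, built from code points to avoid copy/paste problems."
kashgar := String with: 21888 asCharacter with: 20160 asCharacter.

"Step 1: UTF-8 encode. The bytes should be E5 96 80 E4 BB 80 in hex."
utf8Bytes := kashgar utf8Encoded.

"Steps 1 and 2 together: ZnPercentEncoder UTF-8 encodes, then percent encodes each unsafe byte."
encoded := ZnPercentEncoder new encode: kashgar.

"Build a URL from the encoded value and read the decoded value back from the URL object."
url := ('https://www.google.com/search?q=' , encoded) asUrl.
roundTrips := (url queryAt: #q) = kashgar.

"Reverse direction with the bytes from the original URL.
 BF is a UTF-8 continuation byte and cannot start a character,
 so a UTF-8 decode is expected to fail rather than return a string."
decodeResult := [ ZnPercentEncoder new decode: '%BF%A6%CA%B2' ]
	on: ZnInvalidUTF8
	do: [ :error | error messageText ].
```

Inspecting encoded should show the same percent-encoded value as in the Google URL above, and roundTrips should be true. If the handler runs, decodeResult holds the error's message text instead of a decoded string.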
