Dear colleagues,

There was a question about UTF-8 support by major Whois providers during last 
week's DB-WG session at RIPE88.

During the UTF-8 discussion in December I checked the other RIRs as follows:

LACNIC: only Latin-1 encoded characters are accepted in updates (UTF-8 is 
ignored) but UTF-8 is returned on port 43.
        Example: whois -h whois.lacnic.net PAP12
APNIC: only Latin-1 is returned
        Example: whois -h testwhois.apnic.net YYYYMMDD-MNT

Subsequently I tested the remaining RIRs to be sure:

ARIN: UTF-8 is supported in the RPSL object and UTF-8 is returned on port 43.
        Example: whois -h whois.arin.net POC SHRYA12-ARIN
AFRINIC: UTF-8 characters are accepted in updates and UTF-8 is returned on port 
43.
        Example: whois -h whois.afrinic.net SHRYANE-MNT

RIPE stores Latin-1 and returns Latin-1 on port 43.

So in summary, 3 RIRs (LACNIC, ARIN and AFRINIC) return UTF-8 and 2 RIRs 
(APNIC and RIPE) return Latin-1 on port 43.
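
Since the encoding returned on port 43 differs per RIR, a client that queries 
several of them has to cope with both. Below is a minimal sketch of one way to 
do that, assuming a plain RFC 3912 query (the query string followed by CRLF, 
response read until the server closes the connection). The function names 
query_whois and decode_response are hypothetical, and the "try UTF-8 first, 
then fall back to Latin-1" rule is only a heuristic: a pure ASCII response is 
reported as UTF-8, and some Latin-1 byte sequences also happen to be valid 
UTF-8.

```python
import socket
from typing import Tuple


def query_whois(host: str, query: str, port: int = 43,
                timeout: float = 10.0) -> bytes:
    """Send one query over the Whois protocol and return the raw response bytes.

    Hypothetical helper: no server-specific flags are added to the query.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # RFC 3912: the request is a single line terminated by CRLF.
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)


def decode_response(raw: bytes) -> Tuple[str, str]:
    """Decode a response, returning (text, name of the encoding that was used).

    Heuristic only: strict UTF-8 is tried first; if the bytes are not valid
    UTF-8 they are decoded as Latin-1, which accepts every byte value.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"
```

For example, decode_response(query_whois("whois.afrinic.net", "SHRYANE-MNT")) 
returns the decoded text together with the encoding the heuristic settled on.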

Regards
Ed Shryane
RIPE NCC



> On 2 May 2024, at 16:02, Edward Shryane <[email protected]> wrote:
> 
> Dear colleagues,
> 
> To follow up on the UTF-8 discussion in January, the DB team plans to 
> implement support for UTF-8 in 3 phases:
> 
> (1) Add a flag to allow a client to choose a character set
> 
> In the Whois release 1.112, we have added the "-Z / --charset" query flag to 
> allow clients to specify which character set they expect. The server response 
> will encode RPSL objects using that character set.
> 
> This new flag can already be tested in the RC environment, e.g. the 
> SHRYANE-MNT object contains "remarks:" attributes with non-ASCII (but still 
> latin-1) characters:
> 
>    $ whois -h whois-rc.ripe.net -r shryane-mnt
>    $ whois -h whois-rc.ripe.net -r -Z utf8 shryane-mnt
> 
> This flag has no impact on the default behaviour of the RIPE database. This 
> change only affects port 43, and the default character set remains latin-1.
> 
> This flag is already useful, for example to capture responses as UTF-8 in a 
> file or to use UTF-8 encoding in your terminal. In future, if the default on 
> port 43 changes to UTF-8, then clients can keep latin-1 by using 
> "-Z/--charset latin1".
> 
> (2) Convert the database schema to UTF-8
> 
> In the following Whois release, the DB team plans to switch the RIPE database 
> schema character set from latin-1 to UTF-8. This will allow Whois to store 
> UTF-8 strings in the database index tables.
> 
> Switching the database schema character set will involve about 1 hour of 
> downtime for Whois updates; Whois queries will not be affected. We will 
> announce this change in advance.
> 
> This change will have no impact on the default behaviour of the RIPE 
> database. All interfaces will behave as before, and RPSL objects will remain 
> latin-1 encoded internally.
> 
> (3) Allow UTF-8 to be used in RPSL objects
> 
> Once the RIPE database schema supports the UTF-8 character set, the DB team 
> will create a further Whois release that will allow UTF-8 to be used in RPSL 
> objects, in addition to the index tables.
> 
> The default behaviour of the RIPE database will remain the same. All 
> interfaces will behave as before, but RPSL objects will use UTF-8 internally.
> 
> In future, if the DB-WG decides to allow UTF-8 characters in RPSL, the 
> database will already support it.
> 
> Regards
> Ed Shryane
> RIPE NCC
> 
> 
>> On 18 Jan 2024, at 10:34, Edward Shryane <[email protected]> wrote:
>> 
>> Dear colleagues,
>> 
>> Based on the discussion regarding UTF-8 in the RIPE database during the 
>> interim meeting yesterday, I suggest that we implement support for UTF-8 in 
>> the database (i.e. convert the schema and add a flag to allow a client to 
>> choose a character set), but we do not allow additional characters for now, 
>> pending further DB-WG discussion. Our intention is to lay the groundwork for 
>> future support, without breaking existing functionality. If you have any 
>> concerns or objections please let me know.
>> 
>> We will now prepare an implementation plan / impact analysis of these 
>> changes.
>> 
>> Regards
>> Ed Shryane
>> RIPE NCC
>> 
>> 
>>> On 24 Nov 2023, at 10:03, Edward Shryane via db-wg <[email protected]> wrote:
>>> 
>>> Dear colleagues,
>>> 
>>> Currently the RIPE database only allows a subset of ASCII characters in the 
>>> "org-name:", "person:" and "role:" attributes, for a few reasons including:
>>> 
>>> * These attributes are also a look-up key and the Whois protocol does not 
>>> allow specifying character sets in queries.
>>> * RPSL names are ASCII according to RFC2622
>>> * Using a normalised name makes the object easier to query
>>> * A normalised name is easier to read and interpret
>>> 
>>> However, there are some drawbacks to forcing names to only use a subset of 
>>> ASCII characters:
>>> 
>>> * Organisations, roles and persons cannot use their actual name if it 
>>> includes characters outside this subset.
>>> * Normalisation is not standard, but is an interpretation done by each 
>>> maintainer, e.g. characters could be excluded or converted in different 
>>> ways.
>>> 
>>> Since we support the Latin-1 character set in the RIPE database, I propose 
>>> we also allow non-ASCII Latin-1 characters in these attributes.
>>> 
>>> Querying for a name can be done either using the latin-1 characters 
>>> (proposed) or a normalised, ASCII representation (currently). The 
>>> normalised version will be generated by Whois and stored in a database 
>>> index for querying. The primary key will also be generated from the 
>>> normalised version.
>>> 
>>> Please let me know your feedback.
>>> 
>>> Regards
>>> Ed Shryane
>>> RIPE NCC
>>> 
>>> ---
>>> 
>>> Whois attribute verbose description (copied from the help text).
>>> 
>>> org-name
>>> --------
>>> Specifies the name of the organisation that this organisation object
>>> represents in the RIPE Database. This is an ASCII-only text attribute.
>>> The restriction is because this attribute is a look-up key and the
>>> whois protocol does not allow specifying character sets in queries.
>>> The user can put the name of the organisation in non-ASCII character
>>> sets in the "descr:" attribute if required.
>>> 
>>> A list of 1 to 30 words separated by white space. 
>>> A word is made up of ASCII alphanumeric characters and additionally: 
>>> ][)(._"*@,&:!'`+/-
>>> A word may have up to 64 characters and is not case sensitive. 
>>> Each word can have any combination of the above characters with no 
>>> restriction on the start or end of a word.
>>> 
>>> person
>>> ------
>>> Specifies the full name of an administrative, technical or zone
>>> contact person for other objects in the database.
>>> 
>>> It should contain 2 to 10 words.
>>> A word is made up of ASCII alphanumeric characters and additionally: .`'_-
>>> The first word should begin with a letter.
>>> At least one other word should also begin with a letter.
>>> Max 64 characters can be used in each word.
>>> 
>>> role
>>> ----
>>> Specifies the full name of a role entity, e.g. RIPE DBM.
>>> 
>>> A list of 1 to 30 words separated by white space.
>>> A word is made up of ASCII alphanumeric characters and additionally: 
>>> ][)(._"*@,&:!'`+/-
>>> A word may have up to 64 characters and is not case sensitive. 
>>> Each word can have any combination of the above characters with no 
>>> restriction on the start or end of a word.
>>> 
>>> 
>>> -- 
>>> 
>>> To unsubscribe from this mailing list, get a password reminder, or change 
>>> your subscription options, please visit: 
>>> https://lists.ripe.net/mailman/listinfo/db-wg
>> 
> 


