Dear colleagues,
There was a question about UTF-8 support by major Whois providers during last
week's DB-WG session at RIPE88.
During the UTF-8 discussion in December I checked the other RIRs as follows:
LACNIC: only Latin-1 encoded characters are accepted in updates (UTF-8 is
ignored) but UTF-8 is returned on port 43.
Example: whois -h whois.lacnic.net PAP12
APNIC: only Latin-1 is returned on port 43.
Example: whois -h testwhois.apnic.net YYYYMMDD-MNT
Subsequently I tested the other RIRs to be sure:
ARIN: UTF-8 is supported in the RPSL object and UTF-8 is returned on port 43.
Example: whois -h whois.arin.net POC SHRYA12-ARIN
AFRINIC: UTF-8 characters are accepted in updates and UTF-8 is returned on port
43.
Example: whois -h whois.afrinic.net SHRYANE-MNT
RIPE: Latin-1 is stored and Latin-1 is returned on port 43.
So in summary, 3 RIRs return UTF-8 and 2 RIRs return Latin-1 on port 43.
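
For anyone consuming port 43 output from several RIRs, here is a minimal Python
sketch of one way to cope with the mixed encodings described above. It is only
an illustration and not part of Whois: the function names fetch_raw and
decode_response are hypothetical, and the approach (try strict UTF-8 first,
then fall back to Latin-1) is a heuristic I am assuming is good enough for
typical responses. A Latin-1 byte sequence that happens to be valid UTF-8
would be decoded as UTF-8.

```python
import socket
from typing import Tuple


def fetch_raw(host: str, query: str, timeout: float = 10.0) -> bytes:
    """Send one query to a Whois server on port 43 and return the raw bytes.

    Hypothetical helper: it does no retries and assumes an ASCII query.
    """
    with socket.create_connection((host, 43), timeout=timeout) as sock:
        # A Whois query is a single line terminated by CRLF.
        sock.sendall((query + "\r\n").encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closes the connection when done
                break
            chunks.append(chunk)
    return b"".join(chunks)


def decode_response(raw: bytes) -> Tuple[str, str]:
    """Decode a response, returning (text, name of the charset that worked)."""
    try:
        # Strict decoding rejects most non-ASCII Latin-1 byte sequences.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"
```

For example, decode_response(fetch_raw("whois.afrinic.net", "SHRYANE-MNT"))
should report which of the two character sets matched the bytes received. A
pure ASCII response is reported as UTF-8, since ASCII is a subset of both.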
Regards
Ed Shryane
RIPE NCC
> On 2 May 2024, at 16:02, Edward Shryane <[email protected]> wrote:
>
> Dear colleagues,
>
> To follow up on the UTF-8 discussion in January, the DB team plans to
> implement support for UTF-8 in 3 phases:
>
> (1) Add a flag to allow a client to choose a character set
>
> In Whois release 1.112, we have added the "-Z / --charset" query flag to
> allow clients to specify which character set they expect. The server response
> will encode RPSL objects using that character set.
>
> This new flag can already be tested in the RC environment, e.g. the
> SHRYANE-MNT object contains "remarks:" attributes with non-ASCII (but still
> latin-1) characters:
>
> $ whois -h whois-rc.ripe.net -r shryane-mnt
> $ whois -h whois-rc.ripe.net -r -Z utf8 shryane-mnt
>
> This flag has no impact on the default behaviour of the RIPE database. This
> change only affects port 43, and the default character set remains latin-1.
>
> This flag is already useful, for example to capture responses as UTF-8 in a
> file or to use UTF-8 encoding in your terminal. In future, if the default on
> port 43 changes to UTF-8, then clients can keep latin-1 by using
> "-Z/--charset latin1".
>
> (2) Convert the database schema to UTF-8
>
> In the following Whois release, the DB team plans to switch the RIPE database
> schema character set from latin-1 to UTF-8. This will allow Whois to store
> UTF-8 strings in the database index tables.
>
> Switching the database schema character set will involve about 1 hour of
> downtime for Whois updates. Whois queries will not be affected. We will
> announce this change in advance.
>
> This change will have no impact on the default behaviour of the RIPE
> database. All interfaces will behave as before, and RPSL objects will remain
> latin-1 encoded internally.
>
> (3) Allow UTF-8 to be used in RPSL objects
>
> Once the RIPE database schema supports the UTF-8 character set, the DB team
> will create a further Whois release that will allow UTF-8 to be used in RPSL
> objects, in addition to the index tables.
>
> The default behaviour of the RIPE database will remain the same. All
> interfaces will behave as before, but RPSL objects will use UTF-8 internally.
>
> In future, if the DB-WG decides to allow UTF-8 characters in RPSL, the
> database will already support it.
>
> Regards
> Ed Shryane
> RIPE NCC
>
>
>> On 18 Jan 2024, at 10:34, Edward Shryane <[email protected]> wrote:
>>
>> Dear colleagues,
>>
>> Based on the discussion regarding UTF-8 in the RIPE database during the
>> interim meeting yesterday, I suggest that we implement support for UTF-8 in
>> the database (i.e. convert the schema and add a flag to allow a client to
>> choose a character set), but we do not allow additional characters for now,
>> pending further DB-WG discussion. Our intention is to lay the groundwork for
>> future support, without breaking existing functionality. If you have any
>> concerns or objections please let me know.
>>
>> We will now prepare an implementation plan / impact analysis of these
>> changes.
>>
>> Regards
>> Ed Shryane
>> RIPE NCC
>>
>>
>>> On 24 Nov 2023, at 10:03, Edward Shryane via db-wg <[email protected]> wrote:
>>>
>>> Dear colleagues,
>>>
>>> Currently the RIPE database only allows a subset of ASCII characters in the
>>> "org-name:", "person:" and "role:" attributes, for a few reasons including:
>>>
>>> * These attributes are also a look-up key and the Whois protocol does not
>>> allow specifying character sets in queries.
>>> * RPSL names are ASCII according to RFC 2622.
>>> * Using a normalised name makes the object easier to query.
>>> * A normalised name is easier to read and interpret.
>>>
>>> However, there are some drawbacks to forcing names to only use a subset of
>>> ASCII characters:
>>>
>>> * Organisations, roles and persons cannot use their actual name if it
>>> includes characters outside this subset.
>>> * Normalisation is not standard, but is an interpretation done by each
>>> maintainer, e.g. characters could be excluded or converted in different
>>> ways.
>>>
>>> Since we support the Latin-1 character set in the RIPE database, I propose
>>> we also allow non-ASCII Latin-1 characters in these attributes.
>>>
>>> Querying for a name can be done using either the Latin-1 characters
>>> (proposed) or a normalised ASCII representation (as now). The
>>> normalised version will be generated by Whois and stored in a database
>>> index for querying. The primary key will also be generated from the
>>> normalised version.
>>>
>>> Please let me know your feedback.
>>>
>>> Regards
>>> Ed Shryane
>>> RIPE NCC
>>>
>>> ---
>>>
>>> Whois attribute verbose description (copied from the help text).
>>>
>>> org-name
>>> --------
>>> Specifies the name of the organisation that this organisation object
>>> represents in the RIPE Database. This is an ASCII-only text attribute.
>>> The restriction is because this attribute is a look-up key and the
>>> whois protocol does not allow specifying character sets in queries.
>>> The user can put the name of the organisation in non-ASCII character
>>> sets in the "descr:" attribute if required.
>>>
>>> A list of 1 to 30 words separated by white space.
>>> A word is made up of ASCII alphanumeric characters and additionally:
>>> ][)(._"*@,&:!'`+/-
>>> A word may have up to 64 characters and is not case sensitive.
>>> Each word can have any combination of the above characters with no
>>> restriction on the start or end of a word.
>>>
>>> person
>>> ------
>>> Specifies the full name of an administrative, technical or zone
>>> contact person for other objects in the database.
>>>
>>> It should contain 2 to 10 words.
>>> A word is made up of ASCII alphanumeric characters and additionally: .`'_-
>>> The first word should begin with a letter.
>>> At least one other word should also begin with a letter.
>>> Max 64 characters can be used in each word.
>>>
>>> role
>>> ----
>>> Specifies the full name of a role entity, e.g. RIPE DBM.
>>>
>>> A list of 1 to 30 words separated by white space.
>>> A word is made up of ASCII alphanumeric characters and additionally:
>>> ][)(._"*@,&:!'`+/-
>>> A word may have up to 64 characters and is not case sensitive.
>>> Each word can have any combination of the above characters with no
>>> restriction on the start or end of a word.
>>>
>>>
>>> --
>>>
>>> To unsubscribe from this mailing list, get a password reminder, or change
>>> your subscription options, please visit:
>>> https://lists.ripe.net/mailman/listinfo/db-wg
>>
>