Hello community,

here is the log from the commit of package python-idna for openSUSE:Factory checked in at 2017-11-12 17:59:35
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-idna (Old)
 and      /work/SRC/openSUSE:Factory/.python-idna.new (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-idna" Sun Nov 12 17:59:35 2017 rev:4 rq:540455 version:2.6 Changes: -------- --- /work/SRC/openSUSE:Factory/python-idna/python-idna.changes 2017-09-23 21:32:48.574278873 +0200 +++ /work/SRC/openSUSE:Factory/.python-idna.new/python-idna.changes 2017-11-12 17:59:42.694577152 +0100 @@ -1,0 +2,15 @@ +Thu Nov 9 18:53:55 UTC 2017 - [email protected] + +- update to version 2.6: + * Allows generation of IDNA and UTS 46 table data for different + versions of Unicode, by deriving properties directly from Unicode + data. + * Ability to generate RFC 5892/IANA-style table data + * Diagnostic output of IDNA-related Unicode properties and derived + calculations for a given codepoint + * Support for idna.__version__ to report version + * Support for idna.idnadata.__version__ and + idna.uts46data.__version__ to report Unicode version of underlying + IDNA and UTS 46 data respectively. + +------------------------------------------------------------------- Old: ---- idna-2.5.tar.gz New: ---- idna-2.6.tar.gz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ python-idna.spec ++++++ --- /var/tmp/diff_new_pack.0TjuST/_old 2017-11-12 17:59:44.914496294 +0100 +++ /var/tmp/diff_new_pack.0TjuST/_new 2017-11-12 17:59:44.914496294 +0100 @@ -18,7 +18,7 @@ %{?!python_module:%define python_module() python-%{**} python3-%{**}} Name: python-idna -Version: 2.5 +Version: 2.6 Release: 0 Summary: Internationalized Domain Names in Applications (IDNA) License: BSD-3-Clause ++++++ idna-2.5.tar.gz -> idna-2.6.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/HISTORY.rst new/idna-2.6/HISTORY.rst --- old/idna-2.5/HISTORY.rst 2017-03-07 04:25:38.000000000 +0100 +++ new/idna-2.6/HISTORY.rst 2017-08-08 05:42:40.000000000 +0200 @@ -3,6 +3,20 @@ History ------- +2.6 (2017-08-08) +++++++++++++++++ + +- Allows generation of IDNA and UTS 46 table data for different + 
versions of Unicode, by deriving properties directly from + Unicode data. +- Ability to generate RFC 5892/IANA-style table data +- Diagnostic output of IDNA-related Unicode properties and + derived calculations for a given codepoint +- Support for idna.__version__ to report version +- Support for idna.idnadata.__version__ and + idna.uts46data.__version__ to report Unicode version of + underlying IDNA and UTS 46 data respectively. + 2.5 (2017-03-07) ++++++++++++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/PKG-INFO new/idna-2.6/PKG-INFO --- old/idna-2.5/PKG-INFO 2017-03-07 04:27:22.000000000 +0100 +++ new/idna-2.6/PKG-INFO 2017-08-08 05:43:08.000000000 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 1.1 Name: idna -Version: 2.5 +Version: 2.6 Summary: Internationalized Domain Names in Applications (IDNA) Home-page: https://github.com/kjd/idna Author: Kim Davies @@ -171,6 +171,41 @@ when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO or CONTEXTJ but the contextual requirements are not satisfied.) + Building and Diagnostics + ------------------------ + + The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for + performance. These tables are derived from computing against eligibility criteria + in the respective standards. These tables are computed using the command-line + script ``tools/idna-data``. + + This tool will fetch relevant tables from the Unicode Consortium and perform the + required calculations to identify eligibility. It has three main modes: + + * ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``, + the pre-calculated lookup tables using for IDNA and UTS 46 conversions. Implementors + who wish to track this library against a different Unicode version may use this tool + to manually generate a different version of the ``idnadata.py`` and ``uts46data.py`` + files. + + * ``idna-data make-table``. 
Generate a table of the IDNA disposition + (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC + 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_. + + * ``idna-data U+0061``. Prints debugging output on the various properties + associated with an individual Unicode codepoint (in this case, U+0061), that are + used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging + or analysis. + + The tool accepts a number of arguments, described using ``idna-data -h``. Most notably, + the ``--version`` argument allows the specification of the version of Unicode to use + in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata`` + will generate library data against Unicode 9.0.0. + + Note that this script requires Python 3, but all generated library data will work + in Python 2.6+. + + Testing ------- diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/README.rst new/idna-2.6/README.rst --- old/idna-2.5/README.rst 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/README.rst 2017-08-08 05:42:40.000000000 +0200 @@ -163,6 +163,41 @@ when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO or CONTEXTJ but the contextual requirements are not satisfied.) +Building and Diagnostics +------------------------ + +The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for +performance. These tables are derived from computing against eligibility criteria +in the respective standards. These tables are computed using the command-line +script ``tools/idna-data``. + +This tool will fetch relevant tables from the Unicode Consortium and perform the +required calculations to identify eligibility. It has three main modes: + +* ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``, + the pre-calculated lookup tables using for IDNA and UTS 46 conversions. 
Implementors + who wish to track this library against a different Unicode version may use this tool + to manually generate a different version of the ``idnadata.py`` and ``uts46data.py`` + files. + +* ``idna-data make-table``. Generate a table of the IDNA disposition + (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC + 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_. + +* ``idna-data U+0061``. Prints debugging output on the various properties + associated with an individual Unicode codepoint (in this case, U+0061), that are + used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging + or analysis. + +The tool accepts a number of arguments, described using ``idna-data -h``. Most notably, +the ``--version`` argument allows the specification of the version of Unicode to use +in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata`` +will generate library data against Unicode 9.0.0. + +Note that this script requires Python 3, but all generated library data will work +in Python 2.6+. 
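Reviewer note, not part of the upstream diff: the changelog for 2.6 says the package now reports its own version through idna.__version__, and the Unicode version of the generated tables through idna.idnadata.__version__ and idna.uts46data.__version__. A minimal sketch of how a packager might check which Unicode version the shipped tables were built against is given here. The helper name report_versions and its dictionary keys are hypothetical, chosen only for illustration; the sketch assumes nothing beyond those three attributes and falls back to None when idna is not installed or is older than 2.6.

```python
import importlib


def report_versions():
    """Collect the version strings that idna 2.6 is described as exposing.

    Returns a dict whose values are None when idna is not importable or
    when a module does not define __version__ (e.g. idna older than 2.6).
    """
    # Hypothetical labels mapped to the real module names from the changelog.
    modules = {
        "package": "idna",
        "idna_tables_unicode": "idna.idnadata",
        "uts46_tables_unicode": "idna.uts46data",
    }
    versions = {}
    for label, module_name in modules.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # idna is not installed in this environment.
            versions[label] = None
            continue
        versions[label] = getattr(module, "__version__", None)
    return versions


if __name__ == "__main__":
    for label, value in report_versions().items():
        print("{}: {}".format(label, value))
```

If the tables were regenerated with a different ``--version`` argument to ``idna-data make-libdata``, the two table entries would be expected to show that Unicode version rather than the default.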
+ + Testing ------- diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna/__init__.py new/idna-2.6/idna/__init__.py --- old/idna-2.5/idna/__init__.py 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/idna/__init__.py 2017-06-28 16:45:32.000000000 +0200 @@ -1 +1,2 @@ +from .package_data import __version__ from .core import * diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna/idnadata.py new/idna-2.6/idna/idnadata.py --- old/idna-2.5/idna/idnadata.py 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/idna/idnadata.py 2017-08-08 05:42:40.000000000 +0200 @@ -1,5 +1,6 @@ -# This file is automatically generated by build-idnadata.py +# This file is automatically generated by tools/idna-data +__version__ = "6.3.0" scripts = { 'Greek': ( 0x37000000374, diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna/package_data.py new/idna-2.6/idna/package_data.py --- old/idna-2.5/idna/package_data.py 1970-01-01 01:00:00.000000000 +0100 +++ new/idna-2.6/idna/package_data.py 2017-06-28 16:26:38.000000000 +0200 @@ -0,0 +1,2 @@ +__version__ = '2.6' + diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna/uts46data.py new/idna-2.6/idna/uts46data.py --- old/idna-2.5/idna/uts46data.py 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/idna/uts46data.py 2017-08-08 05:42:40.000000000 +0200 @@ -1,9 +1,10 @@ -# This file is automatically generated by tools/build-uts46data.py +# This file is automatically generated by tools/idna-data # vim: set fileencoding=utf-8 : """IDNA Mapping Table from UTS46.""" +__version__ = "6.3.0" def _seg_0(): return [ (0x0, '3'), diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna.egg-info/PKG-INFO new/idna-2.6/idna.egg-info/PKG-INFO --- old/idna-2.5/idna.egg-info/PKG-INFO 2017-03-07 
04:27:22.000000000 +0100 +++ new/idna-2.6/idna.egg-info/PKG-INFO 2017-08-08 05:43:08.000000000 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 1.1 Name: idna -Version: 2.5 +Version: 2.6 Summary: Internationalized Domain Names in Applications (IDNA) Home-page: https://github.com/kjd/idna Author: Kim Davies @@ -171,6 +171,41 @@ when the codepoint is illegal based on its positional context (i.e. it is CONTEXTO or CONTEXTJ but the contextual requirements are not satisfied.) + Building and Diagnostics + ------------------------ + + The IDNA and UTS 46 functionality relies upon pre-calculated lookup tables for + performance. These tables are derived from computing against eligibility criteria + in the respective standards. These tables are computed using the command-line + script ``tools/idna-data``. + + This tool will fetch relevant tables from the Unicode Consortium and perform the + required calculations to identify eligibility. It has three main modes: + + * ``idna-data make-libdata``. Generates ``idnadata.py`` and ``uts46data.py``, + the pre-calculated lookup tables using for IDNA and UTS 46 conversions. Implementors + who wish to track this library against a different Unicode version may use this tool + to manually generate a different version of the ``idnadata.py`` and ``uts46data.py`` + files. + + * ``idna-data make-table``. Generate a table of the IDNA disposition + (e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix B.1 of RFC + 5892 and the pre-computed tables published by `IANA <http://iana.org/>`_. + + * ``idna-data U+0061``. Prints debugging output on the various properties + associated with an individual Unicode codepoint (in this case, U+0061), that are + used to assess the IDNA and UTS 46 status of a codepoint. This is helpful in debugging + or analysis. + + The tool accepts a number of arguments, described using ``idna-data -h``. 
Most notably, + the ``--version`` argument allows the specification of the version of Unicode to use + in computing the table data. For example, ``idna-data --version 9.0.0 make-libdata`` + will generate library data against Unicode 9.0.0. + + Note that this script requires Python 3, but all generated library data will work + in Python 2.6+. + + Testing ------- diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna.egg-info/SOURCES.txt new/idna-2.6/idna.egg-info/SOURCES.txt --- old/idna-2.5/idna.egg-info/SOURCES.txt 2017-03-07 04:27:22.000000000 +0100 +++ new/idna-2.6/idna.egg-info/SOURCES.txt 2017-08-08 05:43:08.000000000 +0200 @@ -10,11 +10,11 @@ idna/core.py idna/idnadata.py idna/intranges.py +idna/package_data.py idna/uts46data.py idna.egg-info/PKG-INFO idna.egg-info/SOURCES.txt idna.egg-info/dependency_links.txt -idna.egg-info/pbr.json idna.egg-info/top_level.txt tests/IdnaTest.txt.gz tests/__init__.py @@ -23,6 +23,5 @@ tests/test_idna_compat.py tests/test_idna_uts46.py tests/test_intranges.py -tools/build-idnadata.py -tools/build-uts46data.py +tools/idna-data tools/intranges.py \ No newline at end of file diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/idna.egg-info/pbr.json new/idna-2.6/idna.egg-info/pbr.json --- old/idna-2.5/idna.egg-info/pbr.json 2017-03-07 04:27:22.000000000 +0100 +++ new/idna-2.6/idna.egg-info/pbr.json 1970-01-01 01:00:00.000000000 +0100 @@ -1 +0,0 @@ -{"is_release": true, "git_version": "0088bfc"} \ No newline at end of file diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/setup.py new/idna-2.6/setup.py --- old/idna-2.5/setup.py 2017-03-07 04:25:15.000000000 +0100 +++ new/idna-2.6/setup.py 2017-06-28 16:56:52.000000000 +0200 @@ -9,7 +9,6 @@ import io, sys from setuptools import setup -version = "2.5" def main(): @@ -17,10 +16,13 @@ if python_version < (2,6): raise 
SystemExit("Sorry, Python 2.6 or newer required") + package_data = {} + exec(open('idna/package_data.py').read(), package_data) + arguments = { 'name': 'idna', 'packages': ['idna'], - 'version': version, + 'version': package_data['__version__'], 'description': 'Internationalized Domain Names in Applications (IDNA)', 'long_description': io.open("README.rst", encoding="UTF-8").read(), 'author': 'Kim Davies', diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/tools/build-idnadata.py new/idna-2.6/tools/build-idnadata.py --- old/idna-2.5/tools/build-idnadata.py 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/tools/build-idnadata.py 1970-01-01 01:00:00.000000000 +0100 @@ -1,110 +0,0 @@ -#!/usr/bin/env python - -from __future__ import print_function - -try: - from urllib.request import urlopen -except ImportError: - from urllib2 import urlopen -import xml.etree.ElementTree as etree - -from intranges import intranges_from_list - -UNICODE_VERSION = '6.3.0' - -SCRIPTS_URL = "http://www.unicode.org/Public/{version}/ucd/Scripts.txt" -JOININGTYPES_URL = "http://www.unicode.org/Public/{version}/ucd/ArabicShaping.txt" -IDNATABLES_URL = "http://www.iana.org/assignments/idna-tables-{version}/idna-tables-{version}.xml" -IDNATABLES_NS = "http://www.iana.org/assignments" - -# These scripts are needed to compute IDNA contextual rules, see -# https://www.iana.org/assignments/idna-tables-6.3.0#idna-tables-context - -SCRIPT_WHITELIST = sorted(['Greek', 'Han', 'Hebrew', 'Hiragana', 'Katakana']) - - -def print_optimised_list(d): - print("(") - for value in intranges_from_list(d): - print(" {},".format(hex(value))) - print(" ),") - - -def build_idnadata(version): - - print("# This file is automatically generated by build-idnadata.py\n") - - # - # Script classifications are used by some CONTEXTO rules in RFC 5891 - # - print("scripts = {") - scripts = {} - for line in urlopen(SCRIPTS_URL.format(version=version)).readlines(): - line = 
line.decode('utf-8') - line = line.strip() - if not line or line[0] == '#': - continue - if line.find('#'): - line = line.split('#')[0] - (codepoints, scriptname) = [x.strip() for x in line.split(';')] - if not scriptname in scripts: - scripts[scriptname] = set() - if codepoints.find('..') > 0: - (begin, end) = [int(x, 16) for x in codepoints.split('..')] - for cp in range(begin, end+1): - scripts[scriptname].add(cp) - else: - scripts[scriptname].add(int(codepoints, 16)) - - for script in SCRIPT_WHITELIST: - print(" '{0}':".format(script), end=' ') - print_optimised_list(scripts[script]) - - print("}") - - # - # Joining types are used by CONTEXTJ rule A.1 - # - print("joining_types = {") - scripts = {} - for line in urlopen(JOININGTYPES_URL.format(version=version)).readlines(): - line = line.decode('utf-8') - line = line.strip() - if not line or line[0] == '#': - continue - (codepoint, name, joiningtype, group) = [x.strip() for x in line.split(';')] - print(" {0}: {1},".format(hex(int(codepoint, 16)), ord(joiningtype))) - print("}") - - # - # These are the classification of codepoints into PVALID, CONTEXTO, CONTEXTJ, etc. 
- # - print("codepoint_classes = {") - classes = {} - - namespace = "{{{0}}}".format(IDNATABLES_NS) - idntables_data = urlopen(IDNATABLES_URL.format(version=version)).read() - root = etree.fromstring(idntables_data) - - for record in root.findall('{0}registry[@id="idna-tables-properties"]/{0}record'.format(namespace)): - codepoint = record.find("{0}codepoint".format(namespace)).text - prop = record.find("{0}property".format(namespace)).text - if prop in ('UNASSIGNED', 'DISALLOWED'): - continue - if not prop in classes: - classes[prop] = set() - if codepoint.find('-') > 0: - (begin, end) = [int(x, 16) for x in codepoint.split('-')] - for cp in range(begin, end+1): - classes[prop].add(cp) - else: - classes[prop].add(int(codepoint, 16)) - - for prop in classes: - print(" '{0}':".format(prop), end=' ') - print_optimised_list(classes[prop]) - - print("}") - -if __name__ == "__main__": - build_idnadata(UNICODE_VERSION) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/tools/build-uts46data.py new/idna-2.6/tools/build-uts46data.py --- old/idna-2.5/tools/build-uts46data.py 2017-03-07 04:22:47.000000000 +0100 +++ new/idna-2.6/tools/build-uts46data.py 1970-01-01 01:00:00.000000000 +0100 @@ -1,106 +0,0 @@ -#!/usr/bin/env python - -"""Create a Python version of the IDNA Mapping Table from UTS46.""" - -import re -import sys - -# pylint: disable=unused-import,import-error,undefined-variable -if sys.version_info[0] == 3: - from urllib.request import urlopen - unichr = chr -else: - from urllib2 import urlopen -# pylint: enable=unused-import,import-error,undefined-variable - -UNICODE_VERSION = '6.3.0' -SEGMENT_SIZE = 100 - -DATA_URL = "http://www.unicode.org/Public/idna/{version}/IdnaMappingTable.txt" -RE_CHAR_RANGE = re.compile(br"([0-9a-fA-F]{4,6})(?:\.\.([0-9a-fA-F]{4,6}))?$") -STATUSES = { - b"valid": ("V", False), - b"ignored": ("I", False), - b"mapped": ("M", True), - b"deviation": ("D", True), - b"disallowed": ("X", False), 
- b"disallowed_STD3_valid": ("3", False), - b"disallowed_STD3_mapped": ("3", True) -} - - -def parse_idna_mapping_table(inputstream): - """Parse IdnaMappingTable.txt and return a list of tuples.""" - ranges = [] - last_code = -1 - last = (None, None) - for line in inputstream: - line = line.strip() - if b"#" in line: - line = line.split(b"#", 1)[0] - if not line: - continue - fields = [field.strip() for field in line.split(b";")] - char_range = RE_CHAR_RANGE.match(fields[0]) - if not char_range: - raise ValueError( - "Invalid character or range {!r}".format(fields[0])) - start = int(char_range.group(1), 16) - if start != last_code + 1: - raise ValueError( - "Code point {!r} is not continguous".format(fields[0])) - if char_range.lastindex == 2: - last_code = int(char_range.group(2), 16) - else: - last_code = start - status, mapping = STATUSES[fields[1]] - if mapping: - mapping = (u"".join(unichr(int(codepoint, 16)) - for codepoint in fields[2].split()). - replace("\\", "\\\\").replace("'", "\\'")) - else: - mapping = None - if start > 255 and (status, mapping) == last: - continue - last = (status, mapping) - while True: - if mapping is not None: - ranges.append(u"(0x{0:X}, '{1}', u'{2}')".format( - start, status, mapping)) - else: - ranges.append(u"(0x{0:X}, '{1}')".format(start, status)) - start += 1 - if start > 255 or start > last_code: - break - return ranges - - -def build_uts46data(version): - """Fetch the mapping table, parse it, and rewrite idna/uts46data.py.""" - ranges = parse_idna_mapping_table(urlopen(DATA_URL.format(version=version))) - with open("idna/uts46data.py", "wb") as outputstream: - outputstream.write(b'''\ -# This file is automatically generated by tools/build-uts46data.py -# vim: set fileencoding=utf-8 : - -"""IDNA Mapping Table from UTS46.""" - - -''') - for idx, row in enumerate(ranges): - if idx % SEGMENT_SIZE == 0: - if idx!=0: - outputstream.write(b" ]\n\n") - outputstream.write(u"def _seg_{0}():\n return 
[\n".format(idx/SEGMENT_SIZE).encode("utf8")) - outputstream.write(u" {0},\n".format(row).encode("utf8")) - outputstream.write(b" ]\n\n") - outputstream.write(b"uts46data = tuple(\n") - - outputstream.write(b" _seg_0()\n") - for i in xrange(1, (len(ranges)-1)/SEGMENT_SIZE+1): - outputstream.write(u" + _seg_{0}()\n".format(i).encode("utf8")) - outputstream.write(b")\n") - - -if __name__ == "__main__": - build_uts46data(UNICODE_VERSION) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/idna-2.5/tools/idna-data new/idna-2.6/tools/idna-data --- old/idna-2.5/tools/idna-data 1970-01-01 01:00:00.000000000 +0100 +++ new/idna-2.6/tools/idna-data 2017-08-08 05:42:40.000000000 +0200 @@ -0,0 +1,671 @@ +#!/usr/bin/env python3 + +import argparse, collections, datetime, os, re, sys, unicodedata +from urllib.request import urlopen +from intranges import intranges_from_list + +if sys.version_info[0] < 3: + print("Only Python 3 supported.") + sys.exit(2) + +# PREFERRED_VERSION = 'latest' # https://github.com/kjd/idna/issues/8 +PREFERRED_VERSION = '6.3.0' +UCD_URL = 'http://www.unicode.org/Public/{version}/ucd/{filename}' +UTS46_URL = 'http://www.unicode.org/Public/idna/{version}/{filename}' + +DEFAULT_CACHE_DIR = '~/.cache/unidata' + +# Scripts affected by IDNA contextual rules +SCRIPT_WHITELIST = sorted(['Greek', 'Han', 'Hebrew', 'Hiragana', 'Katakana']) + +# Used to piece apart UTS#46 data for Jython compatibility +UTS46_SEGMENT_SIZE = 100 + +UTS46_STATUSES = { + "valid": ("V", False), + "ignored": ("I", False), + "mapped": ("M", True), + "deviation": ("D", True), + "disallowed": ("X", False), + "disallowed_STD3_valid": ("3", False), + "disallowed_STD3_mapped": ("3", True) +} + +# Exceptions are manually assigned in Section 2.6 of RFC 5892. 
+exceptions = { + 0x00DF: 'PVALID', # LATIN SMALL LETTER SHARP S + 0x03C2: 'PVALID', # GREEK SMALL LETTER FINAL SIGMA + 0x06FD: 'PVALID', # ARABIC SIGN SINDHI AMPERSAND + 0x06FE: 'PVALID', # ARABIC SIGN SINDHI POSTPOSITION MEN + 0x0F0B: 'PVALID', # TIBETAN MARK INTERSYLLABIC TSHEG + 0x3007: 'PVALID', # IDEOGRAPHIC NUMBER ZERO + 0x00B7: 'CONTEXTO', # MIDDLE DOT + 0x0375: 'CONTEXTO', # GREEK LOWER NUMERAL SIGN (KERAIA) + 0x05F3: 'CONTEXTO', # HEBREW PUNCTUATION GERESH + 0x05F4: 'CONTEXTO', # HEBREW PUNCTUATION GERSHAYIM + 0x30FB: 'CONTEXTO', # KATAKANA MIDDLE DOT + 0x0660: 'CONTEXTO', # ARABIC-INDIC DIGIT ZERO + 0x0661: 'CONTEXTO', # ARABIC-INDIC DIGIT ONE + 0x0662: 'CONTEXTO', # ARABIC-INDIC DIGIT TWO + 0x0663: 'CONTEXTO', # ARABIC-INDIC DIGIT THREE + 0x0664: 'CONTEXTO', # ARABIC-INDIC DIGIT FOUR + 0x0665: 'CONTEXTO', # ARABIC-INDIC DIGIT FIVE + 0x0666: 'CONTEXTO', # ARABIC-INDIC DIGIT SIX + 0x0667: 'CONTEXTO', # ARABIC-INDIC DIGIT SEVEN + 0x0668: 'CONTEXTO', # ARABIC-INDIC DIGIT EIGHT + 0x0669: 'CONTEXTO', # ARABIC-INDIC DIGIT NINE + 0x06F0: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT ZERO + 0x06F1: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT ONE + 0x06F2: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT TWO + 0x06F3: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT THREE + 0x06F4: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT FOUR + 0x06F5: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT FIVE + 0x06F6: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT SIX + 0x06F7: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT SEVEN + 0x06F8: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT EIGHT + 0x06F9: 'CONTEXTO', # EXTENDED ARABIC-INDIC DIGIT NINE + 0x0640: 'DISALLOWED', # ARABIC TATWEEL + 0x07FA: 'DISALLOWED', # NKO LAJANYALAN + 0x302E: 'DISALLOWED', # HANGUL SINGLE DOT TONE MARK + 0x302F: 'DISALLOWED', # HANGUL DOUBLE DOT TONE MARK + 0x3031: 'DISALLOWED', # VERTICAL KANA REPEAT MARK + 0x3032: 'DISALLOWED', # VERTICAL KANA REPEAT WITH VOICED SOUND MARK + 0x3033: 'DISALLOWED', # VERTICAL KANA REPEAT MARK UPPER HALF + 
0x3034: 'DISALLOWED', # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA + 0x3035: 'DISALLOWED', # VERTICAL KANA REPEAT MARK LOWER HALF + 0x303B: 'DISALLOWED', # VERTICAL IDEOGRAPHIC ITERATION MARK +} +backwardscompatible = {} + + +def hexrange(start, end): + return range(int(start, 16), int(end, 16) + 1) + +def hexvalue(value): + return int(value, 16) + + +class UnicodeVersion(object): + + def __init__(self, version): + result = re.match('^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$', version) + if result: + self.major = int(result.group('major')) + self.minor = int(result.group('minor')) + self.patch = int(result.group('patch')) + self.numerical = (self.major << 8) + (self.minor << 4) + self.patch + self.latest = False + elif version == 'latest': + self.latest = True + else: + raise ValueError('Unrecognized Unicode version') + + def __repr__(self, with_date=True): + if self.latest: + if with_date: + return 'latest@{}'.format(datetime.datetime.now().strftime('%Y-%m-%d')) + else: + return 'latest' + else: + return "{}.{}.{}".format(self.major, self.minor, self.patch) + + @property + def tag(self): + return self.__repr__(with_date=False) + + def __gt__(self, other): + if self.latest: + return True + return self.numerical > other.numerical + + def __eq__(self, other): + if self.latest: + return False + return self.numerical == other.numerical + + +class UnicodeData(object): + + def __init__(self, version, cache, args): + self.version = UnicodeVersion(version) + self.system_version = UnicodeVersion(unicodedata.unidata_version) + self.source = args.source + self.cache = cache + self.max = 0 + + if self.system_version < self.version: + print("Warning: Character stability not guaranteed as Python Unicode data {}" + " older than requested {}".format(self.system_version, self.version)) + + self._load_unicodedata() + self._load_proplist() + self._load_derivedcoreprops() + self._load_blocks() + self._load_casefolding() + self._load_hangulst() + 
self._load_arabicshaping() + self._load_scripts() + self._load_uts46mapping() + + def _load_unicodedata(self): + + f_ud = self._ucdfile('UnicodeData.txt') + self.ucd_data = {} + range_begin = None + for line in f_ud.splitlines(): + fields = line.split(';') + value = int(fields[0], 16) + start_marker = re.match('^<(?P<name>.*?), First>$', fields[1]) + end_marker = re.match('^<(?P<name>.*?), Last>$', fields[1]) + if start_marker: + range_begin = value + elif end_marker: + for i in range(range_begin, value+1): + fields[1] = '<{}>'.format(end_marker.group('name')) + self.ucd_data[i] = fields[1:] + range_begin = None + else: + self.ucd_data[value] = fields[1:] + + def _load_proplist(self): + + f_pl = self._ucdfile('PropList.txt') + self.ucd_props = collections.defaultdict(list) + for line in f_pl.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<prop>\S+)\s*(|\#.*)$', + line) + if result: + if result.group('end'): + for i in hexrange(result.group('start'), result.group('end')): + self.ucd_props[i].append(result.group('prop')) + else: + i = hexvalue(result.group('start')) + self.ucd_props[i].append(result.group('prop')) + + def _load_derivedcoreprops(self): + + f_dcp = self._ucdfile('DerivedCoreProperties.txt') + for line in f_dcp.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<prop>\S+)\s*(|\#.*)$', + line) + if result: + if result.group('end'): + for i in hexrange(result.group('start'), result.group('end')): + self.ucd_props[i].append(result.group('prop')) + else: + i = hexvalue(result.group('start')) + self.ucd_props[i].append(result.group('prop')) + + def _load_blocks(self): + + self.ucd_block = {} + f_b = self._ucdfile('Blocks.txt') + for line in f_b.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})\.\.(?P<end>[0-9A-F]{4,6})\s*;\s*(?P<block>.*)\s*$', + line) + if result: + for i in hexrange(result.group('start'), result.group('end')): + 
self.ucd_block[i] = result.group('block') + self.max = max(self.max, i) + + def _load_casefolding(self): + + self.ucd_cf = {} + f_cf = self._ucdfile('CaseFolding.txt') + for line in f_cf.splitlines(): + result = re.match( + '^(?P<cp>[0-9A-F]{4,6})\s*;\s*(?P<type>\S+)\s*;\s*(?P<subst>[0-9A-F\s]+)\s*', + line) + if result: + if result.group('type') in ('C', 'F'): + self.ucd_cf[int(result.group('cp'), 16)] = \ + ''.join([chr(int(x, 16)) for x in result.group('subst').split(' ')]) + + def _load_hangulst(self): + + self.ucd_hst = {} + f_hst = self._ucdfile('HangulSyllableType.txt') + for line in f_hst.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})\.\.(?P<end>[0-9A-F]{4,6})\s*;\s*(?P<type>\S+)\s*(|\#.*)$', + line) + if result: + for i in hexrange(result.group('start'), result.group('end')): + self.ucd_hst[i] = result.group('type') + + def _load_arabicshaping(self): + + self.ucd_as = {} + f_as = self._ucdfile('ArabicShaping.txt') + for line in f_as.splitlines(): + result = re.match('^(?P<cp>[0-9A-F]{4,6})\s*;\s*.*?\s*;\s*(?P<jt>\S+)\s*;', line) + if result: + self.ucd_as[int(result.group('cp'), 16)] = result.group('jt') + + def _load_scripts(self): + + self.ucd_s = {} + f_s = self._ucdfile('Scripts.txt') + for line in f_s.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<script>\S+)\s*(|\#.*)$', + line) + if result: + if not result.group('script') in self.ucd_s: + self.ucd_s[result.group('script')] = set() + if result.group('end'): + for i in hexrange(result.group('start'), result.group('end')): + self.ucd_s[result.group('script')].add(i) + else: + i = hexvalue(result.group('start')) + self.ucd_s[result.group('script')].add(i) + + def _load_uts46mapping(self): + + self.ucd_idnamt = {} + f_idnamt = self._ucdfile('IdnaMappingTable.txt', urlbase=UTS46_URL) + for line in f_idnamt.splitlines(): + result = re.match( + '^(?P<start>[0-9A-F]{4,6})(|\.\.(?P<end>[0-9A-F]{4,6}))\s*;\s*(?P<fields>[^#]+)', + line) + 
if result: + fields = [x.strip() for x in result.group('fields').split(';')] + if result.group('end'): + for i in hexrange(result.group('start'), result.group('end')): + self.ucd_idnamt[i] = fields + else: + i = hexvalue(result.group('start')) + self.ucd_idnamt[i] = fields + + def _ucdfile(self, filename, urlbase=UCD_URL): + if self.source: + f = open("{}/{}".format(self.source, filename)) + return f.read() + else: + cache_file = None + if self.cache: + cache_file = os.path.expanduser("{}/{}/{}".format( + self.cache, self.version.tag, filename)) + if os.path.isfile(cache_file): + f = open(cache_file) + return f.read() + + version_path = self.version.tag + if version_path == 'latest': + version_path = 'UCD/latest' + url = urlbase.format( + version=version_path, + filename=filename, + ) + content = urlopen(url).read() + + if cache_file: + if not os.path.isdir(os.path.dirname(cache_file)): + os.makedirs(os.path.dirname(cache_file)) + f = open(cache_file, 'wb') + f.write(content) + f.close() + + return str(content) + + def codepoints(self): + for i in range(0, self.max + 1): + yield CodePoint(i, ucdata=self) + + +class CodePoint: + + def __init__(self, value=None, ucdata=None): + self.value = value + self.ucdata = ucdata + + def _casefold(self, s): + r = '' + for c in s: + r += self.ucdata.ucd_cf.get(ord(c), c) + return r + + @property + def exception_value(self): + return exceptions.get(self.value, False) + + @property + def compat_value(self): + return backwardscompatible.get(self.value, False) + + @property + def name(self): + if self.value in self.ucdata.ucd_data: + return self.ucdata.ucd_data[self.value][0] + elif 'Noncharacter_Code_Point' in self.ucdata.ucd_props[self.value]: + return '<noncharacter>' + else: + return '<reserved>' + + @property + def general_category(self): + return self.ucdata.ucd_data.get(self.value, [None, None])[1] + + @property + def unassigned(self): + return not ('Noncharacter_Code_Point' in self.ucdata.ucd_props[self.value] or \ + 
self.value in self.ucdata.ucd_data) + + @property + def ldh(self): + if self.value == 0x002d or \ + self.value in range(0x0030, 0x0039+1) or \ + self.value in range(0x0061, 0x007a+1): + return True + return False + + @property + def join_control(self): + return 'Join_Control' in self.ucdata.ucd_props[self.value] + + @property + def joining_type(self): + return self.ucdata.ucd_as.get(self.value, None) + + @property + def char(self): + return chr(self.value) + + @property + def nfkc_cf(self): + return unicodedata.normalize('NFKC', + self._casefold(unicodedata.normalize('NFKC', self.char))) + + @property + def unstable(self): + return self.char != self.nfkc_cf + + @property + def in_ignorableproperties(self): + for prop in ['Default_Ignorable_Code_Point', 'White_Space', 'Noncharacter_Code_Point']: + if prop in self.ucdata.ucd_props[self.value]: + return True + return False + + @property + def in_ignorableblocks(self): + return self.ucdata.ucd_block.get(self.value) in ( + 'Combining Diacritical Marks for Symbols', 'Musical Symbols', + 'Ancient Greek Musical Notation' + ) + + @property + def oldhanguljamo(self): + return self.ucdata.ucd_hst.get(self.value) in ('L', 'V', 'T') + + @property + def in_lettersdigits(self): + return self.general_category in ('Ll', 'Lu', 'Lo', 'Nd', 'Lm', 'Mn', 'Mc') + + @property + def idna2008_status(self): + if self.exception_value: + return self.exception_value + elif self.compat_value: + return self.compat_value + elif self.unassigned: + return 'UNASSIGNED' + elif self.ldh: + return 'PVALID' + elif self.join_control: + return 'CONTEXTJ' + elif self.unstable: + return 'DISALLOWED' + elif self.in_ignorableproperties: + return 'DISALLOWED' + elif self.in_ignorableblocks: + return 'DISALLOWED' + elif self.oldhanguljamo: + return 'DISALLOWED' + elif self.in_lettersdigits: + return 'PVALID' + else: + return 'DISALLOWED' + + @property + def uts46_data(self): + return self.ucdata.ucd_idnamt.get(self.value, None) + + @property + def 
uts46_status(self): + return ' '.join(self.uts46_data) + + +def diagnose_codepoint(codepoint, args, ucdata): + + cp = CodePoint(codepoint, ucdata=ucdata) + + print("U+{:04X}:".format(codepoint)) + print(" Name: {}".format(cp.name)) + print("1 Exceptions: {}".format(exceptions.get(codepoint, False))) + print("2 Backwards Compat: {}".format(backwardscompatible.get(codepoint, False))) + print("3 Unassigned: {}".format(cp.unassigned)) + print("4 LDH: {}".format(cp.ldh)) + print(" Properties: {}".format(" ".join(sorted(ucdata.ucd_props.get(codepoint, ['None']))))) + print("5 .Join Control: {}".format(cp.join_control)) + print(" NFKC CF: {}".format(" ".join(["U+{:04X}".format(ord(x)) for x in cp.nfkc_cf]))) + print("6 .Unstable: {}".format(cp.unstable)) + print("7 .Ignorable Prop: {}".format(cp.in_ignorableproperties)) + print(" Block: {}".format(ucdata.ucd_block.get(codepoint, None))) + print("8 .Ignorable Block: {}".format(cp.in_ignorableblocks)) + print(" Hangul Syll Type: {}".format(ucdata.ucd_hst.get(codepoint, None))) + print("9 .Old Hangul Jamo: {}".format(cp.oldhanguljamo)) + print(" General Category: {}".format(cp.general_category)) + print("10 .Letters Digits: {}".format(cp.in_lettersdigits)) + print("== IDNA 2008: {}".format(cp.idna2008_status)) + print("== UTS 46: {}".format(cp.uts46_status)) + print("(Unicode {} [sys:{}])".format(ucdata.version, ucdata.system_version)) + +def ucdrange(start, end): + if start == end: + return ("{:04X}".format(start.value), start.name) + else: + return ("{:04X}..{:04X}".format(start.value, end.value), + "{}..{}".format(start.name, end.name)) + +def optimised_list(d): + yield '(' + for value in intranges_from_list(d): + yield ' {},'.format(hex(value)) + yield ' ),' + +def make_table(args, ucdata): + + last_status = None + cps = [] + table_data = [] + + for cp in ucdata.codepoints(): + status = cp.idna2008_status + if (last_status and last_status != status): + (values, description) = ucdrange(cps[0], cps[-1]) + 
table_data.append([values, last_status, description]) + cps = [] + last_status = status + cps.append(cp) + (values, description) = ucdrange(cps[0], cps[-1]) + table_data.append([values, last_status, description]) + + if args.dir: + + f = open("{}/idna-table-{}.txt".format(args.dir, ucdata.version), 'wb') + for row in table_data: + f.write("{:12}; {:12}# {:.44}\n".format(*row).encode('ascii')) + f.close() + + else: + + for row in table_data: + print("{:12}; {:12}# {:.44}".format(*row)) + +def idna_libdata(ucdata): + + yield "# This file is automatically generated by tools/idna-data\n" + yield "__version__ = \"{}\"".format(ucdata.version) + + # + # Script classifications are used by some CONTEXTO rules in RFC 5891 + # + yield "scripts = {" + for script in SCRIPT_WHITELIST: + prefix = " '{0}': ".format(script) + for line in optimised_list(ucdata.ucd_s[script]): + yield prefix + line + prefix = "" + yield "}" + + # + # Joining types are used by CONTEXTJ rule A.1 + # + yield "joining_types = {" + for cp in ucdata.codepoints(): + if cp.joining_type: + yield " 0x{0:x}: {1},".format(cp.value, ord(cp.joining_type)) + yield "}" + + # + # These are the classification of codepoints into PVALID, CONTEXTO, CONTEXTJ, etc. 
+ # + yield "codepoint_classes = {" + classes = {} + for cp in ucdata.codepoints(): + status = cp.idna2008_status + if status in ('UNASSIGNED', 'DISALLOWED'): + continue + if not status in classes: + classes[status] = set() + classes[status].add(cp.value) + for status in ['PVALID', 'CONTEXTJ', 'CONTEXTO']: + prefix = " '{0}': ".format(status) + for line in optimised_list(classes[status]): + yield prefix + line + prefix = "" + yield "}" + +def uts46_ranges(ucdata): + + last = (None, None) + for cp in ucdata.codepoints(): + fields = cp.uts46_data + if not fields: + continue + status, mapping = UTS46_STATUSES[fields[0]] + if mapping: + mapping = "".join(chr(int(codepoint, 16)) for codepoint in fields[1].split()) + mapping = mapping.replace("\\", "\\\\").replace("'", "\\'") + else: + mapping = None + if cp.value > 255 and (status, mapping) == last: + continue + last = (status, mapping) + + if mapping is not None: + yield "(0x{0:X}, '{1}', u'{2}')".format(cp.value, status, mapping) + else: + yield "(0x{0:X}, '{1}')".format(cp.value, status) + +def uts46_libdata(ucdata): + + yield "# This file is automatically generated by tools/idna-data" + yield "# vim: set fileencoding=utf-8 :\n" + yield '"""IDNA Mapping Table from UTS46."""\n\n' + + yield "__version__ = \"{}\"".format(ucdata.version) + + idx = -1 + for row in uts46_ranges(ucdata): + idx += 1 + if idx % UTS46_SEGMENT_SIZE == 0: + if idx != 0: + yield " ]\n" + yield "def _seg_{0}():\n return [".format(idx // UTS46_SEGMENT_SIZE) + yield " {0},".format(row) + yield " ]\n" + + yield "uts46data = tuple(" + yield " _seg_0()" + for i in range(1, idx // UTS46_SEGMENT_SIZE + 1): + yield " + _seg_{0}()".format(i) + yield ")" + +def make_libdata(args, ucdata): + + dest_dir = args.dir or '.' 
+ + target_filename = os.path.join(dest_dir, 'idnadata.py') + with open(target_filename, 'wb') as target: + for line in idna_libdata(ucdata): + target.write((line + "\n").encode('utf-8')) + + target_filename = os.path.join(dest_dir, 'uts46data.py') + with open(target_filename, 'wb') as target: + for line in uts46_libdata(ucdata): + target.write((line + "\n").encode('utf-8')) + +def arg_error(message, parser): + + parser.print_usage() + print('{}: error: {}'.format(sys.argv[0], message)) + sys.exit(2) + +def main(): + + parser = argparse.ArgumentParser(description='Determine IDNA code-point validity data') + parser.add_argument('action', type=str, default='preferred', + help='Task to perform (make-libdata, make-table, <codepoint>)') + + parser.add_argument('--version', type=str, default='preferred', + help='Unicode version to use (preferred, latest, <x.y.z>)') + parser.add_argument('--source', type=str, default=None, + help='Where to fetch Unicode data (file path)') + parser.add_argument('--dir', type=str, default=None, help='Where to export the output') + parser.add_argument('--cache', type=str, default=None, help='Where to cache Unicode data') + parser.add_argument('--no-cache', action='store_true', help='Don\'t cache Unicode data') + libdata = parser.add_argument_group('make-libdata', 'Make module data for Python IDNA library') + + tables = parser.add_argument_group('make-table', 'Make IANA-style reference table') + + codepoint = parser.add_argument_group('codepoint', + 'Display related data for given codepoint (e.g. 
U+0061)') + + args = parser.parse_args() + + if args.version == 'preferred': + target_version = PREFERRED_VERSION + else: + target_version = args.version + + if args.cache and args.no_cache: + arg_error('Can\'t use both --cache and --no-cache', parser) + cache = args.cache or DEFAULT_CACHE_DIR + if args.no_cache: + cache = None + + ucdata = UnicodeData(target_version, cache, args) + + if args.action == 'make-table': + make_table(args, ucdata) + elif args.action == 'make-libdata': + make_libdata(args, ucdata) + else: + # Pass the flag as an argument: an inline (?i) that is not at the start of the pattern is rejected by newer Python versions + result = re.match(r'^(U\+|)(?P<cp>[0-9A-F]{4,6})$', args.action, re.IGNORECASE) + if result: + codepoint = int(result.group('cp'), 16) + diagnose_codepoint(codepoint, args, ucdata) + sys.exit(0) + arg_error('Don\'t recognize action or codepoint value', parser) + + +if __name__ == '__main__': + main() + + + 
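Editor's note: make_table in the script above walks every codepoint in order and emits one table row each time the IDNA 2008 status changes, formatting each run with ucdrange. The following is a minimal, self-contained sketch of that run-collapsing idea, not the tool's own code. It swaps the explicit loop for itertools.groupby, and the names collapse_ranges, format_range and the sample data are hypothetical, introduced only for illustration. It assumes the input is sorted by codepoint and has no gaps, as is the case when iterating over a full range.

```python
from itertools import groupby


def collapse_ranges(pairs):
    """Collapse (codepoint, status) pairs into (start, end, status) runs.

    Assumes pairs are sorted by codepoint with no gaps, so a change of
    status is the only thing that ends a run.
    """
    for status, run in groupby(pairs, key=lambda pair: pair[1]):
        run = list(run)
        yield run[0][0], run[-1][0], status


def format_range(start, end):
    """Format a run the way an IANA-style table does: XXXX or XXXX..YYYY."""
    if start == end:
        return "{:04X}".format(start)
    return "{:04X}..{:04X}".format(start, end)


# Hypothetical toy input; real statuses come from the derived property rules.
sample = [
    (0x2D, 'PVALID'),
    (0x2E, 'DISALLOWED'),
    (0x2F, 'DISALLOWED'),
    (0x30, 'PVALID'),
    (0x31, 'PVALID'),
]

rows = [(format_range(start, end), status)
        for start, end, status in collapse_ranges(sample)]

for values, status in rows:
    print("{:12}; {}".format(values, status))
```

Grouping on the status alone keeps the sketch short, but it would merge runs across a gap in the input, which is why the no-gaps assumption matters.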
