Maybe SplitIRI will help?

It does Turtle splitting as well as XML.

    Andy

On 13/02/18 17:39, Claude Warren wrote:
It is used in org.apache.jena.rdf.model.impl.Util namespace splitting code.

On Tue, Feb 13, 2018 at 4:44 PM, Andy Seaborne <a...@apache.org> wrote:

Where is XMLChar.isNameStart being used?


On 13/02/18 13:10, Claude Warren wrote:

Is there a reason that Jena does not support the full range of XML name
start characters?

see https://www.w3.org/TR/xml/#NT-NameStartChar

I wrote a quick test and found that there were a number of characters that
Jena does not support.
Miscategorization appears to start at 0x132.  There are 936990
miscategorized characters.

The issue is actually in the Xerces util class XMLChar

Is this because of the version of Xerces we are stuck with?  Is there a way
around this issue?

Claude

p.s. Since I can't attach a file, here is the test code I wrote.

import static org.junit.Assert.assertTrue;

import org.apache.xerces.util.XMLChar;
import org.junit.Test;

public class NameTest {
      /*
       * NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
       * [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
       * [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
       * [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
       */

      // Inclusive code point ranges, in the order of the production above.
      int[][] ranges = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
              { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
              { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
              { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
              { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

      @Test
      public void testNameStart() {

          for (int[] range : ranges) {
              for (int c = range[0]; c <= range[1]; c++) {
                  assertTrue( String.format( "character %s 0x%s", c, Integer.toHexString( c ) ),
                          XMLChar.isNameStart( c ) );
              }
          }

      }

      @Test
      public void listNameStartErr() {
          int cnt = 0;
          for (int[] range : ranges) {
              for (int c = range[0]; c <= range[1]; c++) {
                  if (!XMLChar.isNameStart( c ))
                  {
                      System.out.print( String.format( "0x%s ", Integer.toHexString( c ) ) );
                      cnt++;
                      if (cnt % 25 == 0)
                      {
                          System.out.println();
                      }

                  }

              }
          }
          System.out.println();
          System.out.println( cnt+" characters miscategorized"  );
      }

}
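
As one possible way around the issue, here is a minimal sketch of a standalone check written directly from the NameStartChar production quoted above, with no dependency on Xerces. The class name NameStartCharSketch and the method isNameStartChar are made up for illustration; this is not an existing Jena or Xerces API, and it says nothing about how Jena would actually choose to fix this.

```java
public class NameStartCharSketch {

    // Inclusive code point ranges copied from the XML 1.0 NameStartChar
    // production at https://www.w3.org/TR/xml/#NT-NameStartChar
    private static final int[][] RANGES = {
        { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
        { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
        { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
        { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
        { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF }
    };

    // Hypothetical helper: true if the code point falls in any listed range.
    // Takes an int code point (not a char) so that the supplementary range
    // 0x10000-0xEFFFF can be tested as well.
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0x132 is where the miscategorization was reported to start.
        System.out.println(isNameStartChar(0x132));   // inside [#xF8-#x2FF]
        System.out.println(isNameStartChar(0xD7));    // gap between #xD6 and #xD8
        System.out.println(isNameStartChar(0x10000)); // start of the supplementary range
    }
}
```

A linear scan over 16 ranges keeps the sketch easy to compare against the spec; a lookup table for the low code points would be faster if this sat on a hot path.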




