My error. I should have specified XML 1.0, as that is the spec that I drew the test code from: https://www.w3.org/TR/xml/#NT-NameStartChar
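For reference, the NameStartChar production at that link can be checked without going through Xerces at all. What follows is only a minimal sketch under that assumption: the class name NameStartCharCheck and the method isNameStartChar are made up for illustration and are not part of Jena or Xerces.

```java
// Hypothetical helper (not Jena or Xerces code): a direct transcription of the
// NameStartChar production from XML 1.0 (Fifth Edition).
final class NameStartCharCheck {

    // Inclusive code point ranges, one entry per alternative in the production.
    private static final int[][] RANGES = {
        { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
        { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
        { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
        { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
        { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF }
    };

    // Returns true if the code point falls inside any of the ranges above.
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isNameStartChar('a'));   // true: inside [a-z]
        System.out.println(isNameStartChar('1'));   // false: digits cannot start a name
        System.out.println(isNameStartChar(0x132)); // true: inside [#xF8-#x2FF]
    }
}
```

Comparing the result of such a check with XMLChar.isNameStart over the same ranges is one way to list the code points on which the two disagree.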
So this is a failure in Xerces to meet the XML 1.0 naming spec. I have opened a defect with Xerces (https://issues.apache.org/jira/browse/XERCESJ-1690) but I don't expect much movement there.

Claude

On Wed, Feb 14, 2018 at 10:38 AM, Rob Vesse wrote:

> If memory serves this is mostly historical: once upon a time RDF/XML was
> the only serialisation available and so everything had to be XML compliant.
> Obviously things have evolved over time but the implementation is
> conservative in this regard.
>
> Also I think XML 1.1 post-dates RDF/XML and various other specifications,
> all of which are defined in terms of XML 1.0. For maximum compatibility it
> is better for us to be conservative because most of the ecosystem has not
> adopted XML 1.1 yet.
>
> Rob
>
> On 14/02/2018, 09:04, "Claude Warren" wrote:
>
>     The issue is that predicate namespaces are parsed with XMLChar. So if I
>     have one that is correctly formed based on XML 1.1 spec but the XMLChar
>     code does not recognize the first character of the local name, it will not
>     split the URL correctly. All code that depends upon
>     Resource.getNameSpace() and Resource.getLocalName() will be incorrect. It
>     seems to me this is a low level problem.
>
>     While it should be easy to fix the parsing problem, I am not certain what
>     effect that will have on any other code that is dependent upon the Xerces
>     code (where XMLChar originates).
>
>     Claude
>
>     On Tue, Feb 13, 2018 at 6:50 PM, Andy Seaborne wrote:
>
>     > Maybe SplitIRI will help?
>     >
>     > It does Turtle splitting as well as XML.
>     >
>     > Andy
>     >
>     > On 13/02/18 17:39, Claude Warren wrote:
>     >
>     >> It is used in org.apache.jena.rdf.model.impl.Util namespace splitting
>     >> code.
>     >>
>     >> On Tue, Feb 13, 2018 at 4:44 PM, Andy Seaborne wrote:
>     >>
>     >>> Where is XMLChar.isNameStart being used?
>     >>>
>     >>> On 13/02/18 13:10, Claude Warren wrote:
>     >>>
>     >>>> Is there a reason that Jena does not support the full range of XML name
>     >>>> start characters?
>     >>>>
>     >>>> see https://www.w3.org/TR/xml/#NT-NameStartChar
>     >>>>
>     >>>> I wrote a quick test and found that there were a number of characters
>     >>>> that Jena does not support.
>     >>>> Miscategorization appears to start at 0x132. There are 936990
>     >>>> miscategorized characters.
>     >>>>
>     >>>> The issue is actually in the Xerces util class XMLChar.
>     >>>>
>     >>>> Is this because of the version of Xerces we are stuck with? Is there a
>     >>>> way around this issue?
>     >>>>
>     >>>> Claude
>     >>>>
>     >>>> p.s. Since I can't attach a file, here is the test code I wrote.
>     >>>>
>     >>>> import static org.junit.Assert.assertTrue;
>     >>>>
>     >>>> import org.apache.xerces.util.XMLChar;
>     >>>> import org.junit.Test;
>     >>>>
>     >>>> public class NameTest {
>     >>>>     /*
>     >>>>      * NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
>     >>>>      * [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
>     >>>>      * [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
>     >>>>      * [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
>     >>>>      * [#x10000-#xEFFFF]
>     >>>>      */
>     >>>>
>     >>>>     // Inclusive code point ranges, one entry per alternative above.
>     >>>>     int[][] ranges = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' },
>     >>>>             { 'a', 'z' }, { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
>     >>>>             { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
>     >>>>             { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
>     >>>>             { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };
>     >>>>
>     >>>>     // Fails on the first code point that XMLChar rejects.
>     >>>>     @Test
>     >>>>     public void testNameStart() {
>     >>>>         for (int[] range : ranges) {
>     >>>>             for (int c = range[0]; c <= range[1]; c++) {
>     >>>>                 assertTrue( String.format( "character %s 0x%s", c,
>     >>>>                         Integer.toHexString( c ) ), XMLChar.isNameStart( c ) );
>     >>>>             }
>     >>>>         }
>     >>>>     }
>     >>>>
>     >>>>     // Prints every rejected code point, 25 per line, then the total.
>     >>>>     @Test
>     >>>>     public void listNameStartErr() {
>     >>>>         int cnt = 0;
>     >>>>         for (int[] range : ranges) {
>     >>>>             for (int c = range[0]; c <= range[1]; c++) {
>     >>>>                 if (!XMLChar.isNameStart( c )) {
>     >>>>                     System.out.print( String.format( "0x%s ",
>     >>>>                             Integer.toHexString( c ) ) );
>     >>>>                     cnt++;
>     >>>>                     if (cnt % 25 == 0) {
>     >>>>                         System.out.println();
>     >>>>                     }
>     >>>>                 }
>     >>>>             }
>     >>>>         }
>     >>>>         System.out.println();
>     >>>>         System.out.println( cnt + " characters miscategorized" );
>     >>>>     }
>     >>>> }
>
>     --
>     I like: Like Like - The likeliest place on the web
>     <http://like-like.xenei.com>
>     LinkedIn: http://www.linkedin.com/in/claudewarren

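To illustrate the splitting concern raised in the thread, here is a rough sketch of how the choice of name-start predicate moves the boundary between namespace and local name. It is not the Jena Util or SplitIRI code: the class NamespaceSplitSketch, its methods, the example URI, and the ASCII-only predicate are all hypothetical. In particular, the ASCII-only predicate is just a stand-in for an overly narrow table and does not reproduce the actual Xerces XMLChar table, and the name-character test is deliberately simplified.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch (not Jena code): find where the local name starts in a URI,
// given a pluggable "may this code point start a name?" predicate.
final class NamespaceSplitSketch {

    // XML 1.0 (Fifth Edition) NameStartChar ranges, without ':' because a
    // local name is an NCName.
    private static final int[][] SPEC_RANGES = {
        { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
        { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
        { 0x370, 0x37D }, { 0x37F, 0x1FFF }, { 0x200C, 0x200D },
        { 0x2070, 0x218F }, { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF },
        { 0xF900, 0xFDCF }, { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF }
    };

    static boolean specNameStart(int cp) {
        for (int[] range : SPEC_RANGES) {
            if (cp >= range[0] && cp <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Assumption for illustration only: a narrow predicate that accepts just
    // ASCII letters and '_'. This is NOT the real Xerces table.
    static boolean asciiOnlyNameStart(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') || cp == '_';
    }

    // Returns the index where the local name begins, or uri.length() if none.
    static int splitPoint(String uri, IntPredicate isNameStart) {
        // Simplified NameChar: a name-start character, an ASCII digit, '-' or '.'.
        IntPredicate isNameChar = cp -> isNameStart.test(cp)
                || (cp >= '0' && cp <= '9') || cp == '-' || cp == '.';

        // Walk backwards over the trailing run of name characters.
        int runStart = uri.length();
        while (runStart > 0) {
            int cp = uri.codePointBefore(runStart);
            if (!isNameChar.test(cp)) {
                break;
            }
            runStart -= Character.charCount(cp);
        }

        // Walk forwards to the first character in that run that may start a name.
        int split = runStart;
        while (split < uri.length()) {
            int cp = uri.codePointAt(split);
            if (isNameStart.test(cp)) {
                break;
            }
            split += Character.charCount(cp);
        }
        return split;
    }

    public static void main(String[] args) {
        // U+0132 is the first code point reported as miscategorized in the thread.
        String uri = "http://example.org/ns#\u0132ssel";
        int wide = splitPoint(uri, NamespaceSplitSketch::specNameStart);
        int narrow = splitPoint(uri, NamespaceSplitSketch::asciiOnlyNameStart);
        System.out.println("spec predicate local name:   " + uri.substring(wide));
        System.out.println("narrow predicate local name: " + uri.substring(narrow));
    }
}
```

With the spec-based predicate the local name in this example keeps its leading U+0132 character, while the narrow predicate pushes that character into the namespace part and leaves "ssel" as the local name. That is the kind of shift that would affect callers of Resource.getLocalName().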