[ 
https://issues.apache.org/jira/browse/ANY23-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281300#comment-13281300
 ] 

Peter Ansell commented on ANY23-99:
-----------------------------------

In the past, some of the hardest faults to find in a UTF-8-standardised 
system have come from using the OutputStreamWriter(OutputStream) constructor 
instead of the OutputStreamWriter(OutputStream,Charset) constructor. I have 
no specific example of non-ASCII output coming out of the NQuadsWriter. Are 
there any character sets that could produce non-ASCII-compatible NQuads 
documents if the user's locale was set up with that charset and 
OutputStreamWriter(OutputStream) inherited it by default because we didn't 
specify US-ASCII explicitly? The escaping seems to make it okay at a semantic 
level, but in practice the output bytes would still vary with the JVM 
environment properties if the charset isn't explicitly set. Not changing the 
constructor just seems like inviting a bug that could easily be avoided 
(given that the current spec says ASCII-only).
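
As a rough illustration of the concern above (a sketch, not code from Any23), 
the following writes the same already-escaped, ASCII-only line through an 
OutputStreamWriter with two explicitly named charsets. The class name 
CharsetDemo, the encode helper, the example.org line and the choice of UTF-16 
as the non-ASCII-compatible charset are all assumptions made for the example. 
Whether any real user environment defaults to such a charset is a separate 
question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

class CharsetDemo {
    // Hypothetical helper: encodes text through an OutputStreamWriter
    // created with an explicitly named charset.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(bytes, Charset.forName(charsetName));
            writer.write(text);
            writer.close(); // flushes the encoder into the byte array
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up NQuads line whose literal is already escaped, so every
        // character in the Java string is plain ASCII.
        String line = "<http://example.org/s> <http://example.org/p> "
                + "\"caf\\u00E9\" <http://example.org/g> .\n";

        byte[] ascii = encode(line, "US-ASCII");
        byte[] utf16 = encode(line, "UTF-16");

        System.out.println("characters:     " + line.length());
        System.out.println("US-ASCII bytes: " + ascii.length);
        System.out.println("UTF-16 bytes:   " + utf16.length);
        // What the one-argument constructor would pick up on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

With US-ASCII each character becomes one byte. With UTF-16 the writer emits a 
byte-order mark and two bytes per character, so the result is no longer an 
ASCII document even though every character was escaped.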

There are examples of non-ASCII data successfully going into the NQuadsParser 
in NQuadsParserTest, which is to be expected if we accept liberally and output 
standardised NQuads, although it is a little strange that the test suite 
explicitly supports it given the specification is very clear currently about 
the \u encoding rules for all non-ASCII characters.

It would be great if both NTriples and NQuads fully supported UTF-8 when they 
are revised. It is also great that NTriples is getting a specific MIME type 
this time around. Hopefully the distinction between the two types for 
essentially the same format doesn't confuse people. It seems fairly unusual 
to have a single format with two legitimate types where the only difference 
is the encoding rules. Ideally a parser would handle \uNNNN the same as the 
native UTF-8 bytes. That would make it possible to parse old documents, while 
all new documents just use UTF-8 without the writer having to check whether 
the user wanted text/plain NTriples or application/n-triples NTriples.
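
To make "handle \uNNNN the same as the native UTF-8 bytes" concrete, here is a 
minimal sketch of the idea. It is not the NQuadsParser or Rio implementation. 
The class UnicodeEscapeDemo and its unescape helper are hypothetical, and the 
sketch only covers the four-digit form, leaving every other escape untouched.

```java
class UnicodeEscapeDemo {
    // Hypothetical helper: replaces four-digit backslash-u escapes with the
    // character they denote. Any other backslash escape, including an
    // escaped backslash, is copied through unchanged.
    static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '\\' && i + 1 < in.length()) {
                char next = in.charAt(i + 1);
                if (next == 'u' && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                    out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String escaped = "caf\\u00E9"; // as found in an old, ASCII-only document
        String nativeForm = "caf\u00E9"; // as decoded from native UTF-8 bytes
        // Both forms should end up as the same Java string.
        System.out.println(unescape(escaped).equals(unescape(nativeForm)));
    }
}
```

A parser that normalises both forms like this could accept old and new 
documents alike, which leaves the text/plain versus application/n-triples 
question to the writer side only.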

Naively, I would see this possibly requiring two different Rio writers (as 
each Rio writer has a one-to-one relationship with a single RDFFormat, which 
has a single charset attached to it) and possibly two different Rio parsers 
for the same reason. That doesn't really seem ideal, but if necessary it may 
be a workaround.
                
> NQuadsWriter should force ASCII in OutputStream constructor
> -----------------------------------------------------------
>
>                 Key: ANY23-99
>                 URL: https://issues.apache.org/jira/browse/ANY23-99
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: Peter Ansell
>
> The NQuads specification states that all NQuads documents must be ASCII 
> encoded. [1] The current NQuadsWriter(OutputStream) constructor does not 
> enforce this when creating the OutputStreamWriter to wrap the given 
> OutputStream. If it is not enforced, then the user's locale will be used to 
> create the OutputStreamWriter, which may not enforce US-ASCII.
> Patch is to replace the constructor with:
>         this( new OutputStreamWriter(os, Charset.forName("US-ASCII")) );
> [1] http://sw.deri.org/2008/07/n-quads/#mediatype
