[ 
https://issues.apache.org/jira/browse/JENA-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901576#comment-13901576
 ] 

Andy Seaborne edited comment on JENA-641 at 2/14/14 4:03 PM:
-------------------------------------------------------------

Not a bug.

The Java charset converters don't say where the encoding fails. Normally you get junk characters (e.g. ISO-8859-1 read as UTF-8 mostly passes through, but yields non-codepoints); occasionally the actual UTF-8 encoding is illegal. Java offers either to substitute a "bad character" marker or to return an error. RIOT chooses the error because scanning for the bad character afterwards is costly, and this is on the critical path for parsing.
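
As a purely illustrative sketch (not the RIOT code), the two behaviours the JDK decoder offers look like this; note that the exception says how many bytes were malformed ("Input length = 1") but not where they are in the input:

{code:java}
import java.io.*;
import java.nio.charset.*;

public class DecodeModes {
    public static void main(String[] args) throws IOException {
        // REPORT: throw MalformedInputException at the bad byte sequence.
        // This is the behaviour RIOT relies on, hence the wrapped AtlasException.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // REPLACE would instead silently substitute U+FFFD and keep going:
        //   .onMalformedInput(CodingErrorAction.REPLACE)

        try (Reader r = new InputStreamReader(new FileInputStream(args[0]), strict)) {
            char[] buf = new char[8 * 1024];
            while (r.read(buf) != -1) { /* just consume the characters */ }
            System.out.println("Decoded cleanly");
        } catch (MalformedInputException e) {
            // No offset information is available here.
            System.err.println("Malformed input: " + e.getMessage());
        }
    }
}
{code}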

To find a bad encoding we would need to do something like:

# Switch to bad character markers and post-scan the output (see the sketch after this list).
# Have our own UTF-8 to character converter.
# Keep a copy of the input and either switch to our own converter or submit small chunks to find the error.
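
A minimal sketch of what (1) could look like, for illustration only (the input file is taken from the command line here): decode leniently, then scan the decoded characters for the U+FFFD replacement marker. It locates the problem only approximately (a character position, not a byte offset), and it is exactly the extra pass we want to avoid:

{code:java}
import java.io.*;
import java.nio.charset.*;

public class PostScan {
    public static void main(String[] args) throws IOException {
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (Reader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), lenient))) {
            long charPos = 0;
            int c;
            while ((c = r.read()) != -1) {
                if (c == 0xFFFD)    // replacement marker: malformed bytes were here
                    System.err.println("Bad encoding near character " + charPos);
                charPos++;
            }
        }
    }
}
{code}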

I have tried (2) before: there is a code-based, not lookup-table-based, input stream reader (InStreamUTF8). It's slower than using the Java native decoders even though it does one less copy of the data. Presumably the Java native decoders are native code (in C), and the care with which they are written outweighs the extra copy. The speed difference is measurable even when run inside a parser that is also doing tokenizing and grammar rules.
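
For flavour, the core of a hand-written checker in the spirit of (2); this is a simplified sketch, not InStreamUTF8 itself (it skips overlong forms, surrogates and range checks), but it shows why a do-it-yourself decoder can report exactly where the input goes wrong:

{code:java}
import java.io.*;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long offset = 0;                 // byte position of the next byte to read
            int b;
            while ((b = in.read()) != -1) {
                long lead = offset++;        // position of this lead byte
                int extra;
                if      ((b & 0x80) == 0x00) extra = 0;    // 1-byte (ASCII)
                else if ((b & 0xE0) == 0xC0) extra = 1;    // 2-byte sequence
                else if ((b & 0xF0) == 0xE0) extra = 2;    // 3-byte sequence
                else if ((b & 0xF8) == 0xF0) extra = 3;    // 4-byte sequence
                else { fail(lead, b); return; }            // illegal lead byte
                for (int i = 0; i < extra; i++) {
                    int c = in.read();
                    if (c == -1) { System.err.println("Truncated sequence at end of input"); return; }
                    if ((c & 0xC0) != 0x80) { fail(offset, c); return; }  // bad continuation byte
                    offset++;
                }
            }
            System.out.println("UTF-8 looks OK");
        }
    }

    static void fail(long offset, int b) {
        System.err.printf("Malformed UTF-8 at byte offset %d (0x%02X)%n", offset, b & 0xFF);
    }
}
{code}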

To do (3) needs to keep a copy of the data as it flows through, which will impact performance measurably. Similarly, (1) is a second pass over the data, with a bad data-cache access pattern.
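
For completeness, a rough sketch of the "find the error from a retained copy" part of (3), using the lower-level CharsetDecoder API directly (CoderResult does carry a position); illustrative only, and it still pays the cost of keeping the whole input around:

{code:java}
import java.nio.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.io.IOException;

public class LocateError {
    public static void main(String[] args) throws IOException {
        byte[] copy = Files.readAllBytes(Paths.get(args[0]));   // the retained copy
        ByteBuffer bytes = ByteBuffer.wrap(copy);
        CharBuffer chars = CharBuffer.allocate(8 * 1024);
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        while (true) {
            CoderResult r = dec.decode(bytes, chars, true);
            if (r.isMalformed() || r.isUnmappable()) {
                System.err.println("Bad bytes at offset " + bytes.position()
                        + " (length " + r.length() + ")");
                return;
            }
            if (r.isUnderflow()) {          // all input consumed cleanly
                System.out.println("Decoded cleanly");
                return;
            }
            chars.clear();                  // overflow: discard decoded chars, keep going
        }
    }
}
{code}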



> org.apache.jena.atlas.AtlasException on particular Turtle file
> --------------------------------------------------------------
>
>                 Key: JENA-641
>                 URL: https://issues.apache.org/jira/browse/JENA-641
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: RIOT
>    Affects Versions: Jena 2.11.1
>            Reporter: Vladimir Alexiev
>         Attachments: getty-codes.ttl
>
>
> {noformat}
> > riot --validate getty-codes.ttl
> Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
>         at org.apache.jena.atlas.io.IO.exception(IO.java:206)
>         at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)
>         at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:154)
>         at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:137)
>         at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:243)
>         at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:237)
>         at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:159)
>         at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:100)
>         at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>         at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:131)
>         at riotcmd.CmdLangParse.parseRIOT(CmdLangParse.java:253)
>         at riotcmd.CmdLangParse.parseFile(CmdLangParse.java:182)
>         at riotcmd.CmdLangParse.parseFile(CmdLangParse.java:172)
>         at riotcmd.CmdLangParse.exec(CmdLangParse.java:148)
>         at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
>         at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>         at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>         at riotcmd.riot.main(riot.java:35)
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>         at java.nio.charset.CoderResult.throwException(Unknown Source)
>         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
>         at sun.nio.cs.StreamDecoder.read(Unknown Source)
>         at java.io.InputStreamReader.read(Unknown Source)
>         at java.io.Reader.read(Unknown Source)
> {noformat}


