[
https://issues.apache.org/jira/browse/JENA-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901576#comment-13901576
]
Andy Seaborne edited comment on JENA-641 at 2/14/14 4:03 PM:
-------------------------------------------------------------
Not a bug.
The Java charset converters don't say where the encoding fails. Normally, you
get junk characters (e.g. ISO-8859-1 read as UTF-8 passes through most of the
time but the result is non-codepoints); occasionally, the actual UTF-8 encoding
is illegal. Java offers to put in a "bad character" marker or return an error.
RIOT chooses the error case because scanning for the bad character afterwards
is costly and this is on the critical path for parsing.
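The two choices correspond to CodingErrorAction.REPLACE and
CodingErrorAction.REPORT in java.nio.charset. The following is a minimal
sketch of the difference, not RIOT's code; the class name DecodeActions and
the sample bytes are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jena.
public class DecodeActions {
    // "caf", then 0xE9 (e-acute in ISO-8859-1, malformed as UTF-8 here), then " x".
    static final byte[] SAMPLE = {'c', 'a', 'f', (byte) 0xE9, ' ', 'x'};

    // Decode bytes as UTF-8 using the given action for bad input.
    // Returns the decoded text, or null if the decoder reported an error.
    static String decode(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .onUnmappableCharacter(action)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // With REPORT this is typically a MalformedInputException,
            // which carries a length but no position in the input.
            return null;
        }
    }

    public static void main(String[] args) {
        // REPLACE: the bad byte becomes U+FFFD and decoding carries on.
        System.out.println(decode(SAMPLE, CodingErrorAction.REPLACE));
        // REPORT: decoding stops with an exception (shown here as null).
        System.out.println(decode(SAMPLE, CodingErrorAction.REPORT));
    }
}
```

The REPORT case is the one that surfaces as the MalformedInputException in the
stack trace on this issue.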
To find a bad encoding we would need to do something like:
# Switch to bad character markers and post-scan the output
# Have our own UTF-8 to character converter.
# Keep a copy of input and either switch to our own converter or submit small
chunks to find the error.
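As a rough illustration of option (1), the post-scan could look like the
sketch below. It assumes the text was already decoded with bad character
markers (U+FFFD); ReplacementScan and firstReplacementChar are hypothetical
names, not Jena classes. Note that it cannot tell a marker inserted by the
decoder from a U+FFFD that was genuinely present in the data.

```java
// Hypothetical helper, not part of Jena.
public class ReplacementScan {
    // Returns {line, column} (both 1-based, counted in chars) of the first
    // U+FFFD in the decoded text, or null if there is none.
    static int[] firstReplacementChar(String text) {
        int line = 1;
        int col = 1;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\uFFFD') {
                return new int[] {line, col};
            }
            if (ch == '\n') {
                line++;
                col = 1;
            } else {
                col++;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] pos = firstReplacementChar("ab\ncaf\uFFFD x");
        System.out.println(pos == null ? "clean" : "line " + pos[0] + ", column " + pos[1]);
    }
}
```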
I have tried (2) before - there is a code-based, not lookup-table-based, input
stream reader (InStreamUTF8). It's slower than using the Java native decoders
even though it does one fewer copy of the data. Presumably the Java native
decoders are in C (they are native code) and, given the care with which they
are written, this outweighs the extra copy. The speed difference is measurable
even when run inside a parser that is also doing tokenizing and grammar rules.
Doing (3) requires keeping a copy of the data as it flows through, which will
impact performance measurably. Similarly, (1) is a second pass over the data,
with a bad data cache pattern.
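For a one-off check of a file outside the parser, where the extra pass does
not matter, a separate diagnostic along the following lines may help. This is
a sketch, not something RIOT does: FindBadUtf8 and firstBadOffset are
hypothetical names, and it assumes the whole file fits in memory. It relies on
the lower-level CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) call,
which leaves the input buffer positioned at the malformed bytes when it
returns an error result.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic, not part of Jena.
public class FindBadUtf8 {
    // Returns the byte offset of the first malformed UTF-8 sequence,
    // or -1 if the whole array decodes cleanly.
    static int firstBadOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(8192);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstBadOffset(data);
        System.out.println(offset < 0 ? "No malformed UTF-8 found"
                                      : "First malformed byte at offset " + offset);
    }
}
```

Run against the file that fails to parse, the reported offset can then be
inspected with a hex viewer or editor.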
> org.apache.jena.atlas.AtlasException on particular Turtle file
> --------------------------------------------------------------
>
> Key: JENA-641
> URL: https://issues.apache.org/jira/browse/JENA-641
> Project: Apache Jena
> Issue Type: Bug
> Components: RIOT
> Affects Versions: Jena 2.11.1
> Reporter: Vladimir Alexiev
> Attachments: getty-codes.ttl
>
>
> {noformat}
> > riot --validate getty-codes.ttl
> Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
> at org.apache.jena.atlas.io.IO.exception(IO.java:206)
> at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)
> at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:154)
> at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:137)
> at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:243)
> at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:237)
> at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:159)
> at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:100)
> at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
> at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:131)
> at riotcmd.CmdLangParse.parseRIOT(CmdLangParse.java:253)
> at riotcmd.CmdLangParse.parseFile(CmdLangParse.java:182)
> at riotcmd.CmdLangParse.parseFile(CmdLangParse.java:172)
> at riotcmd.CmdLangParse.exec(CmdLangParse.java:148)
> at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
> at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
> at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
> at riotcmd.riot.main(riot.java:35)
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(Unknown Source)
> at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
> at sun.nio.cs.StreamDecoder.read(Unknown Source)
> at java.io.InputStreamReader.read(Unknown Source)
> at java.io.Reader.read(Unknown Source)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)