[ https://issues.apache.org/jira/browse/JENA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363115#comment-17363115 ]
Andy Seaborne edited comment on JENA-2117 at 6/14/21, 7:18 PM:
---------------------------------------------------------------
-[~lgu.spree] Please try Jena 4.1.0-
The code changes are not in 4.1.0. There is some development in progress that
affects this area, though it is not yet clear whether this can be handled in
Java at all (without rewriting a Java string implementation!).
The error is (first) on line 900499 where there is: U+D83D, U+DD14 which is a
surrogate pair.
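For reference, a minimal sketch of how those two UTF-16 code units combine into a single code point, using only the JDK's `Character` methods. The class and method names (`SurrogatePairDemo`, `combine`) are made up for illustration; this is not Jena code and says nothing about how RIOT handles the input.

```java
final class SurrogatePairDemo {

    /** Combines a high and a low surrogate into the code point they encode. */
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // The two code units reported for line 900499.
        int codePoint = combine('\uD83D', '\uDD14');
        System.out.printf("U+%X%n", codePoint); // prints U+1F514
    }
}
```

U+1F514 lies outside the Basic Multilingual Plane, which is why it needs two `char` values in a Java string.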
Where did the data come from?
was (Author: andy.seaborne):
[~lgu.spree] Please try Jena 4.1.0
The error is (first) on line 900499 where there is: U+D83D, U+DD14 which is a
surrogate pair. That, and not being sensitive to the JDK supported version of
Unicode can both impact the data.
Where did the data come from?
> Is it possible to ignore RiotParseException in Apache Jena?
> -----------------------------------------------------------
>
> Key: JENA-2117
> URL: https://issues.apache.org/jira/browse/JENA-2117
> Project: Apache Jena
> Issue Type: Question
> Components: RIOT
> Affects Versions: Jena 3.17.0
> Reporter: Luigi Asprino
> Priority: Minor
>
> I'm parsing a file serialized in NQuads format which contains some annoying
> triples having some bad character (the Apache Jena parser throws a
> RiotParseException saying "Bad character encoding"). Is there any way (e.g.
> RDFParser setting) to ignore such exception and go ahead parsing the file?
>
> This is how I parse the file:
>
> {noformat}
> AtomicInteger ai = new AtomicInteger();
> StreamRDF s = new StreamRDFBase() {
>     @Override
>     public void triple(Triple triple) {
>         collect();
>     }
>
>     private void collect() {
>         ai.incrementAndGet();
>         if (ai.get() % 10000 == 0) {
>             System.out.println(ai.get());
>         }
>     }
>
>     @Override
>     public void quad(Quad quad) {
>         collect();
>     }
> };
> InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
>
> RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);
> {noformat}
> This is the file I'm trying to read
> https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
> The file has a problem (the first, at least) at line 899908:
> {noformat}
> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
>     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
>     at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
>     at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
>     at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
>     at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
>     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
>     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
>     at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
>     at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
>     at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
>     at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
> {noformat}
>
> I have experienced this "problem" many times and I found this workaround to
> cope with it.
> {noformat}
> RDFParserBuilder b = RDFParser.create().lang(Lang.NQUADS);
> BufferedReader br = new BufferedReader(new InputStreamReader(is), 4 * 1024);
> br.lines().parallel().forEach(l -> {
>     try {
>         b.fromString(l).parse(s);
>     } catch (Exception e) {
>         System.err.println(l);
>     }
> });
> {noformat}
> But this is slower and works only if the input file has one triple per line.
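One caveat about a line-by-line workaround like this: an `InputStreamReader` created with a charset (or with none) silently replaces malformed bytes with U+FFFD instead of failing, so a damaged line may be parsed in altered form rather than skipped. The sketch below shows one way to screen raw bytes with a strict JDK decoder before they reach a parser. It assumes the input is meant to be UTF-8 and that the "Bad character encoding" error comes from bytes that are not valid UTF-8. The class and method names (`Utf8LineCheck`, `isValidUtf8`) are made up for illustration and are not part of Jena.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8LineCheck {

    /** Returns true if the bytes decode as UTF-8 without any malformed input. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "<s> <p> \"ok\" .".getBytes(StandardCharsets.UTF_8);
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {(byte) 0x61, (byte) 0xFF, (byte) 0x62};

        System.out.println(isValidUtf8(good)); // prints true
        System.out.println(isValidUtf8(bad));  // prints false
    }
}
```

A check like this would have to run on the raw bytes of each line, before any `Reader` has decoded them, and it inherits the same limitation as the workaround above: it only helps when the file has one statement per line.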