[ https://issues.apache.org/jira/browse/JENA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363115#comment-17363115 ]

Andy Seaborne edited comment on JENA-2117 at 6/14/21, 7:18 PM:
---------------------------------------------------------------

-[~lgu.spree] Please try Jena 4.1.0-

The code changes are not in 4.1.0. There is some development in progress that 
affects this area, though it is not clear whether this can be handled in Java 
at all (without rewriting a Java string implementation!).

The first error is on line 900499, which contains U+D83D, U+DD14: a 
surrogate pair. 
Where did the data come from?
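
For readers unfamiliar with surrogate pairs, here is a minimal Java sketch of how the two code units named above combine into a single code point. The class and method names are made up for illustration and are not part of Jena. The sketch only shows the JDK's own view of the pair; it says nothing about how the characters are actually encoded in the data file.

```java
import java.nio.charset.StandardCharsets;

final class SurrogatePairDemo {
    /** Combines a high and a low surrogate into the supplementary code point they denote. */
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    /** Encodes a single code point as UTF-8 bytes. */
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int cp = combine('\uD83D', '\uDD14');
        System.out.println(String.format("U+%X", cp)); // prints U+1F514
        // A well-formed UTF-8 encoding of this code point is one 4-byte sequence.
        System.out.println(utf8(cp).length);           // prints 4
    }
}
```

In well-formed UTF-8 the pair is a single four-byte sequence, so neither half should appear as a character on its own.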



was (Author: andy.seaborne):
[~lgu.spree] Please try Jena 4.1.0

The error is (first) on line 900499 where there is: U+D83D, U+DD14 which is a 
surrogate pair. That, and not being sensitive to the JDK supported version of 
Unicode can both impact the data.

Where did the data come from?


> Is it possible to ignore RiotParseException in Apache Jena?
> -----------------------------------------------------------
>
>                 Key: JENA-2117
>                 URL: https://issues.apache.org/jira/browse/JENA-2117
>             Project: Apache Jena
>          Issue Type: Question
>          Components: RIOT
>    Affects Versions: Jena 3.17.0
>            Reporter: Luigi Asprino
>            Priority: Minor
>
> I'm parsing a file serialized in NQuads format which contains some annoying 
> triples having some bad character (the Apache Jena parser throws a 
> RiotParseException saying "Bad character encoding"). Is there any way (e.g. 
> RDFParser setting) to ignore such exception and go ahead parsing the file?
>  
> This is how I parse the file:
>  
> {noformat}
> AtomicInteger ai = new AtomicInteger();
> StreamRDF s = new StreamRDFBase() {
>     @Override
>     public void triple(Triple triple) {
>         collect();
>     }
>     // Count every triple or quad and print progress every 10000 items.
>     private void collect() {
>         int n = ai.incrementAndGet();
>         if (n % 10000 == 0) {
>             System.out.println(n);
>         }
>     }
>     @Override
>     public void quad(Quad quad) {
>         collect();
>     }
> };
> InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
> RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);
> {noformat}
> This is the file I'm trying to read: 
> https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
> The first problem (at least) is on line 899908:
> {noformat}
> Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
>       at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
>       at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
>       at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
>       at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
>       at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
>       at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
>       at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
>       at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
>       at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
>       at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
>       at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
> {noformat}
>  
> I have experienced this "problem" many times and I found this workaround to 
> cope with it. 
> {noformat}
> // Parse each line independently so that a bad line can be skipped.
> BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 4 * 1024);
> br.lines().parallel().forEach(l -> {
>     try {
>         // A fresh builder per line: a shared RDFParserBuilder is mutable.
>         RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s);
>     } catch (Exception e) {
>         System.err.println(l);
>     }
> });
> {noformat}
> But this is slower and works only if the input file has one triple per line. 
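
The line-by-line workaround quoted above can also be sketched without any Jena dependency, as a sequential loop that collects the rejected lines instead of printing them. This is only an illustration: `LineByLineSketch`, `parseSkippingBadLines` and the `parseLine` callback are hypothetical names, and in real use the callback would wrap a per-line call such as the `RDFParser` one in the workaround. It relies on the parser reporting failures as unchecked exceptions, which is the case for `RiotException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class LineByLineSketch {
    /**
     * Feeds every non-empty line to parseLine. Lines for which parseLine
     * throws are skipped and returned to the caller for later inspection.
     */
    static List<String> parseSkippingBadLines(BufferedReader reader,
                                              Consumer<String> parseLine) throws IOException {
        List<String> rejected = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            try {
                parseLine.accept(line);
            } catch (RuntimeException e) {
                rejected.add(line); // keep going with the next line
            }
        }
        return rejected;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real parser: accepts only lines that end with " ."
        Consumer<String> fakeParser = l -> {
            if (!l.endsWith(" .")) {
                throw new IllegalArgumentException("bad line");
            }
        };
        String data = "<s> <p> <o> <g> .\nbroken\n<s2> <p> <o> <g> .\n";
        List<String> rejected =
            parseSkippingBadLines(new BufferedReader(new StringReader(data)), fakeParser);
        System.out.println("rejected: " + rejected);
    }
}
```

Like the original workaround, this only applies to line-oriented formats such as N-Quads and N-Triples, where each line is a complete statement.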



