[ https://issues.apache.org/jira/browse/JENA-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-2117:
--------------------------------
    Description: 
I'm parsing a file serialized in N-Quads format that contains some triples
with bad characters (the Apache Jena parser throws a RiotParseException
saying "Bad character encoding"). Is there any way (e.g. an RDFParser
setting) to ignore such exceptions and continue parsing the file?
 
This is how I parse the file:
 
{noformat}
AtomicInteger ai = new AtomicInteger();
StreamRDF s = new StreamRDFBase() {

    @Override
    public void triple(Triple triple) {
        collect();
    }

    // Count every triple or quad and print progress every 10000 items.
    private void collect() {
        int n = ai.incrementAndGet();
        if (n % 10000 == 0) {
            System.out.println(n);
        }
    }

    @Override
    public void quad(Quad quad) {
        collect();
    }
};

InputStream is = new GZIPInputStream(new FileInputStream(new File("data.nq.gz")), 4 * 1024);
RDFParser.create().source(is).lang(Lang.NQUADS).strict(false).parse(s);
{noformat}

This is the file I'm trying to read:
https://www.dropbox.com/s/yfrexouusz62m5n/data.nq.gz?dl=0
The file has a problem (the first one, at least) at line 899908:

{noformat}
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 899908, col: 154] Bad character encoding
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:153)
        at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
        at org.apache.jena.riot.lang.LangNQuads.parseOne(LangNQuads.java:72)
        at org.apache.jena.riot.lang.LangNQuads.runParser(LangNQuads.java:53)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
        at org.apache.jena.riot.RDFParser.read(RDFParser.java:353)
        at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:343)
        at org.apache.jena.riot.RDFParser.parse(RDFParser.java:292)
        at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:540)
{noformat}
 
I have run into this "problem" many times, and I found the following
workaround to cope with it:

{noformat}
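BufferedReader br = new BufferedReader(
        new InputStreamReader(is, StandardCharsets.UTF_8), 4 * 1024);
br.lines().parallel().forEach(l -> {
    try {
        // Build a fresh parser for each line: an RDFParserBuilder is
        // mutable, so sharing one across parallel threads is not safe.
        RDFParser.create().lang(Lang.NQUADS).fromString(l).parse(s);
    } catch (Exception e) {
        // Print the offending line and carry on with the rest.
        System.err.println(l);
    }
});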
{noformat}
But this is slower and works only if the input file has one triple per line. 
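
A cheaper variant of the same idea is to validate the bytes of each line
before any parsing happens. The following is only a sketch, and it assumes
that the "Bad character encoding" error is caused by bytes that are not
valid UTF-8, which the stack trace alone does not confirm. The class name
Utf8LineFilter and its methods are hypothetical; only JDK classes are used.
It decodes each line strictly and drops the lines that fail, so it still
relies on one statement per line.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper (not part of Jena): filters out lines that are not valid UTF-8.
final class Utf8LineFilter {

    private Utf8LineFilter() {}

    /** Decodes the bytes as UTF-8; returns null if they are malformed. */
    public static String decodeStrict(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /**
     * Splits the stream on '\n', passes every well-formed line to the sink,
     * and returns the number of lines skipped because of malformed bytes.
     * Splitting on the raw byte is safe because 0x0A never occurs inside a
     * multi-byte UTF-8 sequence.
     */
    public static long forEachValidLine(InputStream in, Consumer<String> sink) {
        long skipped = 0;
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        try (InputStream buffered = new BufferedInputStream(in)) {
            int b;
            while ((b = buffered.read()) != -1) {
                if (b != '\n') {
                    line.write(b);
                } else if (!emit(line, sink)) {
                    skipped++;
                }
            }
            // Handle a final line that has no trailing newline.
            if (line.size() > 0 && !emit(line, sink)) {
                skipped++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return skipped;
    }

    // Decodes the buffered line, clears the buffer, and forwards the text if valid.
    private static boolean emit(ByteArrayOutputStream line, Consumer<String> sink) {
        String text = decodeStrict(line.toByteArray());
        line.reset();
        if (text == null) {
            return false;
        }
        sink.accept(text);
        return true;
    }
}
```

With the stream from above, a call such as
Utf8LineFilter.forEachValidLine(is, l -> ...) would hand each surviving
line to the same per-line parse call used in the workaround. Lines that
are valid UTF-8 but contain other syntax errors would still need the
try/catch.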


> Is it possible to ignore RiotParseException in Apache Jena?
> -----------------------------------------------------------
>
>                 Key: JENA-2117
>                 URL: https://issues.apache.org/jira/browse/JENA-2117
>             Project: Apache Jena
>          Issue Type: Question
>          Components: RIOT
>    Affects Versions: Jena 3.17.0
>            Reporter: Luigi Asprino
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
