N3/NQ parsers ignoring stopAtFirstError flag
--------------------------------------------
Key: ANY23-49
URL: https://issues.apache.org/jira/browse/ANY23-49
Project: Apache Any23
Issue Type: Bug
Environment: Any23 0.6.1 and repository
Reporter: Hannes Mühleisen
Attachments: RobustNquadsParser.java
The base interface for all RDF parsers (org.openrdf.rio.RDFParser) defines a
method setStopAtFirstError. The documentation for this method reads: "Sets
whether the parser should stop immediately if it finds an error in the data".
This is indeed very useful, as many data sets "out there" contain a number of
malformed entries.
However, as far as I can tell from the current source code (0.6.1 and SVN
trunk), both the NTriples parser
(org.openrdf.rio.ntriples.NTriplesParser.NTriplesParser) and the NQuadsParser
(org.deri.any23.parser.NQuadsParser) ignore this flag. In their respective
implementations, they run through the entire file in an unchecked loop (see
http://code.google.com/p/any23/source/browse/trunk/any23-core/src/main/java/org/deri/any23/io/nquads/NQuadsParser.java#100):
    while (parseLine(fileReader)) {
        nextRow();
    }
Now, if parsing any line in a potentially huge file throws an exception, the
entire parsing process stops, regardless of the setting of the
"stopAtFirstError" flag. I propose changing these loops to honor this flag, so
that when it is set to "false", the rest of the broken line is discarded and
the parsing process continues with the next line.
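
The proposed behavior can be sketched independently of Any23 and Sesame. The following is a minimal illustration using only the JDK, not the actual parser code: the class LenientLineLoop, the methods parseStatement and parseAll, and the simplified "three terms followed by a dot" line format are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LenientLineLoop {

    // Hypothetical stand-in for a real statement parser: accepts a line made of
    // exactly three whitespace-separated terms followed by a terminating '.'.
    static String[] parseStatement(String line) {
        String trimmed = line.trim();
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating '.'");
        }
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        String[] terms = body.split("\\s+");
        if (terms.length != 3) {
            throw new IllegalArgumentException("Expected 3 terms, found " + terms.length);
        }
        return terms;
    }

    // Parses the input line by line. If stopAtFirstError is true, the first
    // malformed line aborts parsing; otherwise the error is recorded and the
    // loop continues with the next line.
    static List<String[]> parseAll(BufferedReader reader,
                                   boolean stopAtFirstError,
                                   List<String> errors) throws IOException {
        List<String[]> statements = new ArrayList<>();
        String line;
        int row = 0;
        while ((line = reader.readLine()) != null) {
            row++;
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            try {
                statements.add(parseStatement(line));
            } catch (IllegalArgumentException e) {
                if (stopAtFirstError) {
                    throw e; // strict mode: give up on the first bad line
                }
                errors.add("row " + row + ": " + e.getMessage());
            }
        }
        return statements;
    }

    public static void main(String[] args) throws IOException {
        String data = "<a> <b> <c> .\nbroken line\n<d> <e> <f> .\n";
        List<String> errors = new ArrayList<>();
        List<String[]> ok = parseAll(new BufferedReader(new StringReader(data)), false, errors);
        System.out.println(ok.size() + " statements, " + errors.size() + " errors");
    }
}
```

Because this sketch reads whole lines, discarding the remainder of a broken line is implicit. A character-based parser such as NQuadsParser has to skip to the next line break explicitly.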
I have implemented this behavior on the latest version of NQuadsParser from SVN
(r1601); the source file is attached. I have changed the parseLine() method as
follows:
    private boolean parseLine(BufferedReader br) throws IOException,
            RDFParseException, RDFHandlerException {
        // [...]
        try {
            // [...]
            // notifyStatement moved into try block
            notifyStatement(sub, pred, obj, graph);
        } catch (EOS eos) {
            reportFatalError("Unexpected end of line.", row, col);
            throw new IllegalStateException();
        } catch (IllegalArgumentException iae) {
            if (!stopAtFirstError()) {
                // remove remainder of broken line
                consumeBrokenLine(br);
                // notify parse error listener
                reportError(iae.getMessage(), row, col);
            } else {
                throw new RDFParseException(iae);
            }
        }
        // [...]
    }

    private void consumeBrokenLine(BufferedReader br) throws IOException {
        char c;
        while (true) {
            mark(br);
            c = readChar(br);
            if (c == '\n') {
                return;
            }
        }
    }
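
The excerpt does not show how readChar(br) behaves when the input ends before a line break, for example when the broken line is the last one in the file. As a hedged, JDK-only sketch of the same idea with end of input handled explicitly (the class SkipLine and the method skipRestOfLine are hypothetical names, not part of Any23):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SkipLine {

    // Discards characters up to and including the next '\n'.
    // Returns true if a line break was consumed, false if the end of the
    // stream was reached first.
    static boolean skipRestOfLine(Reader reader) throws IOException {
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("rest of a bad line\nnext line");
        boolean moreInput = skipRestOfLine(reader);
        System.out.println("line break found: " + moreInput);
        System.out.println("next char: " + (char) reader.read());
    }
}
```

Returning a boolean lets the caller decide whether another line remains to be parsed.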
It would be great if this or a similar change found its way into the Any23
parsers.