On 25/08/11 07:48, Paolo Castagna wrote:
Hi Monika, (Hi Ian),
Ian has already answered your question.

However, I want to add a similar use case we have in relation to errors or
malformed RDF input files. When loading large RDF files we typically use
N-Triples or N-Quads, and we want to continue parsing the file even if there
are a few errors (i.e. invalid lines).

We use RIOT and, even though there is no feature to tell the parser to ignore
an error, skip the line and continue parsing, it's not expensive to construct
a LangNQuads object for each line of your input. So, this is what we often do:

     // One input line at a time; profile and sink are set up elsewhere.
     String line = ...
     Tokenizer tokenizer = TokenizerFactory.makeTokenizerString(line);
     LangNQuads parser = new LangNQuads(tokenizer, profile, sink) ;
     parser.parse();

You can then catch any exception and continue processing the next line.
We do the same when we write MapReduce jobs, for example here [1] or here
[2]. (*)
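
The "parse each line, catch, continue" loop described above could be sketched
roughly as follows. This is only an illustration using the standard library:
the class TolerantLineLoader, its Result type and the lineParser argument are
hypothetical names, not part of RIOT. In real use lineParser would build the
tokenizer and LangNQuads for the line and call parse(); the sketch assumes
that a parse failure surfaces as an unchecked exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: parse input one line at a time, report bad lines, keep going. */
final class TolerantLineLoader {

    /** Outcome of a load: how many lines parsed and which line numbers failed. */
    static final class Result {
        final int parsed;
        final List<Integer> badLines;

        Result(int parsed, List<Integer> badLines) {
            this.parsed = parsed;
            this.badLines = badLines;
        }
    }

    /**
     * lineParser is a hypothetical stand-in for "create a tokenizer and a
     * parser for this line, then parse it". It is assumed to throw a
     * RuntimeException when the line is invalid.
     */
    static Result load(Iterable<String> lines, Consumer<String> lineParser) {
        int parsed = 0;
        int lineNo = 0;
        List<Integer> badLines = new ArrayList<>();
        for (String line : lines) {
            lineNo++;
            if (line.trim().isEmpty()) {
                continue; // blank line: nothing to parse
            }
            try {
                lineParser.accept(line);
                parsed++;
            } catch (RuntimeException e) {
                // Report the problem as a warning and carry on with the next line.
                badLines.add(lineNo);
                System.err.println("Skipping line " + lineNo + ": " + e.getMessage());
            }
        }
        return new Result(parsed, badLines);
    }

    public static void main(String[] args) {
        // Toy parser for the demo only: rejects lines that do not end with " ."
        Consumer<String> toyParser = l -> {
            if (!l.trim().endsWith(" .")) {
                throw new IllegalArgumentException("missing final ' .'");
            }
        };
        List<String> input = Arrays.asList(
                "<s> <p> <o> <g> .",
                "this line is broken",
                "<s> <p> \"x\" <g> .");
        Result r = load(input, toyParser);
        System.out.println(r.parsed + " parsed, bad lines: " + r.badLines);
    }
}
```

The cost of this approach is one parser object per line, which is the
trade-off already mentioned above.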

Maybe it's not that difficult to add a feature to RIOT's LangNQuads parser to
report errors but skip to the next line and continue parsing. However, I think
this is close to impossible for the RDF/XML or Turtle serializations.

The recovery would also need to be incorporated in the tokenizer (e.g. for a missing closing ").

For N-Triples and N-Quads, I think the best way is to use text processing (regexes, Perl etc. or Java) on the input to check for basic structural validity before passing it on to RIOT. Otherwise tricky cases, such as a missing closing ", would need to be caught in the lexer, making it complicated and potentially slower.
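
As a rough illustration of such a pre-check in Java, the sketch below tests
whether a line has the basic shape of an N-Triples or N-Quads statement. The
class name NQuadsLineCheck and the regular expressions are assumptions made
for this example; they are deliberately coarse and do not implement the full
grammar (for instance, IRI contents and escape sequences are not validated).
Lines that fail the check can be logged and dropped before the rest is handed
to the real parser.

```java
import java.util.regex.Pattern;

/** Sketch: coarse structural check for one N-Triples / N-Quads line. */
final class NQuadsLineCheck {

    // <...> with no whitespace, quotes or angle brackets inside
    private static final String IRI = "<[^<>\"\\s]*>";
    // _:label (simplified label characters)
    private static final String BNODE = "_:[A-Za-z0-9_][A-Za-z0-9_.-]*";
    // "..." with backslash escapes, then an optional @lang or ^^<datatype>
    private static final String LITERAL =
            "\"(?:[^\"\\\\]|\\\\.)*\""
            + "(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\\^\\^" + IRI + ")?";

    // subject predicate object [graph] .
    private static final Pattern STATEMENT = Pattern.compile(
            "\\s*(?:" + IRI + "|" + BNODE + ")"
            + "\\s+" + IRI
            + "\\s+(?:" + IRI + "|" + BNODE + "|" + LITERAL + ")"
            + "(?:\\s+(?:" + IRI + "|" + BNODE + "))?"
            + "\\s*\\.\\s*");

    /** True if the whole line looks like a single triple or quad. */
    static boolean looksLikeStatement(String line) {
        return STATEMENT.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<http://example/s> <http://example/p> \"hello\" .",
            "<http://example/s> <http://example/p> \"hello .",   // missing closing quote
            "<http://example/s> <http://example/p> <http://example/o> <http://example/g> ."
        };
        for (String s : samples) {
            System.out.println(looksLikeStatement(s) + "  " + s);
        }
    }
}
```

A line with an unterminated literal is rejected here because the literal
pattern requires a closing quote, so the lexer never sees that case.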

It's sort of doable for Turtle. Recovery could be to skip to the next DOT.

RDF/XML - it's nearly impossible because the error may be in the XML structure, which is processed by the XML parser, not the RDF/XML parser. It would need help from, and possibly quite tight integration with, the XML parser itself.

        Andy


Paolo

  [1] https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/FirstMapper.java
  [2] https://github.com/castagna/tdbloader3/blob/master/src/main/java/com/talis/labs/tdb/tdbloader3/io/QuadRecordReader.java


(*)
By the way, if someone wants to help me remove the bottleneck caused by the
fact that I am using a single reducer in the first MapReduce job of
tdbloader3, or has ideas on how it could be done, let me know.

Monika Solanki wrote:
Is it possible to check if the incoming data is legal RDF before reading
into the model? I do not want my program to throw an error via
RDFDefaultErrorHandler if the incoming data is illegal RDF. I only want
a warning to be issued and the program should continue execution. If
there are any supporting examples, that would be very helpful.

Thanks,

Monika
