On 8 August 2012 19:49, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Michele,
>
> On Wed, Aug 8, 2012 at 10:12 AM, Michele Mostarda
> <[email protected]> wrote:
>
>> Really good initiatives, the only thing I would stress is to avoid breaking
>> the support
>> for IRI in N-Quads[0] present in the current Any23 version of the parser.
>
> My interpretation of what you are saying is that the current
> Any23 configuration/support for N-Quads does not conform to a standard
> approach for tika mimeType detection and extraction? Is this correct?
> I am not quite getting you here, can you elaborate please?

I interpreted it to mean that Michele would like to keep UTF-8 IRI support in the parser, as opposed to following the standards, which specify that "URI references" (to use the RDF-1.0 spec terminology) must have their non-US-ASCII and URI reserved characters encoded before they are written out to N-Triples and N-Quads files.

The SesameTools N-Quads parser that Jeen Broekstra and I are currently favouring sits very shallowly on top of the current Sesame Rio N-Triples parser/writer, which only supports US-ASCII. By comparison, the current Any23 parser implements the N-Triples/N-Quads spec entirely on its own, including non-standard features such as unencoded UTF-8 support for both IRIs and literals, relative URIs, and blank node identifiers that start with numbers (where the spec says that blank node identifiers must start with a letter).

>>
>> What I suggest as general approach is to add flags to enforce validation or
>> just to produce
>> warnings when non standard data is detected instead than avoid supporting
>> non fully standard data at all.
>
> +1

I am going to develop the parser under SesameTools on GitHub until the Aduna/Vound CLAs get sorted out. If you want to fork SesameTools and develop a patch for my feature/any23nquadstests branch [1] that modifies both ModifiedNTriplesParser and NQuadsParser to support lenient and strict parsing modes, that would be great. Once the parser gets put into Sesame, I would like it to sit on top of the Sesame Rio NTriplesParser (which will very likely be replaced with whatever is in ModifiedNTriplesParser at the time it is put in).

If we fail to get lenient parsing into the resulting parser, a last-ditch attempt to support people who produce UTF-8 N-Quads documents may be to keep the Any23 parser around and just import it instead (while excluding the Sesame Rio N-Quads parser using Maven if it gets introduced as a dependency anywhere).

Cheers,

Peter

[1] https://github.com/ansell/sesametools/tree/feature/any23nquadstests
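
For illustration only, the lenient/strict distinction discussed above could look roughly like the following Java sketch. It is not code from SesameTools, Sesame Rio or Any23: the names StrictnessSketch, Mode and Checker are hypothetical, and it only covers two of the deviations mentioned (unencoded non-US-ASCII characters in IRIs, and blank node identifiers that do not start with a letter). Strict mode rejects the input, while lenient mode records a warning and carries on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in SesameTools, Sesame Rio or Any23.
class StrictnessSketch {

    enum Mode { STRICT, LENIENT }

    static final class Checker {
        final Mode mode;
        final List<String> warnings = new ArrayList<String>();

        Checker(Mode mode) {
            this.mode = mode;
        }

        // STRICT: fail fast. LENIENT: record a warning and carry on.
        private void report(String message) {
            if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(message);
            }
            warnings.add(message);
        }

        // Flags IRIs that contain unencoded non-US-ASCII characters.
        void checkIri(String iri) {
            for (int i = 0; i < iri.length(); i++) {
                if (iri.charAt(i) > 0x7F) {
                    report("Unencoded non-US-ASCII character in IRI: " + iri);
                    return;
                }
            }
        }

        // Flags blank node labels that do not start with an ASCII letter.
        void checkBlankNodeLabel(String label) {
            char first = label.isEmpty() ? '\0' : label.charAt(0);
            boolean letter = (first >= 'A' && first <= 'Z') || (first >= 'a' && first <= 'z');
            if (!letter) {
                report("Blank node label does not start with a letter: " + label);
            }
        }
    }

    public static void main(String[] args) {
        // Lenient mode accepts the non-standard input but remembers what it saw.
        Checker lenient = new Checker(Mode.LENIENT);
        lenient.checkIri("http://example.org/caf\u00e9");
        lenient.checkBlankNodeLabel("123node");
        System.out.println(lenient.warnings.size() + " warning(s) in lenient mode");

        // Strict mode refuses the same input outright.
        try {
            new Checker(Mode.STRICT).checkBlankNodeLabel("123node");
        } catch (IllegalArgumentException e) {
            System.out.println("Strict mode rejected input: " + e.getMessage());
        }
    }
}
```

A real patch would presumably hook checks like these into the parser's own error reporting rather than a standalone class; the point here is only that a single mode flag can decide between failing and warning.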
