On 9 August 2012 01:19, Peter Ansell <[email protected]> wrote:

> On 8 August 2012 19:49, Lewis John Mcgibbney <[email protected]>
> wrote:
> > Hi Michele,
> >
> > On Wed, Aug 8, 2012 at 10:12 AM, Michele Mostarda
> > <[email protected]> wrote:
> >
> >> Really good initiatives, the only thing I would stress is to avoid
> >> breaking the support for IRI in N-Quads[0] present in the current
> >> Any23 version of the parser.
> >
> > My interpretation of what you are saying is that the current
> > Any23 configuration/support for N-Quads does not conform to a standard
> > approach for Tika MIME type detection and extraction? Is this correct?
> > I am not quite getting you here, can you elaborate please?
>
> I interpreted it to mean that Michele would like to keep UTF-8 IRI
> support in the parser, as opposed to the standards which specify that
> "URI references" (to use the RDF-1.0 spec terminology) must have their
> non-US-ASCII and URI reserved characters encoded before they are
> written out to N-Triples and N-Quads files. The SesameTools N-Quads
> parser that Jeen Broekstra and I are currently favouring sits very
> shallowly on top of the current Sesame Rio N-Triples parser/writer
> that only supports US-ASCII. By comparison the current Any23 parser
> completely implements the N-Triples/N-Quads spec itself including
> non-standard features such as unencoded UTF-8 support for both IRIs
> and literals, relative URIs, and blank node identifiers that start
> with numbers (where the spec says that blank node identifiers must
> start with a letter).
>
> >>
> >> What I suggest as a general approach is to add flags to enforce
> >> validation or just to produce warnings when non-standard data is
> >> detected, rather than refusing to support non-standard data at all.
> >
> > +1
>
> I am currently going to develop the parser under SesameTools on GitHub
> until the Aduna/Vound CLAs get sorted out.
>
> If you want to fork SesameTools, and develop a patch for my
> feature/any23nquadstests branch [1] to modify both
> ModifiedNTriplesParser and NQuadsParser to support lenient and strict
> parsing modes then it would be great. Once the parser gets put into
> Sesame I would like the parser to sit on top of the Sesame Rio
> NTriplesParser (which will very likely be replaced with whatever is in
> ModifiedNTriplesParser at the time it is put in.)
>

+1 Completely agree!
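
To make the lenient/strict idea concrete, here is a minimal sketch in Java of how such a flag could behave. It is only an illustration: the class TermChecker and its methods are hypothetical names and are not part of Any23, Sesame or SesameTools. It covers just two of the cases mentioned above (blank node labels that start with a digit, and unencoded non-US-ASCII characters in URI references) and assumes that strict mode rejects the input while lenient mode only records a warning.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not an existing Any23/Sesame/SesameTools class. */
final class TermChecker {
    private final boolean strict;
    private final List<String> warnings = new ArrayList<>();

    public TermChecker(boolean strict) {
        this.strict = strict;
    }

    /** Checks a blank node label, given without its "_:" prefix. */
    public void checkBlankNodeLabel(String label) {
        // The N-Triples grammar requires the label to start with a letter.
        if (label.isEmpty() || !isAsciiLetter(label.charAt(0))) {
            report("blank node label must start with a letter: '" + label + "'");
        }
    }

    /** Checks that a URI reference contains only US-ASCII characters. */
    public void checkUriRef(String uri) {
        for (int i = 0; i < uri.length(); i++) {
            if (uri.charAt(i) > 0x7F) {
                report("non-US-ASCII character in URI reference: " + uri);
                return; // one report per URI reference is enough
            }
        }
    }

    /** Warnings collected so far; always empty in strict mode. */
    public List<String> getWarnings() {
        return warnings;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Strict mode fails fast; lenient mode keeps going and records a warning.
    private void report(String message) {
        if (strict) {
            throw new IllegalArgumentException(message);
        }
        warnings.add(message);
    }
}
```

With new TermChecker(false), a label such as "1abc" is accepted and a warning is recorded; with new TermChecker(true), the same input raises an IllegalArgumentException. A real parser would presumably route these reports through its existing error handling rather than a plain list.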
>
> If we fail to get lenient parsing into the resulting parser, a
> last-ditch attempt to support people who produce UTF-8 N-Quads documents
> may be to keep the Any23 parser around and just import it instead
> (while excluding the Sesame Rio N-Quads parser using Maven if it gets
> introduced as a dependency anywhere).
>

+1

> Cheers,
>
> Peter
>

The best.
Mic

> [1] https://github.com/ansell/sesametools/tree/feature/any23nquadstests

-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site: http://www.michelemostarda.com
