Re: Upgrade to Tika 1.2 [WAS] Re: [ANNOUNCE] Welcome Peter Ansell as Any23 PPMC member and committer

Michele Mostarda Sun, 19 Aug 2012 03:09:42 -0700

On 9 August 2012 01:19, Peter Ansell <[email protected]> wrote:

> On 8 August 2012 19:49, Lewis John Mcgibbney <[email protected]>
> wrote:
> > Hi Michele,
> >
> > On Wed, Aug 8, 2012 at 10:12 AM, Michele Mostarda
> > <[email protected]> wrote:
> >
> >> Really good initiatives, the only thing I would stress is to avoid
> >> breaking the support for IRI in N-Quads[0] present in the current
> >> Any23 version of the parser.
> >
> > My interpretation of what you are saying is that the current
> > Any23 configuration/support for N-Quads does not conform to a standard
> > approach for Tika mimeType detection and extraction? Is this correct?
> > I am not quite getting you here, can you elaborate please?
>
> I interpreted it to mean that Michele would like to keep UTF-8 IRI
> support in the parser, as opposed to the standards which specify that
> "URI references" (to use the RDF-1.0 spec terminology) must have their
> non-US-ASCII and URI reserved characters encoded before they are
> written out to N-Triples and N-Quads files. The SesameTools N-Quads
> parser that Jeen Broekstra and I are currently favouring sits very
> shallowly on top of the current Sesame Rio N-Triples parser/writer
> that only supports US-ASCII. By comparison the current Any23 parser
> completely implements the N-Triples/N-Quads spec itself including
> non-standard features such as unencoded UTF-8 support for both IRIs
> and literals, relative URIs, and blank node identifiers that start
> with numbers (where the spec says that blank node identifiers must
> start with a letter).
>
> >>
> >> What I suggest as a general approach is to add flags to enforce
> >> validation, or just to produce warnings when non-standard data is
> >> detected, instead of not supporting non-fully-standard data at all.
> >
> > +1
>
> I am currently going to develop the parser under SesameTools on GitHub
> until the Aduna/Vound CLAs get sorted out.
>
> If you want to fork SesameTools, and develop a patch for my
> feature/any23nquadstests branch [1] to modify both
> ModifiedNTriplesParser and NQuadsParser to support lenient and strict
> parsing modes then it would be great. Once the parser gets put into
> Sesame I would like the parser to sit on top of the Sesame Rio
> NTriplesParser (which will very likely be replaced with whatever is in
> ModifiedNTriplesParser at the time it is put in.)
>
+1
Completely agree!
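
To make the lenient/strict idea a bit more concrete, here is a rough
sketch in Java of how such a flag could behave for one of the cases
mentioned above (blank node identifiers that start with a number). It
is only an illustration under assumptions: the class BlankNodeLabelCheck,
its Mode enum and its methods are hypothetical names and are not taken
from SesameTools, Sesame Rio or Any23. The only thing borrowed from the
spec is the N-Triples rule that a blank node name is a letter followed
by letters or digits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SesameTools, Sesame Rio or Any23.
class BlankNodeLabelCheck {

    // STRICT enforces validation, LENIENT only produces warnings.
    enum Mode { STRICT, LENIENT }

    private final Mode mode;
    private final List<String> warnings = new ArrayList<>();

    BlankNodeLabelCheck(Mode mode) {
        this.mode = mode;
    }

    // N-Triples grammar: name ::= [A-Za-z][A-Za-z0-9]*
    static boolean isStandardLabel(String label) {
        return label.matches("[A-Za-z][A-Za-z0-9]*");
    }

    // Returns the label unchanged. In STRICT mode a non-standard label is
    // rejected; in LENIENT mode a warning is recorded and parsing goes on.
    String accept(String label) {
        if (!isStandardLabel(label)) {
            String message = "Non-standard blank node label: _:" + label;
            if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(message);
            }
            warnings.add(message);
        }
        return label;
    }

    // Warnings collected so far (empty in STRICT mode).
    List<String> warnings() {
        return warnings;
    }
}
```

The same switch could presumably cover unencoded UTF-8 in IRIs and
literals and relative URIs, so that one setting decides whether
non-standard input is an error or just a warning.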



>
> If we fail to get lenient parsing into the resulting parser, a
> last-ditch attempt to support people who produce UTF-8 N-Quads
> documents may be to keep the Any23 parser around and just import it
> instead (while excluding the Sesame Rio N-Quads parser using Maven if
> it gets introduced as a dependency anywhere).
>

+1


>
> Cheers,
>
> Peter
>

All the best,
Mic


>
> [1] https://github.com/ansell/sesametools/tree/feature/any23nquadstests
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site : http://www.michelemostarda.com
