Re: Upgrade to Tika 1.2 [WAS] Re: [ANNOUNCE] Welcome Peter Ansell as Any23 PPMC member and committer

Peter Ansell Wed, 08 Aug 2012 16:20:22 -0700

On 8 August 2012 19:49, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Michele,
>
> On Wed, Aug 8, 2012 at 10:12 AM, Michele Mostarda
> <[email protected]> wrote:
>
>> Really good initiatives, the only thing I would stress is to avoid breaking
>> the support
>> for IRI in N-Quads[0] present in the current Any23 version of the parser.
>
> My interpretation of what you are saying is that the current
> Any23 configuration/support for N-Quads does not conform to a standard
> approach for Tika mimeType detection and extraction? Is this correct?
> I am not quite getting you here, can you elaborate please?


I interpreted it to mean that Michele would like to keep UTF-8 IRI
support in the parser, as opposed to the standards which specify that
"URI references" (to use the RDF-1.0 spec terminology) must have their
non-US-ASCII and URI reserved characters encoded before they are
written out to N-Triples and N-Quads files. The SesameTools N-Quads
parser that Jeen Broekstra and I are currently favouring sits very
shallowly on top of the current Sesame Rio N-Triples parser/writer
that only supports US-ASCII. By comparison the current Any23 parser
completely implements the N-Triples/N-Quads spec itself including
non-standard features such as unencoded UTF-8 support for both IRIs
and literals, relative URIs, and blank node identifiers that start
with numbers (where the spec says that blank node identifiers must
start with a letter).
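
To make the difference concrete, here is a minimal Java sketch of two of
the checks a US-ASCII-only writer/parser would need. The class and method
names are hypothetical and are not part of Any23, Sesame Rio or
SesameTools. It only covers the \uXXXX and \UXXXXXXXX escapes for
characters outside US-ASCII and the "must start with a letter" rule for
blank node names; it ignores the other string escapes and any encoding
of URI reserved characters.

```java
// Hypothetical helper, for illustration only (not an existing API).
final class NTriplesAsciiSketch {

    // True if every character is in the US-ASCII range.
    static boolean isUsAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // Replace each non-US-ASCII code point with an N-Triples style
    // \uXXXX (BMP) or \UXXXXXXXX (beyond the BMP) escape.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.append((char) cp);
            } else if (cp <= 0xFFFF) {
                sb.append(String.format("\\u%04X", cp));
            } else {
                sb.append(String.format("\\U%08X", cp));
            }
        });
        return sb.toString();
    }

    // Blank node names in the spec start with a letter and continue
    // with letters or digits only.
    static boolean isStandardBlankNodeName(String name) {
        return name.matches("[A-Za-z][A-Za-z0-9]*");
    }
}
```

With this sketch, an IRI ending in "café" would be written out with the
last character as \u00E9, and a blank node name such as "1abc" would be
flagged as non-standard.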

>>
>> What I suggest as a general approach is to add flags to enforce validation or
>> just to produce
>> warnings when non-standard data is detected, rather than dropping support for
>> data that is not fully standard.
>
> +1

For now I am going to develop the parser under SesameTools on GitHub
until the Aduna/Vound CLAs get sorted out.

If you want to fork SesameTools and develop a patch for my
feature/any23nquadstests branch [1] that modifies both
ModifiedNTriplesParser and NQuadsParser to support lenient and strict
parsing modes, that would be great. Once the parser gets put into
Sesame I would like it to sit on top of the Sesame Rio
NTriplesParser (which will very likely be replaced with whatever is in
ModifiedNTriplesParser at the time it is put in).
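
As a rough idea of what lenient versus strict could look like, here is a
minimal Java sketch. Everything in it (the class, the Mode enum, the
checkIri method) is hypothetical and only illustrates the flag idea; it
is not the existing ModifiedNTriplesParser or NQuadsParser code, and a
real patch would hook into whatever error reporting the parser already
has.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a strict/lenient switch (not an existing API).
final class LenientStrictSketch {

    enum Mode { STRICT, LENIENT }

    private final Mode mode;
    private final List<String> warnings = new ArrayList<>();

    LenientStrictSketch(Mode mode) {
        this.mode = mode;
    }

    // In STRICT mode a non-US-ASCII IRI is rejected with an exception;
    // in LENIENT mode it is accepted and a warning is recorded.
    String checkIri(String iri, int lineNo) {
        boolean ascii = iri.chars().allMatch(c -> c < 0x80);
        if (!ascii) {
            String msg = "line " + lineNo
                    + ": non-US-ASCII character in IRI <" + iri + ">";
            if (mode == Mode.STRICT) {
                throw new IllegalArgumentException(msg);
            }
            warnings.add(msg);
        }
        return iri;
    }

    List<String> warnings() {
        return warnings;
    }
}
```

The same pattern would apply to the other non-standard features
(relative URIs, blank node identifiers starting with a digit): one
check per feature, with the mode deciding between an error and a
warning.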

If we fail to get lenient parsing into the resulting parser, a
last-ditch way to support people who produce UTF-8 N-Quads documents
may be to keep the Any23 parser around and just import it instead
(while excluding the Sesame Rio N-Quads parser using Maven if it gets
introduced as a dependency anywhere).

Cheers,

Peter

[1] https://github.com/ansell/sesametools/tree/feature/any23nquadstests
