[jira] [Updated] (ANY23-417) Inherent problems with mimetype detection

Hans Brende (JIRA) Thu, 01 Nov 2018 08:26:16 -0700


     [ 
https://issues.apache.org/jira/browse/ANY23-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hans Brende updated ANY23-417:
------------------------------
    Description: 
N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle is 
a subset of TriG.

But when we are performing mimetype detection on a plain text file, we only 
sniff the first few kilobytes of data. Therefore, something we initially detect 
as N-Triples may in fact be a Turtle, TriG, or N-Quads document. Something we 
initially detect as Turtle may in fact be a TriG document.

Therefore, if we detect that the document is Turtle, in the absence of a 
declared Content-Type, we should probably assume that it is actually TriG, 
just in case.

If we can only detect that the document is N-Triples, that presents a problem, 
because it could also be either Turtle or N-Quads. Which do we choose?
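
To make the dilemma concrete, here is a minimal sketch of the subset relations 
described above. The class, enum, and method names are hypothetical and exist 
only for illustration; they are not Any23 or RDF4J types. The sketch encodes 
only the three relations stated in this issue (plus transitivity) and assumes 
no other subset relation holds between these formats. Under that assumption, a 
sniffed Turtle document can safely be widened to TriG, but a sniffed N-Triples 
document has no single safe choice.

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;

final class SniffedFormatCandidates {

    /** Hypothetical enum for illustration; not an Any23 or RDF4J type. */
    enum Format { N_TRIPLES, TURTLE, N_QUADS, TRIG }

    /**
     * Every format whose parser accepts all documents of {@code f},
     * including {@code f} itself. These are also the formats the full
     * document might turn out to be when only its head was sniffed as f.
     */
    static Set<Format> supersetsOf(Format f) {
        switch (f) {
            case N_TRIPLES:
                // Subset of Turtle and of N-Quads, and (via Turtle) of TriG.
                return EnumSet.allOf(Format.class);
            case TURTLE:
                return EnumSet.of(Format.TURTLE, Format.TRIG);
            default:
                // Assumption: N-Quads and TriG have no superset in this model.
                return EnumSet.of(f);
        }
    }

    /**
     * Returns a format that can parse every candidate for the sniffed
     * format, or empty if no single format covers all of them.
     */
    static Optional<Format> safeChoice(Format sniffed) {
        Set<Format> candidates = supersetsOf(sniffed);
        for (Format choice : Format.values()) {
            boolean coversAll = true;
            for (Format candidate : candidates) {
                if (!supersetsOf(candidate).contains(choice)) {
                    coversAll = false;
                    break;
                }
            }
            if (coversAll) {
                return Optional.of(choice);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (Format f : Format.values()) {
            System.out.println(f + ": candidates " + supersetsOf(f)
                    + ", safe choice " + safeChoice(f));
        }
    }
}
```

In this model, {{safeChoice(Format.TURTLE)}} yields TriG, while 
{{safeChoice(Format.N_TRIPLES)}} is empty, because nothing here covers both 
N-Quads and TriG. Any fix would therefore need either a tie-breaking rule or a 
fallback from one parser to the other.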

Another problem I see is that we are detecting both N3 and Turtle in two 
separate steps. However, as I understand it, for the purposes of RDF, N3 is 
essentially a synonym for Turtle. So it doesn't really make sense to use two 
different detection steps for this. It appears that our N3 detection step is 
actually detecting N-Triples, which is not at all the same thing.

(Indeed, in {{org.eclipse.rdf4j.rio.n3.N3ParserFactory}}'s implementation of 
{{getParser()}} we see: {{return new TurtleParser()}})



  was:
N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle is 
a subset of TriG.

But when we are performing mimetype detection on a plain text file, we only 
sniff the first few kilobytes of data. Therefore, something we initially detect 
as N-Triples may in fact be a Turtle, Trig, or NQuads document. Something we 
initially detect as Turtle may in fact be a TriG document.

Therefore, if we detect that the document is Turtle, in the absence of a 
declared Content-Type, we should probably assume that it actually TriG, just in 
case.

If we can only detect that the document is N-Triples, that presents a problem, 
because it could also be either Turtle or N-Quads. Which do we choose?

Another problem I see is that we are detecting both N3 and Turtle in two 
separate steps. However, as I understand it, for the purposes of RDF, N3 is 
essentially a synonym for Turtle. So it doesn't really make sense to use two 
different detection steps for this. It appears that our N3 detection step is 
actually detecting N-Triples, which is not at all the same thing.




> Inherent problems with mimetype detection
> -----------------------------------------
>
>                 Key: ANY23-417
>                 URL: https://issues.apache.org/jira/browse/ANY23-417
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle 
> is a subset of TriG.
> But when we are performing mimetype detection on a plain text file, we only 
> sniff the first few kilobytes of data. Therefore, something we initially 
> detect as N-Triples may in fact be a Turtle, TriG, or N-Quads document. 
> Something we initially detect as Turtle may in fact be a TriG document.
> Therefore, if we detect that the document is Turtle, in the absence of a 
> declared Content-Type, we should probably assume that it is actually TriG, 
> just in case.
> If we can only detect that the document is N-Triples, that presents a 
> problem, because it could also be either Turtle or N-Quads. Which do we 
> choose?
> Another problem I see is that we are detecting both N3 and Turtle in two 
> separate steps. However, as I understand it, for the purposes of RDF, N3 is 
> essentially a synonym for Turtle. So it doesn't really make sense to use two 
> different detection steps for this. It appears that our N3 detection step is 
> actually detecting N-Triples, which is not at all the same thing.
> (Indeed, in {{org.eclipse.rdf4j.rio.n3.N3ParserFactory}}'s implementation of 
> {{getParser()}} we see: {{return new TurtleParser()}})



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
