Re: Turtle file with UTF-8 BOM fails to parse

Andy Seaborne Sun, 19 Dec 2010 09:46:06 -0800

Rob,

Thanks - and fixed in RIOT (ARQ SVN on SF) and the current Jena readers (Jena CVS on SF). The fix covers TriG as well.

The same is true for SPARQL - the most direct fix is to skip the BOM at the start of the parse, but that requires a grammar change. That's what I did for Turtle/N3 in Jena. But the SPARQL grammar is the grammar used to produce the spec HTML and I don't want to contaminate the spec.

For now, I've added a wrapper that removes a leading BOM, so it's functionally correct. I can remove the wrapper and move the BOM processing to the grammar when the spec is frozen.
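
As a rough illustration of the wrapper approach (not the actual code committed to ARQ or Jena), a leading UTF-8 BOM can be dropped at the byte level before the stream reaches the parser. The class and method names below are hypothetical and exist only for this sketch; it assumes the input is UTF-8 and checks for the three bytes EF BB BF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration only; not part of Jena or ARQ.
final class BomSkipper {

    // Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present.
    // If there is no BOM, the bytes that were peeked are pushed back unchanged.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;

        // A single read may return fewer than 3 bytes, so loop until full or EOF.
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: put back whatever was read so the caller sees the original bytes.
        if (!isBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }
}
```

With a helper like this, a caller could pass `BomSkipper.skipUtf8Bom(in)` wherever it previously passed `in`, for example as the first argument to `model.read(...)`.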


        Andy

On 18/12/10 13:09, Rob Vesse wrote:

Hi Andy

I've created a JIRA issue for this -
https://issues.apache.org/jira/browse/JENA-12

I appreciate the need for minimal, complete examples as I have enough
trouble getting those out of users on my own support lists

Thanks,

Rob

On Fri, 17 Dec 2010 14:10:09 +0000, Andy Seaborne
<[email protected]>  wrote:

Hi Rob,

Thanks for the minimal, complete, example.

The parsers should cope with a UTF-8 BOM even if it's not recommended.

Could you raise a JIRA issue for this please (it's the new process!).
It'll need fixing in Jena and RIOT.

        Andy

On 17/12/10 11:42, Rob Vesse wrote:

Hi all

I had this issue reported to me recently and have been able to confirm
it myself (example data file attached). Essentially the issue is that if
a Turtle file has a BOM at the start then Jena will refuse to parse it
giving the following error:

Exception in thread "main" com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1, column 2. Encountered: "@" (64), after : "\ufeff"
    at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:44)
    at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:21)
    at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:101)
    at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:68)
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:226)
    at TurtleWithBOM.main(TurtleWithBOM.java:31)

The code I used to produce this error was as follows:

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;

import java.io.*;

public class TurtleWithBOM
{

    public static void main(String[] args)
    {

        // create an empty model
        Model model = ModelFactory.createDefaultModel();

        InputStream in = FileManager.get().open( "ttl-with-bom.ttl" );
        if (in == null)
        {
            throw new IllegalArgumentException( "File: ttl-with-bom.ttl not found");
        }

        // read the Turtle file
        model.read(in, "", "TTL");

        // write it to standard out
        model.write(System.out);
    }
}

A sample data file used with the above code to reproduce the error is
attached.
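
For anyone without the attachment, an equivalent file can be generated with a short sketch like the one below. The class name, method name, and sample triple are made up for illustration and are not the contents of the attached file; the only important detail is that the bytes EF BB BF precede the first `@prefix`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only: writes text as UTF-8 with a leading BOM.
final class BomFileWriter {

    // Invented sample data; any Turtle document starting with a directive will do.
    static final String SAMPLE_TURTLE =
            "@prefix ex: <http://example.org/> .\n"
          + "ex:subject ex:predicate ex:object .\n";

    static void writeWithBom(Path target, String text) throws IOException {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[3 + body.length];

        // UTF-8 encoding of U+FEFF, the byte order mark.
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;

        System.arraycopy(body, 0, out, 3, body.length);
        Files.write(target, out);
    }
}
```

Calling `BomFileWriter.writeWithBom(java.nio.file.Paths.get("ttl-with-bom.ttl"), BomFileWriter.SAMPLE_TURTLE)` should produce a file that the program above can be pointed at.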

The data files are coming from my software, which is all written in .NET,
and when outputting in UTF-8 the default behaviour of .NET is to include
the BOM at the start of the file. The BOM is not required for UTF-8 but
it is not forbidden, so I think this should be fixed (if possible) for
future releases. I will be modifying my software so that output of the
BOM can be disabled by my users if desired.

Looking at the error message given I expect that the same problem would
also affect N3 files since they are using the same reader afaict from
the error trace.

Regards,

Rob Vesse

--
PhD Student
IAM Group
Bay 20, Room 4027, Building 32
Electronics & Computer Science
University of Southampton
