Re: GSoC routine

Andy Seaborne Mon, 26 May 2014 03:07:22 -0700

On 24/05/14 14:32, Ying Jiang wrote:

Dear Andy,


I see the discussion of JENA-699 about the CSV/TSV parser. It seems
that Apache Commons CSV would be a better choice for the future.
Therefore, I'm not strictly following the project plan in the proposal
[1], under which I was supposed to develop the CSV parser at the
beginning of the project.


It looks very good.

As I'm finding on the CSV working group, "CSV" is a somewhat broad catch-all piece of terminology, ranging from using ";" for the separator (common in areas of the world where the decimal number separator is ",") to fixed width layout. We'll stick to RFC 4180 CSV files for now. There is going to be a revised spec at some time but not soon.

One of the advantages of Apache Commons CSV, or other parsers, is the ability to cope with the variety out there. The CSV parser dropped in recently only does comma separated, properly escaped files.

(Honestly, it was quicker to write it than to investigate all the existing parsers! It was needed quickly for SPARQL test cases.)

Instead, I'm working on "2.1 RIOT Reader for CSV Files". Things are
going well so far. I just added the new "LangCSV" and its unit test.
Please check the code committed just now. Any comments are welcome!


Slight problem:

--------------
col1, col2
abc,"23""4"
--------------

"23""4" is a CSV field using quotes and "" is an internal escaped double quote character - the base CSV parser deals with the quotes.


So the token is the string 23"4
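
To make the unquoting step concrete, here is a minimal sketch of how an RFC 4180 quoted field turns into its value. It is illustrative only, not the code in the Jena CSV parser; the class and method names (CsvQuotes, unescape) are made up for this example.

```java
// Illustrative sketch only: CsvQuotes and unescape are hypothetical names,
// not part of Jena or Apache Commons CSV.
final class CsvQuotes {
    /**
     * Turns one raw CSV field into its value: if the field is wrapped in
     * double quotes, the enclosing quotes are removed and each doubled
     * quote ("") inside becomes a single quote character.
     */
    static String unescape(String raw) {
        boolean quoted = raw.length() >= 2
                && raw.charAt(0) == '"'
                && raw.charAt(raw.length() - 1) == '"';
        if (!quoted) {
            return raw; // unquoted field: taken as-is
        }
        String inner = raw.substring(1, raw.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // The field from the example above: "23""4"
        System.out.println(unescape("\"23\"\"4\"")); // prints 23"4
    }
}
```

The result contains a bare double quote, which is why handing it to a Turtle tokenizer goes wrong.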

You call LangCSV.parse which in turn invokes the tokenizer for Turtle which then complains as 23"4 is a mess in Turtle.

There's no need to parse - either it's a string or a double (for now). It's not an RDF term with language, datatype etc. (the SPARQL results in TSV do that).
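
The string-or-double decision can be sketched roughly as below. This is an assumption about one simple way to do it, not the fix that was committed; CsvValue and toValue are hypothetical names. Note that Java's Double.valueOf is fairly liberal (it accepts forms such as "1e5", "NaN" or "23d"), so a real implementation may want a stricter check.

```java
// Illustrative sketch only: CsvValue and toValue are hypothetical names.
final class CsvValue {
    /**
     * Classifies an already-unquoted CSV field without any RDF term
     * parsing: a Double if the text parses as a number, otherwise the
     * original string unchanged.
     */
    static Object toValue(String field) {
        try {
            return Double.valueOf(field.trim());
        } catch (NumberFormatException e) {
            return field; // not a number: keep it as a plain string
        }
    }

    public static void main(String[] args) {
        System.out.println(toValue("3.14"));   // a Double
        System.out.println(toValue("23\"4"));  // stays a String
    }
}
```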


Fix added - I also abused the parser's use of row/col for CSV errors.


In the next week, I'd like to complete 2.1, which means Jena can read
a ".csv" file into a Model.

As you do this, more tests to push all the cases are going to be needed, both for more strange cases like the above, and for other situations, such as what happens when a column name has a space in it, or some other character that is not allowed in a URI fragment (answer - %-encode it).
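
A minimal sketch of the %-encoding idea, assuming the column name is encoded as UTF-8 and that only the RFC 3986 "unreserved" characters are left as they are. The exact set of characters to leave alone is a design choice, and ColumnNames / percentEncode are hypothetical names, not Jena API.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: ColumnNames and percentEncode are hypothetical names.
final class ColumnNames {
    /**
     * Percent-encodes a column name so it can be used as a URI fragment.
     * Assumption: everything outside A-Z a-z 0-9 - . _ ~ is encoded,
     * byte by byte, from the UTF-8 form of the name.
     */
    static String percentEncode(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("col 1")); // prints col%201
    }
}
```

Column names like "col 1" and "a#b" make good test cases, since both would otherwise produce a broken property URI.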

For testing the outcome of parsing, you can determine if two models are "the same" by using model.isIsomorphicWith(otherModel)

It returns true/false depending on whether there is a consistent renaming of bNodes from one model to the other (that's the isomorphism).

So a test can have the right answer as a Turtle model, and compare it to the model parsed from the CSV file.


        Andy


Best regards,
Ying Jiang

[1] 
http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080


https://issues.apache.org/jira/browse/JENA-625

(I thought all accepted melange proposals became public automatically when accepted and the programme started. Maybe it'll happen soon.)
