On 24/05/14 14:32, Ying Jiang wrote:
Dear Andy,

I saw the discussion on JENA-699 about the CSV/TSV parser. It seems
that Apache Commons CSV would be a better choice for the future.
Therefore, I'm not strictly following the project plan in the proposal
[1], under which I was supposed to develop the CSV parser at the
beginning of the project.

It looks very good.

As I'm finding on the CSV working group, "CSV" is a somewhat broad catch-all term, ranging from using ";" as the separator (common in areas of the world where the decimal separator is ",") to fixed-width layout. We'll stick to RFC 4180 CSV files for now. There is going to be a revised spec at some time, but not soon.

One of the advantages of Apache Commons CSV, or other parsers, is the ability to cope with the variety out there. The CSV parser dropped in recently only handles comma-separated, properly escaped files.

(Honestly, it was quicker to write it than to investigate all the existing parsers! It was needed quickly for SPARQL test cases.)

Instead, I'm working on "2.1 RIOT Reader for CSV Files". Things are
going well so far. I have just added the new "LangCSV" and its unit test.
Please check the code committed just now. Any comments are welcome!

Slight problem:

--------------
col1, col2
abc,"23""4"
--------------

"23""4" is a CSV field using quotes and "" is an internal escaped double quote charcater - the base CSV parser deals with the quotes.

So it the token is the string 23"4
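
To make the quoting rule concrete, here is a minimal sketch of how one RFC 4180 field could be turned into its value. The class and method names (CsvFieldSketch, unquote) are made up for illustration; this is not the code in the Jena CSV parser, and it assumes the field has already been split out of the row.

```java
// Illustrative sketch only: CsvFieldSketch and unquote are hypothetical names,
// not part of Jena or of the CSV parser discussed in this thread.
final class CsvFieldSketch {

    /** Strips the enclosing quotes of an RFC 4180 field and collapses "" to ". */
    static String unquote(String raw) {
        boolean quoted = raw.length() >= 2
                && raw.charAt(0) == '"'
                && raw.charAt(raw.length() - 1) == '"';
        if (!quoted) {
            return raw; // unquoted field: taken literally
        }
        String inner = raw.substring(1, raw.length() - 1);
        // Inside a quoted field, two double quotes stand for one.
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // The field "23""4" from the example row.
        System.out.println(unquote("\"23\"\"4\""));
    }
}
```

For the field in the example row this gives the five-character string 23"4, which is what then reaches LangCSV.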

You call LangCSV.parse which in turn invokes the tokenizer for Turtle which then complains as 23"4 is a mess in Turtle.

There's no need to parse - either it's a string or a double (for now). It's not an RDF term with language, datatype etc. (the SPARQL results TSV format does do that).
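
The "string or double" decision can be sketched roughly as below. CellValueSketch and toValue are hypothetical names, and this is not necessarily how the fix in LangCSV is written; it only shows that no Turtle tokenizer is involved.

```java
// Illustrative sketch only: CellValueSketch and toValue are hypothetical names.
final class CellValueSketch {

    /** Returns a Double if the token parses as one, otherwise the token itself. */
    static Object toValue(String token) {
        try {
            return Double.valueOf(token.trim());
        } catch (NumberFormatException e) {
            return token; // anything that is not a number stays a plain string
        }
    }

    public static void main(String[] args) {
        System.out.println(toValue("3.14").getClass().getSimpleName());
        System.out.println(toValue("23\"4").getClass().getSimpleName());
    }
}
```

Note that Double.valueOf is fairly liberal (it also accepts inputs such as "NaN" or "1e3"), so a stricter check may be wanted in real code.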

Fix added - I also abused the parser's use of row/col for CSV errors.


In the next week, I'd like to complete 2.1, which means Jena can read
a ".csv" file into a Model.

As you do this, more tests are going to be needed to cover all the cases, both strange ones like the above and other situations: what happens when a column name has a space in it, or some other character that is not legal in a URI fragment? (Answer: %-encode it.)
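
As a rough illustration of the %-encoding, the sketch below keeps only the RFC 3986 "unreserved" characters and encodes every other UTF-8 byte. That is an assumption on my part and stricter than a URI fragment actually requires; ColumnNameSketch and percentEncode are hypothetical names, not Jena API.

```java
// Illustrative sketch only: ColumnNameSketch and percentEncode are hypothetical names.
final class ColumnNameSketch {

    /** Percent-encodes every byte that is not an RFC 3986 unreserved character. */
    static String percentEncode(String name) {
        StringBuilder out = new StringBuilder();
        for (byte b : name.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("col 1")); // col%201
    }
}
```

A test for this case could then check that a header such as "col 1" ends up as col%201 in the property URI.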

For testing the outcome of parsing, you can determine if two models are "the same" by using model.isIsomorphicWith(otherModel)

It returns true/false depending on whether there is a consistent renaming of bNodes from one model to the other (that's the isomorphism).

So a test can keep the right answer as a Turtle file, read it into a model, and compare it to the model parsed from the CSV file.

        Andy


Best regards,
Ying Jiang

[1] http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080

https://issues.apache.org/jira/browse/JENA-625

(I thought all accepted melange proposals became public automatically when accepted and the programme started. Maybe it'll happen soon.)
