Re: GSoC routine

Ying Jiang Sat, 07 Jun 2014 06:27:08 -0700

Dear Andy,

I've just committed the API of PropertyTable along with its
implementations [1]. The code follows the original design in the
project proposal (pasted below), which can accommodate other regular
data besides CSV files:
------
1.2 PropertyTable
A PropertyTable is a collection of data that is sufficiently regular
in shape that it can be treated as a table. That means each subject
has a value for each one of the set of properties. Irregularity in
terms of missing values needs to be handled, but not multiple values
for the same property. With special storage, a PropertyTable
- is more compact and more amenable to custom storage (e.g. a JSON
document store)
- can have custom indexes on specific columns
- can guarantee access orders
Providing these features is out of scope of the project but the
architecture of the work must be mindful of these possibilities.
For this project, PropertyTable is designed to be a table of RDF
terms, or Nodes in Jena. The interface should provide methods for
getting Nodes by subject or by property-value. Using the CSV Parser,
which generates tuples of RDF terms, I can add code that takes a
tabular CSV file and creates a PropertyTable (concretely, its
implementation PropertyTableImpl) using PropertyTableBuilder. To
support testing, there should also be a path from RDF tuples to
PropertyTable through RDF Tuple I/O [5].
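As an illustration only (not part of the proposal text and not the
committed jena-csv API), the table shape described in 1.2 could be
sketched in plain Java as below. String stands in for Jena's Node, and
the class and method names (SimplePropertyTable, put, get,
subjectsWith) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: String stands in for Jena's Node, and the names
// here are illustrative, not the committed PropertyTable API.
class SimplePropertyTable {
    // subject (row) -> (property (column) -> single value)
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    // Set the one value for (subject, property). A second call overwrites,
    // so multiple values for the same property cannot occur.
    void put(String subject, String property, String value) {
        rows.computeIfAbsent(subject, s -> new LinkedHashMap<>()).put(property, value);
    }

    // Lookup by subject: empty when the cell (or the whole row) is missing.
    Optional<String> get(String subject, String property) {
        Map<String, String> row = rows.get(subject);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(property));
    }

    // Lookup by property-value: all subjects whose property has this value.
    List<String> subjectsWith(String property, String value) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (value.equals(e.getValue().get(property))) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

A missing value shows up as an empty Optional rather than an error,
which is one simple way to handle the irregularity mentioned above.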
1.3 GraphPropertyTable
GraphPropertyTable implements the Graph interface (read-only) over a
PropertyTable. It is a subclass of GraphBase and implements find().
The find() method needs to choose the access route based on the find
arguments. The graph will offer the PropertyTable interface, or an
appropriate subset or variation, so that such a graph can be treated
in a more table-like fashion.
GraphCSV is a subclass of GraphPropertyTable aimed at CSV, powered by
the CSV Parser.
-----
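
To make "choose the access route based on the find arguments" concrete,
here is a minimal plain-Java sketch. It is not the committed
GraphPropertyTable code: Strings stand in for Jena Nodes, null plays
the role of a wildcard, and the names (TableFindSketch, put, find,
scanRow) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of route selection in find(); not the Jena API.
class TableFindSketch {
    // subject -> (property -> value)
    final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    void put(String s, String p, String o) {
        rows.computeIfAbsent(s, k -> new LinkedHashMap<>()).put(p, o);
    }

    // Returns matching triples as {subject, property, value} arrays.
    // A null argument matches anything.
    List<String[]> find(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        if (s != null) {
            // Route 1: subject is bound, so fetch the single row directly.
            Map<String, String> row = rows.get(s);
            if (row != null) scanRow(s, row, p, o, out);
        } else {
            // Route 2: subject is a wildcard, so visit every row.
            for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
                scanRow(e.getKey(), e.getValue(), p, o, out);
            }
        }
        return out;
    }

    private static void scanRow(String s, Map<String, String> row,
                                String p, String o, List<String[]> out) {
        if (p != null) {
            // Property is bound: one cell lookup instead of a row scan.
            String v = row.get(p);
            if (v != null && (o == null || o.equals(v))) {
                out.add(new String[] {s, p, v});
            }
        } else {
            // Property is a wildcard: check every cell in the row.
            for (Map.Entry<String, String> c : row.entrySet()) {
                if (o == null || o.equals(c.getValue())) {
                    out.add(new String[] {s, c.getKey(), c.getValue()});
                }
            }
        }
    }
}
```

A custom index on a column, as mentioned in the design, would add a
third route for the case where only the property and value are bound.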


GraphPropertyTable and GraphCSV have also been implemented. Please
check the test case [2], which realizes the example from the project
proposal that performs SPARQL querying over a GraphCSV.

For now, the API of PropertyTable is enough for performing SPARQL
querying. If advanced features are required in the future, it is
possible to add methods to PropertyTable, Row or Column. For
example, is PropertyTable supposed to be mutable or read-only? Are
there any other suggestions for the API, e.g. for an SQL database?

In the next steps, I'd like to refine the code and add more tests, for
more robust CSV parsing and SPARQL querying, especially for the
problems you pointed out in your previous email on 26th May.

Cheers,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/
[2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/test/java/com/hp/hpl/jena/propertytable/impl/GraphCSVTest.java

On Wed, Jun 4, 2014 at 11:07 PM, Andy Seaborne wrote:
> Ying,
>
> The next part of the project is the property tables which are compact
> storage for CSV files that exploit the regular structure of the data.
>
> These will be useful both for CSV files but potentially for other uses
> storing regular data (outside this project).  These are what an SQL database
> would call .... a "table" :-)
>
> For the "Design PropertyTable" item, it would be really good to write up and
> share a design on this list.
>
>         Andy
>
>
> On 26/05/14 11:06, Andy Seaborne wrote:
>>
>> On 24/05/14 14:32, Ying Jiang wrote:
>>>
>>> Dear Andy,
>>>
>>> I see the discussion of JENA-699 about the CSV/TSV parser. It seems
>>> that Apache Commons CSV would be a better choice for the future.
>>> Therefore, I'm not strictly following the project plan in the proposal
>>> [1], under which I was supposed to develop the CSV parser at the
>>> beginning of the project.
>>
>>
>> It looks very good.
>>
>> As I'm finding on the CSV working group, "CSV" is a somewhat broad
>> catch-all piece of terminology, ranging from using ";" for the separator
>> (common in areas of the world where the decimal number separator is ",")
>> to fixed width layout.  We'll stick to RFC 4180 CSV files for now. There
>> is going to be a revised spec at some time but not soon.
>>
>> One of the advantages of Apache Commons CSV, or other parsers, is the
>> ability to cope with the variety out there.  The CSV parser dropped in
>> recently only does comma separated, properly escaped files.
>>
>> (Honestly, it was quicker to write it than investigate all the existing
>> parsers! It was needed quickly for SPARQL test cases.)
>>
>>> Instead, I'm working on "2.1 RIOT Reader for CSV Files". Things have
>>> gone well so far. I just added the new "LangCSV" and its unit test.
>>> Please check the code committed just now. Any comments are welcome!
>>
>>
>> Slight problem:
>>
>> --------------
>> col1, col2
>> abc,"23""4"
>> --------------
>>
>> "23""4" is a CSV field using quotes and "" is an internal escaped double
>> quote character - the base CSV parser deals with the quotes.
>>
>> So the token is the string 23"4
>>
>> You call LangCSV.parse which in turn invokes the tokenizer for Turtle
>> which then complains as 23"4 is a mess in Turtle.
>>
>> There's no need to parse - either it's a string or a double (for now).
>> It's not an RDF term with language, datatype etc. (the SPARQL TSV
>> results format does do that).
>>
>> Fix added - I also abused the parser's use of row/col for CSV errors.
>>
>>>
>>> In the next week, I'd like to complete 2.1, which means Jena can read
>>> ".csv" file into Model.
>>
>>
>> As you do this, more tests covering all the cases are going to be needed,
>> both for stranger cases like the above and for other situations, such as
>> what happens when a column name has a space in it, or some other
>> character not allowed in a URI fragment (answer - %-encode it).
>>
>> For testing the outcome of parsing, you can determine if two models are
>> "the same" by using model.isIsomorphicWith(otherModel)
>>
>> It returns true/false depending on whether there is a consistent
>> renaming of bNodes from one model to the other (that's the isomorphism).
>>
>> So testing can have the right answer as a Turtle model, and compare it
>> to the parsed CSV file.
>>
>>      Andy
>>
>>>
>>> Best regards,
>>> Ying Jiang
>>>
>>> [1]
>>>
>>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>>>
>>
>> https://issues.apache.org/jira/browse/JENA-625
>>
>> (I thought all accepted melange proposals became public automatically
>> when accepted and the programme started.  Maybe it'll happen soon.)
>>
>
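
For reference, the quoted-field case from the 26 May message above
("23""4" yielding the string 23"4) can be sketched as follows. This is
only an illustration of RFC 4180 quote handling for a single field that
has already been split out of its row; it is not the parser in Jena,
and the names CsvField and unquote are hypothetical.

```java
// Hypothetical helper, not part of Jena: decodes one RFC 4180 field
// that has already been separated from the rest of the row.
class CsvField {
    // Strip the enclosing double quotes, if present, and turn each
    // doubled quote "" inside them into a single ".
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            String inner = raw.substring(1, raw.length() - 1);
            return inner.replace("\"\"", "\"");
        }
        // Unquoted fields are taken as-is.
        return raw;
    }
}
```

The result is a plain string, so it can be used directly as a literal
value without passing it through the Turtle tokenizer.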
