Re: GSoC routine

Andy Seaborne Mon, 09 Jun 2014 05:59:26 -0700

Hi there,

I think I see what's going on but could you write some javadoc for the interfaces please?

RDF blank nodes have their own rules and requirements. Each row is supposed to be allocated a blank node as its subject, and a later access may use that blank node. In the general case, and the WG is working on this, the subject might even be a URI calculated from the row.


I'm afraid you can't do:

PropertyTableImpl::getTripleIterator(Column)
Node s  = NodeFactory.createAnon(AnonId.create( "_:"+rowNum  ));

(although it's very, very useful to have that capability for debugging! The LabelToNode code in ARQ, with a new policy, will help you to abstract this.)

because different tables end up with the same bNode for row 1, etc. You'll need to create a subject bNode (and let Jena choose the label so it's unique for every table read) as the table is read.
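
A minimal sketch of the idea in plain Java, so it runs without Jena on the classpath. RowSubjects is a hypothetical class, not part of jena-csv, and the UUID label only stands in for "let the system choose the label"; in Jena itself the no-argument NodeFactory.createAnon() would play that role.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical helper: one instance per table read.
class RowSubjects {
    // labels.get(rowNum) is the subject label allocated for that row
    private final List<String> labels = new ArrayList<>();

    // Call once per row while the table is being read.
    // The label is generated, never derived from the row number,
    // so row 0 of one table cannot collide with row 0 of another.
    String allocate() {
        String label = UUID.randomUUID().toString();
        labels.add(label);
        return label;
    }

    // Later accesses to the same row get the same subject back.
    String subjectOf(int rowNum) {
        return labels.get(rowNum);
    }

    int size() {
        return labels.size();
    }
}
```

Reading two tables with two RowSubjects instances gives different subjects for their first rows, while repeated lookups of one row within a table stay stable.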

Then for the find operation, it would be good if access by subject could pick a row directly, rather than scanning the property table. This is one case of having additional indexes into the data. Is the design suitable for this kind of access?
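
One way to picture such an index, as a sketch only: a map from subject to row number kept alongside the rows. SubjectIndexedTable and its methods are hypothetical names, and strings stand in for Jena Nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical table with a subject -> row number index.
class SubjectIndexedTable {
    private final List<String[]> rows = new ArrayList<>();
    private final Map<String, Integer> rowBySubject = new HashMap<>();

    // Store the row and remember where it went.
    void addRow(String subject, String... values) {
        rowBySubject.put(subject, rows.size());
        rows.add(values);
    }

    // Direct lookup: one hash probe instead of a scan over all rows.
    // Returns null when the subject is not in the table.
    String[] findBySubject(String subject) {
        Integer rowNum = rowBySubject.get(subject);
        return rowNum == null ? null : rows.get(rowNum);
    }
}
```

The cost is one extra map entry per row, which is the usual space-for-time trade of an additional index.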


Discussion points:

1/

For now, the API of PropertyTable is enough for performing SPARQL
querying. If the advanced features are required in future, it's
possible to add some methods to PropertyTable, Row or Column. For
example, is PropertyTable supposed to be mutable or read-only? Any
other suggestion for the API, e.g. for an SQL database?

Yes, because it provides find() it can perform SPARQL queries, but it may be quite slow as it can involve a whole table scan. Could you make some suggestions as to how to exploit the data structures so SPARQL can execute efficiently on CSV data? For example, is there a way that SUM/COUNT and the other aggregates on a column can be handled? (At this stage, it's design work not implementation - it will help verify that the implementation in PropertyTableImpl.java has the right data structures.)
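
To make the aggregate question concrete, here is a sketch of what column-wise access could allow. DoubleColumn is a hypothetical class, not the existing Column interface, and it assumes the column's values have already been parsed as doubles, with null marking a missing cell.

```java
// Hypothetical column of numeric cells; null marks a missing value.
class DoubleColumn {
    private final Double[] cells;

    DoubleColumn(Double... cells) {
        this.cells = cells.clone();
    }

    // COUNT over the column: only cells that have a value are counted.
    long count() {
        long n = 0;
        for (Double cell : cells) {
            if (cell != null) n++;
        }
        return n;
    }

    // SUM over the column: missing cells are skipped; an empty column sums to 0.
    double sum() {
        double total = 0.0;
        for (Double cell : cells) {
            if (cell != null) total += cell;
        }
        return total;
    }
}
```

Both aggregates walk a single array once and never build triples, which is the kind of saving a find()-based evaluation cannot get.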

2/

I'd like to get your opinion on storing a property table as a Java array, indexed by row number, because an array is more compact than a hash map. (I'm trying to understand if the space taken by PropertyTableImpl can be reduced - if we think of reading in either large tables or many tables, or both(!!!), then space can become an issue.)
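
As a starting point for that discussion, a sketch of array-backed storage. ArrayPropertyTable is a hypothetical name, strings stand in for Nodes, and the initial capacity of 16 rows is an arbitrary choice.

```java
import java.util.Arrays;

// Hypothetical table stored as one flat array in row-major order.
class ArrayPropertyTable {
    private final int width;   // number of columns
    private String[] cells;    // cell (row, col) lives at row * width + col
    private int rowCount = 0;

    ArrayPropertyTable(int width) {
        this.width = width;
        this.cells = new String[width * 16];
    }

    // Appends a row and returns its row number.
    int addRow(String... values) {
        if (values.length != width) {
            throw new IllegalArgumentException("expected " + width + " values");
        }
        int needed = (rowCount + 1) * width;
        if (needed > cells.length) {
            // Grow by doubling so appends stay cheap on average.
            cells = Arrays.copyOf(cells, Math.max(needed, cells.length * 2));
        }
        System.arraycopy(values, 0, cells, rowCount * width, width);
        return rowCount++;
    }

    String get(int row, int col) {
        if (row < 0 || row >= rowCount || col < 0 || col >= width) {
            throw new IndexOutOfBoundsException();
        }
        return cells[row * width + col];
    }
}
```

There is no per-entry object as in a hash map, only the array slots themselves; a missing value is simply a null slot.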


        Andy


On 07/06/14 14:26, Ying Jiang wrote:

Dear Andy,

I've just committed the API of PropertyTable with the implementations
[1]. The code follows the original design (which I paste below) in
the project proposal, which can accommodate other regular data besides
CSV files:
------
1.2 PropertyTable
A PropertyTable is a collection of data that is sufficiently regular in
shape it can be treated as a table. That means each subject has a
value for each one of the set of properties. Irregularity in terms of
missing values needs to be handled but not multiple values for the
same property. With special storage, a PropertyTable
- is more compact and more amenable to custom storage (e.g. a JSON
document store)
- can have custom indexes on specific columns
- can guarantee access orders
Providing these features is out of scope of the project but the
architecture of the work must be mindful of these possibilities.
For this project, PropertyTable is designed to be a table of RDF
terms, or Nodes in Jena. The interface should provide the methods of
getting Nodes by subject or property-value. Using the CSV Parser that
generates tuples of RDF terms, I can add code to take a tabular CSV
file, and create a PropertyTable, or literally its implementation of
PropertyTableImpl, using PropertyTableBuilder. To support testing,
there should be RDF tuples to PropertyTable path through RDF Tuple I/O
[5] as well.
1.3 GraphPropertyTable
GraphPropertyTable implements the Graph interface (readonly) over a
PropertyTable. This is subclassed from GraphBase and implements find().
The find() method needs to choose the access route based on the find
arguments. This will offer the PropertyTable interface, or an
appropriate subset or variation, so that such a graph can be treated
in a more table-like fashion.
GraphCSV is a subclass of GraphPropertyTable aimed at CSV, powered by
the CSV Parser.
-----

GraphPropertyTable and GraphCSV have also been implemented. Please
check the test case [2], which realizes the example from the project
proposal that performs SPARQL querying over a GraphCSV.

For now, the API of PropertyTable is enough for performing SPARQL
querying. If the advanced features are required in future, it's
possible to add some methods to PropertyTable, Row or Column. For
example, is PropertyTable supposed to be mutable or read-only? Any
other suggestion for the API, e.g. for an SQL database?

In the next steps, I'd like to refine the code and add more tests, for
more robust CSV parsing and SPARQL querying, especially for the
problems you pointed out in your previous email on 26th May.

Cheers,
Ying Jiang

[1]
https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/
[2]
https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/test/java/com/hp/hpl/jena/propertytable/impl/GraphCSVTest.java

On Wed, Jun 4, 2014 at 11:07 PM, Andy Seaborne <[email protected]> wrote:

Ying,

The next part of the project is the property tables which are compact
storage for CSV files that exploit the regular structure of the data.

These will be useful both for CSV files but potentially for other uses
storing regular data (outside this project).  These are what an SQL database
would call .... a "table" :-)

For the "Design PropertyTable" item, it would be really good to write up and
share a design on this list.

         Andy


On 26/05/14 11:06, Andy Seaborne wrote:


On 24/05/14 14:32, Ying Jiang wrote:


Dear Andy,

I see the discussion of JENA-699 about the CSV/TSV parser. It seems
that Apache Commons CSV would be a better choice for the future.
Therefore, I'm not strictly following the project plan in the proposal
[1], in which I was supposed to develop the CSV parser at the beginning
of the project.



It looks very good.

As I'm finding on the CSV working group, "CSV" is a somewhat broad
catch-all piece of terminology, ranging from using ";" for the separator
(common in areas of the world where the decimal number separator is ",")
to fixed width layout.  We'll stick to RFC 4180 CSV files for now. There
is going to be a revised spec at some time but not soon.

One of the advantages of Apache Commons CSV, or other parsers, is the
ability to cope with the variety out there.  The CSV parser dropped in
recently only does comma separated, properly escaped files.

(Honestly, it was quicker to write it than investigate all the existing
parsers! It was needed quickly for SPARQL test cases.)

Instead, I'm working on  "2.1 RIOT Reader for CSV Files". Things are
going well so far. I just added the new "LangCSV" and its unit test.
Please check the code committed just now. Any comments are welcome!



Slight problem:

--------------
col1, col2
abc,"23""4"
--------------

"23""4" is a CSV field using quotes and "" is an internal escaped double
quote character - the base CSV parser deals with the quotes.

So the token is the string 23"4

You call LangCSV.parse which in turn invokes the tokenizer for Turtle
which then complains as 23"4 is a mess in Turtle.

There's no need to parse - either it's a string or a double (for now).
It's not an RDF term with language, datatype etc (the SPARQL results
in TSV does do that)
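
A small sketch of the two steps described above: strip the CSV quoting, then decide between string and double without going through the Turtle tokenizer. CsvField is a hypothetical helper, not the parser in Jena. Note that Double.parseDouble is more permissive than one might want (it accepts forms such as "NaN" or "1e3"), so treat the check as an assumption rather than the rule LangCSV applies.

```java
// Hypothetical helper for a single CSV field.
class CsvField {
    // Removes the enclosing quotes of a quoted field and turns each
    // doubled quote "" back into a single quote character.
    // Unquoted fields are returned unchanged.
    static String unquote(String raw) {
        boolean quoted = raw.length() >= 2
                && raw.charAt(0) == '"'
                && raw.charAt(raw.length() - 1) == '"';
        if (!quoted) {
            return raw;
        }
        return raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
    }

    // True if the field value can be read as a double; otherwise the
    // caller would keep it as a plain string.
    static boolean isDouble(String value) {
        try {
            Double.parseDouble(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

For the example row, unquoting the second field gives the string 23"4, which is not a double and so stays a plain string.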

Fix added - I also abused the parser's use of row/col for CSV errors.


In the next week, I'd like to complete 2.1, which means Jena can read
".csv" file into Model.



As you do this, more tests to push all the cases are going to be needed,
both for more strange cases like the above, and other situations
including what happens when a column name has a space in it?  Or other
non-URI fragment character in it (answer - %-encode it).
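
As an illustration of the %-encoding, a sketch that keeps only the RFC 3986 unreserved characters and encodes everything else byte by byte as UTF-8. ColumnNames is a hypothetical helper, and the character set is an assumption that is stricter than what a URI fragment actually permits.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning a column name into a URI fragment.
class ColumnNames {
    static String encode(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                sb.append((char) c);
            } else {
                // Everything else becomes %XX, one escape per UTF-8 byte.
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }
}
```

A column called "first name" becomes first%20name. java.net.URLEncoder is not used here because it writes a space as "+", which is form encoding rather than %-encoding.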

For testing the outcome of parsing, you can determine if two models are
"the same" by using model.isIsomorphicWith(otherModel)

It returns true/false depending on whether there is a consistent
renaming of bNodes from one model to the other (that's the isomorphism).

So testing can have the right answer as a Turtle model, and compare it
to the parsed CSV file.

      Andy


Best regards,
Ying Jiang

[1]

http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080


https://issues.apache.org/jira/browse/JENA-625

(I thought all accepted melange proposals became public automatically
when accepted and the programme started.  Maybe it'll happen soon.)
