Author: jpz6311whu
Date: Sun Aug 3 10:30:24 2014
New Revision: 1615400
URL: http://svn.apache.org/r1615400
Log:
Implementation Documentation for CSV PropertyTable [JENA 625]
Modified:
jena/site/trunk/content/documentation/csv/design.mdtext
jena/site/trunk/content/documentation/csv/implementation.mdtext
Modified: jena/site/trunk/content/documentation/csv/design.mdtext
URL:
http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/csv/design.mdtext?rev=1615400&r1=1615399&r2=1615400&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/csv/design.mdtext (original)
+++ jena/site/trunk/content/documentation/csv/design.mdtext Sun Aug 3 10:30:24
2014
@@ -9,7 +9,7 @@ The architecture of CSV PropertyTable ma

-### PropertyTable
+## PropertyTable
A `PropertyTable` is a collection of data that is sufficiently regular in shape
that it can be treated as a table.
That means each subject has a value for each one of the set of properties.
@@ -29,34 +29,34 @@ A `PropertyTable` should be constructed
1. Create `Columns` using `PropertyTable.createColumn()` for each `Column`
of the `PropertyTable`
2. Create `Rows` using `PropertyTable.createRow()` for each `Row` of the
`PropertyTable`
-3. For each `Row' created, set a value (`Node`) at the specified `Column`,
by calling `Row.setValue()`
+3. For each `Row` created, set a value (`Node`) at the specified `Column`,
by calling `Row.setValue()`
Once a `PropertyTable` is built, the tabular data within it can be accessed
through API methods such as `PropertyTable.getMatchingRows()`,
`PropertyTable.getColumnValues()`, etc.
-### GraphPropertyTable
+## GraphPropertyTable
`GraphPropertyTable` implements the
[Graph](https://svn.apache.org/repos/asf/jena/trunk/jena-core/src/main/java/com/hp/hpl/jena/graph/Graph.java)
interface (read-only) over a `PropertyTable`.
It is a subclass of
[GraphBase](https://svn.apache.org/repos/asf/jena/trunk/jena-core/src/main/java/com/hp/hpl/jena/graph/impl/GraphBase.java)
and implements `find()`.
-The `graphBaseFind()` method can choose the access route based on the find
arguments.
-It holds/wraps an reference of the `PropertyTable` instance, so that such a
graph can be treated in a more table-like fashion.
+The `graphBaseFind()` (for matching a `Triple`) and
`propertyTableBaseFind()` (for matching a whole `Row`) methods can choose the
access route based on the find arguments.
+`GraphPropertyTable` holds (wraps) a reference to the `PropertyTable` instance,
so that such a `Graph` can be treated in a more table-like fashion.
**Note:** Both `PropertyTable` and `GraphPropertyTable` are *NOT* restricted
to CSV data.
They are supposed to be compatible with any table-like data sources, such as
relational databases, Microsoft Excel, etc.
-### GraphCSV
+## GraphCSV
[GraphCSV](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/GraphCSV.java)
is a subclass of `GraphPropertyTable` aimed at CSV data.
Its constructor takes a CSV file path as its parameter, parses the file using a
CSV parser, and builds a `PropertyTable` through `PropertyTableBuilder`.
For CSV to RDF mapping, we establish some basic principles:
-#### Single-Value and Regular-Shaped CSV only
+### Single-Value and Regular-Shaped CSV only
In the [CSV-WG](https://www.w3.org/2013/csvw/wiki/Main_Page), it looks like
duplicate column names are not going to be supported. Therefore, we only
consider parsing single-valued CSV tables.
The current editor's working [draft](http://w3c.github.io/csvw/syntax/)
from the CSV on the Web Working Group is defining a more regular data model
for CSV.
This is the target for the CSV work of GraphCSV: tabular, regular-shaped CSV;
not arbitrary, irregularly shaped CSV.
-#### No Additional CSV Metadata
+### No Additional CSV Metadata
A CSV file with no additional metadata is directly mapped to RDF, which makes
it a simpler case than SQL-to-RDF work.
It's not necessary to have a defined primary column, similar to the primary
key of a database. The subject of the triple can be generated through one of:
@@ -64,11 +64,11 @@ It's not necessary to have a defined pri
1. The triples for each row have a blank node for the subject, e.g.
something like the illustration
2. The triples for row N have a subject URI which is `<FILE#_N>`.
-#### Data Type for Typed Literal
+### Data Type for Typed Literal
All the values in CSV are parsed as strings, line by line. As a better option
that the user can turn on, the datatype can be chosen dynamically: attempt
to parse each value as an integer (or decimal, double, date), and if parsing
succeeds, treat it as an integer (or decimal, double, date).
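
The dynamic choice described above could be sketched roughly as follows. This is only an illustration, not the jena-csv implementation: the class name `LiteralTypeGuess`, the method `guessDatatype()`, the regular expressions, and the order in which the types are tried are all assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical helper: guesses an XSD datatype name for one raw CSV cell.
final class LiteralTypeGuess {
    private static final Pattern INTEGER = Pattern.compile("[+-]?\\d+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?(\\d+\\.\\d*|\\.\\d+)");
    private static final Pattern DOUBLE =
            Pattern.compile("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)[eE][+-]?\\d+");

    // Tries integer, decimal, double, then date; falls back to plain string.
    static String guessDatatype(String cell) {
        String v = cell.trim();
        if (INTEGER.matcher(v).matches()) return "xsd:integer";
        if (DECIMAL.matcher(v).matches()) return "xsd:decimal";
        if (DOUBLE.matcher(v).matches()) return "xsd:double";
        try {
            LocalDate.parse(v); // accepts ISO-8601 calendar dates such as 2014-08-03
            return "xsd:date";
        } catch (DateTimeParseException e) {
            return "xsd:string";
        }
    }

    public static void main(String[] args) {
        String[] cells = {"8615246", "3.14", "1.5e3", "2014-08-03", "Southton"};
        for (String cell : cells) {
            System.out.println(cell + " -> " + guessDatatype(cell));
        }
    }
}
```

Trying the narrowest type first means a cell such as `42` is typed as an integer rather than as a decimal or double; anything that matches none of the patterns stays a string.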
-#### File Path as Namespace
+### File Path as Namespace
RDF requires that the subjects and the predicates are URIs. We need to pass in
the namespaces (or just the default namespaces) to make URIs by combining the
namespaces with the values in CSV.
We don't have metadata of the namespaces for the columns, but subjects can
be blank nodes, which is useful because each row is then a new blank node. For
predicates, suppose the URL of the CSV file is `file:///c:/town.csv`, then the
columns can be `<file:///c:/town.csv#Town>` and
`<file:///c:/town.csv#Population>`, as shown in the illustration.
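
As a minimal sketch of this naming scheme, the URIs can be formed by plain string concatenation. The class `CsvUris` and its two methods are hypothetical names for illustration only, and the sketch assumes that column names are already safe to use as URI fragments (no escaping is attempted).

```java
// Hypothetical helper: derives URIs from the CSV file URL used as the namespace.
final class CsvUris {
    // Predicate URI for a column, e.g. <file:///c:/town.csv#Town>.
    static String predicateUri(String fileUrl, String columnName) {
        return fileUrl + "#" + columnName;
    }

    // Subject URI for row N in the <FILE#_N> form.
    static String rowSubjectUri(String fileUrl, int rowNumber) {
        return fileUrl + "#_" + rowNumber;
    }

    public static void main(String[] args) {
        String file = "file:///c:/town.csv";
        System.out.println(predicateUri(file, "Town"));
        System.out.println(predicateUri(file, "Population"));
        System.out.println(rowSubjectUri(file, 1));
    }
}
```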
Modified: jena/site/trunk/content/documentation/csv/implementation.mdtext
URL:
http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/csv/implementation.mdtext?rev=1615400&r1=1615399&r2=1615400&view=diff
==============================================================================
--- jena/site/trunk/content/documentation/csv/implementation.mdtext (original)
+++ jena/site/trunk/content/documentation/csv/implementation.mdtext Sun Aug 3
10:30:24 2014
@@ -1 +1,41 @@
-Title: CSV PropertyTable - Implementation
\ No newline at end of file
+Title: CSV PropertyTable - Implementation
+
+## PropertyTable Implementations
+
+There are two implementations of `PropertyTable`. Their pros and cons are
summarised in the following table:
+
+PropertyTable Implementation | Description | Supported Indexes | Advantages | Disadvantages
+---------------------------- | ----------- | ----------------- | ---------- | -------------
+`PropertyTableArrayImpl` | implemented by a two-dimensional Java array of `Nodes` | SPO, PSO | compact memory usage; fast for querying with S and P; fast for querying a whole `Row` | slow for querying with O; table Row/Column size must be provided |
+`PropertyTableHashMapImpl` | implemented by several Java `HashMaps` | PSO, POS | fast for querying with O; table Row/Column size not required | more memory usage for `HashMaps` |
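
The trade-off in the table can be illustrated with a toy model. The sketch below does not use the jena-csv classes: `IndexLayouts`, its methods, and the use of `String` cells in place of `Node` are stand-ins chosen for the example. It contrasts a full scan over an array layout with a lookup in a value-to-rows map built for one column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two layouts; NOT the actual PropertyTable implementations.
final class IndexLayouts {
    // Array layout: finding rows by object value needs a scan over every row.
    static List<Integer> rowsWithValueArray(String[][] cells, int col, String value) {
        List<Integer> hits = new ArrayList<>();
        for (int r = 0; r < cells.length; r++) {
            if (value.equals(cells[r][col])) hits.add(r);
        }
        return hits;
    }

    // Map layout: a value -> rows index for one column, at the cost of extra memory.
    static Map<String, List<Integer>> buildValueIndex(String[][] cells, int col) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int r = 0; r < cells.length; r++) {
            index.computeIfAbsent(cells[r][col], k -> new ArrayList<>()).add(r);
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"Southton", "123000"},
            {"Northville", "654000"},
            {"Eastham", "123000"}
        };
        System.out.println(rowsWithValueArray(cells, 1, "123000"));
        System.out.println(buildValueIndex(cells, 1).get("123000"));
    }
}
```

Both calls find the same rows; the difference is that the map answers after a single hash lookup once the index exists, while the array version revisits every row on each query.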
+
+By default,
[PropertyTableArrayImpl](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/PropertyTableArrayImpl.java)
is used as the `PropertyTable` implementation held by `GraphCSV`.
+If you want to switch to
[PropertyTableHashMapImpl](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/PropertyTableHashMapImpl.java),
use the static method `GraphCSV.createHashMapImpl()` instead of the
default `new GraphCSV()` constructor.
+Here is an example:
+
+    Model model_csv_array_impl = ModelFactory.createModelForGraph(new GraphCSV(file)); // PropertyTableArrayImpl
+    Model model_csv_hashmap_impl = ModelFactory.createModelForGraph(GraphCSV.createHashMapImpl(file)); // PropertyTableHashMapImpl
+
+## StageGenerator Optimization for GraphPropertyTable
+
+Accessing the data from SPARQL via `Graph.find()` will work, but it's not ideal. Some
optimizations can be done for processing a SPARQL basic graph pattern. More
specifically, in the method `OpExecutor.execute(OpBGP, ...)`, when the target
of the query is a `GraphPropertyTable`, it can get a whole `Row`, or `Rows`,
of the table data and match the pattern with the bindings.
+
+The optimization of querying a whole `Row` in the `PropertyTable` is supported
now.
+The following query pattern can be transformed into a `Row` query, without
generating triples:
+
+    ?x :prop1 ?v .
+    ?x :prop2 ?w .
+    ...
+
+This is implemented through the `StageGenerator` extension point, because the
optimization is currently concerned only with `BasicPattern`.
+The detailed workflow is as follows:
+
+1. Split the incoming `BasicPattern` by subject, i.e. it becomes multiple
sub `BasicPatterns`, each grouped by the same subject (see
[QueryIterPropertyTable](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/QueryIterPropertyTable.java)).
+2. For each sub `BasicPattern`, if it contains more than one `Triple`
(i.e. at least 2 `Triples`), it's turned into a `Row` query and processed
by
[QueryIterPropertyTableRow](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/QueryIterPropertyTableRow.java);
otherwise, if it contains only 1 `Triple`, it falls back to traditional `Triple`
querying by `graph.graphBaseFind()`.
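
The two workflow steps above can be sketched in plain Java, without the ARQ classes. Everything here is an illustrative stand-in: `TriplePattern` is not Jena's `Triple`, and `splitBySubject()` and `route()` are hypothetical names; the real grouping in `QueryIterPropertyTable` may differ in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the workflow; NOT the jena-csv implementation.
final class SubjectGrouping {
    // Stand-in for a triple pattern, with terms kept as plain strings.
    static final class TriplePattern {
        final String subject;
        final String predicate;
        final String object;

        TriplePattern(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Step 1: group the patterns by subject, keeping first-seen subject order.
    static Map<String, List<TriplePattern>> splitBySubject(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> groups = new LinkedHashMap<>();
        for (TriplePattern t : bgp) {
            groups.computeIfAbsent(t.subject, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    // Step 2: more than one pattern per subject -> row query, else triple query.
    static String route(List<TriplePattern> group) {
        return group.size() > 1 ? "row" : "triple";
    }

    public static void main(String[] args) {
        List<TriplePattern> bgp = new ArrayList<>();
        bgp.add(new TriplePattern("?x", ":prop1", "?v"));
        bgp.add(new TriplePattern("?x", ":prop2", "?w"));
        bgp.add(new TriplePattern("?y", ":prop1", "?z"));
        for (Map.Entry<String, List<TriplePattern>> e : splitBySubject(bgp).entrySet()) {
            System.out.println(e.getKey() + " -> " + route(e.getValue()));
        }
    }
}
```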
+
+In order to turn on this optimization, we need to register the
[StageGeneratorPropertyTable](https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/org/apache/jena/propertytable/impl/StageGeneratorPropertyTable.java)
in the ARQ context before performing SPARQL queries:
+
+    StageGenerator orig = (StageGenerator)ARQ.getContext().get(ARQ.stageGenerator) ;
+    StageGenerator stageGenerator = new StageGeneratorPropertyTable(orig) ;
+    StageBuilder.setGenerator(ARQ.getContext(), stageGenerator) ;
+