[jira] [Commented] (STANBOL-765) Add support for importing Bnodes to the Jena TDB indexing source

Rupert Westenthaler (JIRA) Thu, 11 Oct 2012 04:35:11 -0700

    [ 
https://issues.apache.org/jira/browse/STANBOL-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474046#comment-13474046
 ]


Rupert Westenthaler commented on STANBOL-765:
---------------------------------------------

Entityhub Indexing Tool Documentation regarding the Blank Node Support of the 
Jena TDB Indexing Source.

## Jena TDB Indexing Source

### Blank Node Support

With [STANBOL-765](https://issues.apache.org/jira/browse/STANBOL-765) support 
for indexing RDF Blank Nodes (a.k.a. Anonymous Nodes or Bnodes) was added. 

#### IMPORTANT

Multiple imports of datasets with Bnodes will duplicate all RDF triples 
containing Bnodes! This is important to understand because the Entityhub 
Indexing Tool 

1. imports the RDF files in the import directory (default 
"indexing/resources/rdfdata/") on every call
2. does NOT delete the Jena TDB dataset on every call

Because of this it is important that users either

* remove already imported RDF files from the import directory OR
* delete the Jena TDB directory ("indexing/resources/tdb") before every call

Otherwise triples containing Bnodes will be stored n times in the Jena TDB 
store and will also appear n times in the indexed dataset (where n is the 
number of times the Entityhub Indexing Tool was called with the "index" 
parameter).
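
The second option can be scripted so that it is not forgotten between runs. The following is a minimal sketch, assuming the default directory layout named above; the helper `reset_tdb_store` is hypothetical and not part of the Entityhub Indexing Tool.

```python
import shutil
from pathlib import Path


def reset_tdb_store(indexing_root: str, tdb_subdir: str = "resources/tdb") -> bool:
    """Delete the Jena TDB directory below `indexing_root` if it exists.

    Returns True if a directory was removed, False if there was nothing
    to delete. The default sub-directory mirrors "indexing/resources/tdb".
    """
    tdb_dir = Path(indexing_root) / tdb_subdir
    if tdb_dir.is_dir():
        # Removing the whole directory forces a fresh import on the next run,
        # so triples with Bnodes are stored only once.
        shutil.rmtree(tdb_dir)
        return True
    return False
```

Calling `reset_tdb_store("indexing")` before each run with the "index" parameter would trade a longer import for a store without duplicated Bnode triples.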

#### Configuration

Indexing of Bnodes is deactivated by default! Two parameters are used by the 
RdfIndexingSource to control indexing of Bnodes:

1. *bnode* [true|false]: enables/disables the indexing of Bnodes. The default 
is false; however, if a 'bnode-prefix' is configured AND bnode is not defined, 
the default changes to true. If bnode=true and no 'bnode-prefix' is 
configured, then the default bnode-prefix is used. 
2. *bnode-prefix*: sets the URI prefix used for indexing Bnodes. The 
default is 'urn:bnode:{site-id}:'. If a bnode-prefix is configured, Bnodes are 
indexed as long as the bnode parameter is not explicitly set to false.
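
The interaction of the two parameters can be summarized in a few lines of code. This is only an illustrative sketch of the rules as described above, not the RdfIndexingSource implementation; the function name `resolve_bnode_config` and the way the site ID is passed in are assumptions.

```python
from typing import Dict, Optional, Tuple

# Default prefix as documented above, with the site ID as placeholder.
DEFAULT_PREFIX_TEMPLATE = "urn:bnode:{site_id}:"


def resolve_bnode_config(params: Dict[str, str], site_id: str) -> Tuple[bool, Optional[str]]:
    """Return (bnodes_enabled, effective_prefix) for the given parameters."""
    bnode = params.get("bnode")
    prefix = params.get("bnode-prefix")

    if bnode is None:
        # bnode not defined: enabled only if a prefix was configured.
        enabled = prefix is not None
    else:
        # An explicit value always wins, including an explicit "false".
        enabled = bnode.strip().lower() == "true"

    if not enabled:
        return False, None
    if prefix is None:
        # bnode=true without a prefix falls back to the default prefix.
        prefix = DEFAULT_PREFIX_TEMPLATE.format(site_id=site_id)
    return True, prefix
```

For example, `resolve_bnode_config({"bnode": "true"}, "mysite")` yields the default prefix for the hypothetical site "mysite", while an explicit `bnode` of `false` disables indexing even when a prefix is present.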


The configuration of the RdfIndexingSource is part of the "indexing.properties" 
file. The Jena TDB indexing source supports both "entityDataIterable" and 
"entityDataProvider" and can therefore be configured as the value for either 
key (depending on the current usage scenario). While the following examples 
show configurations for "entityDataIterable", the same values can also be used 
for the configuration of an "entityDataProvider".

A typical configuration that uses the default Bnode prefix looks like

    entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata,bnode:true

This will index all RDF files in the "indexing/resources/rdfdata" directory, 
including Bnodes. Bnodes will use the default URI prefix "urn:bnode:{site-id}:".

The following configuration sets a custom Bnode URI prefix:

    entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,bnode-prefix:urn:custom.bnode.prefix:
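
Note that the prefix value itself contains colons, so a key/value pair in such a line can only be split at the first colon. The sketch below shows one way to read a value of this shape; it is an assumption about the format based on the examples above, not the parser used by the indexing tool, and `parse_source_config` is a hypothetical name.

```python
from typing import Dict, Tuple


def parse_source_config(value: str) -> Tuple[str, Dict[str, str]]:
    """Split "class,key:value,key:value" into the class name and its parameters."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty configuration value")
    class_name = parts[0]
    params: Dict[str, str] = {}
    for part in parts[1:]:
        # Split at the FIRST colon only, so values such as
        # "urn:custom.bnode.prefix:" keep their own colons.
        key, _, val = part.partition(":")
        params[key] = val
    return class_name, params
```

Applied to the value of the line above, this returns the class name and a single parameter `bnode-prefix` with the value `urn:custom.bnode.prefix:`, including the trailing colon.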


#### Bnode URI prefix rules

When setting the 'bnode-prefix', users should follow these rules:

* The prefix MUST NOT be present in URIs of the RDF data. This is because 
the RdfIndexingSource needs to convert URIs back to Bnodes, and the 
'bnode-prefix' is used to determine whether a URI actually represents a Bnode 
in the Jena TDB dataset. If URI resources of the RDF data use the 
'bnode-prefix', the LDPath implementation of the RdfIndexingSource will not 
work correctly.
* The prefix SHOULD be unique for the indexed RDF data. While Bnodes use 
random IDs that are unlikely to clash, users need to consider that Bnode URIs 
of multiple ReferencedSites imported to a Stanbol Entityhub might clash if 
those sites use the same 'bnode-prefix'. In such cases Entity lookups that 
consider multiple ReferencedSites might return unexpected information. Because 
of this it is recommended to include the ID of the site in the 'bnode-prefix' 
(e.g. urn:bnode:{siteId}:).
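
The first rule follows directly from how a prefix-based mapping works. The following sketch illustrates the idea with plain string operations; the function names are hypothetical and the code is not taken from the RdfIndexingSource.

```python
from typing import Optional


def bnode_to_uri(prefix: str, bnode_id: str) -> str:
    """Build the URI under which a Bnode is indexed: prefix + Bnode ID."""
    return prefix + bnode_id


def uri_to_bnode(prefix: str, uri: str) -> Optional[str]:
    """Return the Bnode ID if `uri` starts with the prefix, else None."""
    if uri.startswith(prefix):
        return uri[len(prefix):]
    return None


prefix = "urn:bnode:mysite:"  # hypothetical site ID "mysite"
uri = bnode_to_uri(prefix, "b42")

# A converted Bnode can be mapped back to its ID ...
assert uri_to_bnode(prefix, uri) == "b42"
# ... and a regular URI is left alone.
assert uri_to_bnode(prefix, "http://example.org/resource") is None
# If the prefix also occurs in real URIs, those are mistaken for Bnodes.
assert uri_to_bnode("http://example.org/", "http://example.org/resource") == "resource"
```

The last assertion shows the failure case: with a prefix that matches URIs in the data, a normal resource would be looked up as a Bnode.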

                
> Add support for importing Bnodes to the Jena TDB indexing source
> ----------------------------------------------------------------
>
>                 Key: STANBOL-765
>                 URL: https://issues.apache.org/jira/browse/STANBOL-765
>             Project: Stanbol
>          Issue Type: New Feature
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> The Stanbol Entityhub does (intentionally) not support Bnodes (RDF blank 
> nodes) as those are not dereferenceable. Because of that the Jena TDB 
> indexing source has up to now ignored Bnodes both for subjects and objects - 
> basically, if a triple contains any Bnode it is skipped.
> While adding support for Bnodes to the Entityhub is not possible, it is 
> feasible to allow users to convert Bnode IDs to valid URIs by providing a 
> prefix (or base URI) in the configuration of the Jena TDB indexing source.
> If this configuration is present, Bnodes of the indexed RDF graph will be 
> converted to URIs by using "{bnode-prefix}{bnodId}". Users of this feature 
> need to be aware that they change the RDF graph (and make previously local 
> resources globally dereferenceable via the Entityhub RESTful API).
> There will be no default value for the bnode prefix. Users will need to 
> explicitly define it (e.g. in the indexing.properties file). If no bnode 
> prefix is configured, Bnodes will be skipped (current behavior). This also 
> ensures backward compatibility. 
> NOTE: This will not fix possible memory problems when importing RDF files 
> that include Bnodes into the Jena TDB indexing source, as Jena still needs 
> to keep a lookup table of all Bnodes referenced in the currently imported 
> file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
