[
https://issues.apache.org/jira/browse/STANBOL-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474046#comment-13474046
]
Rupert Westenthaler commented on STANBOL-765:
---------------------------------------------
Entityhub Indexing Tool Documentation regarding the Blank Node Support of the
Jena TDB Indexing Source.
## Jena TDB Indexing Source
### Blank Node Support
With [STANBOL-765](https://issues.apache.org/jira/browse/STANBOL-765), support
for indexing RDF Blank Nodes (a.k.a. Anonymous Nodes or Bnodes) was added.
#### IMPORTANT
Multiple imports of datasets with Bnodes will duplicate all RDF triples
containing Bnodes! Blank nodes have no stable identifier across imports, so
every re-import creates new Bnodes instead of matching the existing ones.
This is important to understand because the Entityhub Indexing Tool

1. imports the RDF files in the import directory (default
"indexing/resources/rdfdata/") on every call, and
2. does NOT delete the Jena TDB dataset between calls.

Because of this, users must either

* remove already imported RDF files from the import directory, OR
* delete the Jena TDB directory ("indexing/resources/tdb") before every call.

Otherwise, triples containing Bnodes will be present n times in the Jena TDB
store and will also appear n times in the indexed dataset (where n is the
number of times the Entityhub Indexing Tool was called with the "index"
parameter).
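
The second option above can be scripted. The following is a minimal sketch, not part of the Entityhub Indexing Tool: the function name `reset_tdb_store` and the `indexing_root` argument (the directory that contains the "indexing" folder) are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def reset_tdb_store(indexing_root: str) -> bool:
    """Delete the Jena TDB directory below `indexing_root`.

    Returns True if a directory was removed and False if there was
    nothing to delete. `indexing_root` is a hypothetical parameter: the
    directory that contains the "indexing" folder of the tool.
    """
    tdb_dir = Path(indexing_root) / "indexing" / "resources" / "tdb"
    if tdb_dir.is_dir():
        # Remove the whole store so that the next "index" call re-imports
        # the RDF files into an empty dataset (no duplicated Bnode triples).
        shutil.rmtree(tdb_dir)
        return True
    return False
```

Calling `reset_tdb_store(".")` before each run with the "index" parameter would keep the store from accumulating duplicates, at the cost of re-importing all RDF files every time.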
#### Configuration
Indexing of Bnodes is deactivated by default! Two parameters are used by the
RdfIndexingSource to control indexing of Bnodes:
1. *bnode* [true|false]: enables/disables the indexing of Bnodes. The default is
false; however, if a 'bnode-prefix' is configured AND bnode is not defined, the
default changes to true. If bnode=true and no 'bnode-prefix' is configured,
the default bnode-prefix is used.
2. *bnode-prefix*: sets the URI prefix used for indexing Bnodes. The
default is 'urn:bnode:{site-id}:'. If a bnode-prefix is configured, Bnodes are
indexed as long as the bnode parameter is not explicitly set to false.
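
The interplay of the two parameters can be summarized as a small decision function. This is only a sketch of the rules described above, not the actual RdfIndexingSource code; the function name `resolve_bnode_settings` and the dictionary-based parameter passing are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# Default prefix as documented above; {site_id} stands for the site ID.
DEFAULT_PREFIX_TEMPLATE = "urn:bnode:{site_id}:"


def resolve_bnode_settings(params: Dict[str, str],
                           site_id: str) -> Tuple[bool, Optional[str]]:
    """Return (bnodes_enabled, effective_prefix) for the given parameters."""
    bnode = params.get("bnode")
    prefix = params.get("bnode-prefix")

    if bnode is None:
        # bnode not defined: enabled only if a prefix was configured.
        enabled = prefix is not None
    else:
        # An explicit value always wins, also over a configured prefix.
        enabled = bnode.strip().lower() == "true"

    if not enabled:
        return False, None
    if prefix is None:
        # bnode=true without a prefix: fall back to the default prefix.
        prefix = DEFAULT_PREFIX_TEMPLATE.format(site_id=site_id)
    return True, prefix
```

For example, `resolve_bnode_settings({"bnode": "false", "bnode-prefix": "urn:x:"}, "mysite")` disables Bnode indexing even though a prefix is present, because the explicit `bnode` value takes precedence.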
The configuration of the RdfIndexingSource is part of the "indexing.properties"
file. The Jena TDB indexing source supports both "entityDataIterable" and
"entityDataProvider" and can therefore be configured as the value for either
key (depending on the usage scenario). While the following examples show
configurations for "entityDataIterable", the same values can also be used to
configure an "entityDataProvider".
A typical configuration that uses the default Bnode prefix looks like this:

    entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata,bnode:true

This will index all RDF files in the "indexing/resources/rdfdata" directory,
including Bnodes. Bnodes will use the default URI prefix "urn:bnode:{site-id}:".

The following configuration sets a custom Bnode URI prefix:

    entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,bnode-prefix:urn:custom.bnode.prefix:
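
Note that the custom prefix itself contains colons, including the trailing one. The sketch below illustrates one way to read such a value: it assumes comma-separated parameters that are split at the first colon only. This is an assumption made for illustration and not necessarily how the Entityhub Indexing Tool parses the file; `parse_source_config` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple


def parse_source_config(value: str) -> Tuple[str, Dict[str, Optional[str]]]:
    """Split 'ClassName,key:value,key:value' into class name and parameters."""
    class_name, *raw_params = [part.strip() for part in value.split(",")]
    params: Dict[str, Optional[str]] = {}
    for raw in raw_params:
        if not raw:
            continue
        # Split at the FIRST colon only, so that values such as
        # 'urn:custom.bnode.prefix:' keep their own colons.
        key, sep, val = raw.partition(":")
        params[key] = val if sep else None
    return class_name, params
```

Under this assumption, the second example above yields the parameter `bnode-prefix` with the value `urn:custom.bnode.prefix:`, trailing colon included.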
#### Bnode URI prefix rules
When setting the 'bnode-prefix', users should follow these rules:
* The prefix MUST NOT be present in URIs of the RDF data. This is because
the RdfIndexingSource needs to convert URIs back to Bnodes, and the
'bnode-prefix' is used to determine whether a URI actually represents a Bnode
in the Jena TDB dataset. If URI resources of the RDF data use the
'bnode-prefix', the LDPath implementation of the RdfIndexingSource will not
work correctly.
* The prefix SHOULD be unique for the indexed RDF data. While Bnodes use
random IDs that are unlikely to clash, users need to consider that Bnode URIs
of multiple ReferencedSites imported to the same Stanbol Entityhub might clash
if those sites use the same 'bnode-prefix'. In such cases, Entity lookups
across multiple ReferencedSites might return unexpected information. Because
of this, it is recommended to include the ID of the site in the
'bnode-prefix' (e.g. urn:bnode:{siteId}:).
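
The first rule follows from how a prefix-based mapping works in both directions. The following sketch shows the idea and a simple pre-check for conflicting URIs. All three function names are hypothetical and the code is not taken from the RdfIndexingSource.

```python
from typing import Iterable, List, Optional


def bnode_to_uri(prefix: str, bnode_id: str) -> str:
    """Build the URI under which a Bnode gets indexed."""
    return prefix + bnode_id


def uri_to_bnode_id(prefix: str, uri: str) -> Optional[str]:
    """Map a URI back to a Bnode ID, or None if it is a regular URI.

    The prefix is the only signal available here, which is why regular
    URIs of the RDF data must never start with it.
    """
    if uri.startswith(prefix):
        return uri[len(prefix):]
    return None


def find_prefix_conflicts(prefix: str, data_uris: Iterable[str]) -> List[str]:
    """Return the URIs of the RDF data that would be mistaken for Bnodes."""
    return [uri for uri in data_uris if uri.startswith(prefix)]
```

Running `find_prefix_conflicts` over the URI resources of a dataset before indexing gives a quick indication of whether the chosen prefix is safe. An empty result means no regular URI would be interpreted as a Bnode.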
> Add support for importing Bnodes to the Jena TDB indexing source
> ----------------------------------------------------------------
>
> Key: STANBOL-765
> URL: https://issues.apache.org/jira/browse/STANBOL-765
> Project: Stanbol
> Issue Type: New Feature
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> The Stanbol Entityhub (intentionally) does not support Bnodes (RDF blank
> nodes) as those are not dereferenceable. Because of that, the Jena TDB
> indexing source has up to now ignored Bnodes both as subjects and objects -
> basically, if a triple contains any Bnode it is skipped.
> While adding support for Bnodes to the Entityhub is not possible, it is
> feasible to allow users to convert Bnode IDs to valid URIs by providing a
> prefix (or base URI) in the configuration of the Jena TDB indexing source.
> If this configuration is present, Bnodes of the indexed RDF graph will be
> converted to URIs by using "{bnode-prefix}{bnodId}". Users of this
> feature need to be aware that they change the RDF graph (and make
> previously local resources globally dereferenceable via the Entityhub RESTful
> API).
> There will be no default value for the bnode prefix. Users will need to
> explicitly define it (e.g. in the indexing.properties file). If no bnode
> prefix is configured, Bnodes will be skipped (current behavior). This also
> ensures backward compatibility.
> NOTE: This will not fix possible memory problems when importing RDF files
> that include BNodes into the Jena TDB indexing source, as Jena still needs
> to keep a lookup table of all BNodes referenced in the currently imported
> file.