Author: rwesten
Date: Mon Jun 2 08:02:59 2014
New Revision: 1599111
URL: http://svn.apache.org/r1599111
Log:
changed remaining keyword linking mentions to entity linking
Modified:
stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1599111&r1=1599110&r2=1599111&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Mon Jun 2
08:02:59 2014
@@ -67,7 +67,7 @@ Users of the Entityhub Indexing Tool wil
The indexing tool provides a default configuration for creating an [Apache
Solr](http://lucene.apache.org/solr/) index of RDF files (e.g. a SKOS export of
a thesaurus or a set of foaf files).
-To build the indexing tool from source - recommended - you will need to
checkout Apache Stanbol form SVN (or [download](../../downloads) a
source-release). Instructions for this can be found [here](tutorial.html).
However if you want to skip this you can also obtain a [binary
version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS
development server (search the sub-folders of the different versions for a file
named like
"<code>org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar</code>").
+To build the indexing tool from source - recommended - you will need to
check out Apache Stanbol from SVN (or [download](../../downloads) a
source release). Instructions for this can be found [here](tutorial.html).
However, if you want to skip this step, you can also obtain a [binary
version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS
development server (search the sub-folders of the different versions for a file
named like
"`org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar`").
In case you downloaded or "svn co" the source to {stanbol-source} and
successfully built the source as described in the [Tutorial](tutorial.html), you
still need to assemble the indexing tool.
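A minimal sketch of that assembly step, assuming the indexing tool lives in the
standard `entityhub/indexing/genericrdf` module of the Stanbol source tree:

    $ cd {stanbol-source}/entityhub/indexing/genericrdf
    $ mvn install
    $ mvn assembly:single

The resulting `*-jar-with-dependencies.jar` should then show up in the module's
`target/` folder.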
@@ -94,19 +94,19 @@ Initialize the tool with
This will create/initialize the default configuration for the Indexing Tool
including (relative to {indexing-working-dir}):
-* <code>/indexing/config</code>: Folder containing the default configuration
including the "indexing.properties" and "mappings.txt" file.
-* <code>/indexing/resources</code>: Folder with the source files used for
indexing including the "rdfdata" folder where you will need to copy the RDF
files to be indexed
-* <code>/indexing/destination</code>: Folder used to write the data during
the indexing process.
-* <code>/indexing/dist</code>: Folder where you will find the
<code>{name}.solrindex.zip</code> and
<code>org.apache.stanbol.data.site.{name}-{version}.jar</code> files needed to
install your index to the Apache Stanbol Entityhub.
-
-After the initialization you will need to provide the following configurations
in files located in the configuration folder
(<code>{indexing-working-dir}/indexing/config</code>)
-
-* Within the <code>indexing.properties</code> file you need to set the {name}
of your index by changing the value of the "name" property. In addition you
should also provide a "description". At the end of the indexing.properties file
you can also specify the license and attribution for the data you index. The
Apache Entityhub will ensure that those information will be included with any
entity data returned for requests.
-* Optionally, if your data do use namespaces that are not present in
[prefix.cc](http://prefix.cc) (or the server used for indexing does not have
internet connectivity) you can manually define required prefixes by
creating/using the a <code>indexing/config/namespaceprefix.mappings</code>
file. The syntax is '<code>'{prefix}\t{namespace}\n</code>' where
'<code>{prefix} ... [0..9A..Za..z-_]</code>' and '<code>{namespace} ... must
end with '#' or '/' for URLs and ':' for URNs</code>'.
-* Optionally, if the data you index do use some none common namespaces you
will need to add those to the <code>mapping.txt</code> file (here is an
[example](examples/anl-mappings.txt) including default and specific mappings
for one dataset)
-* Optionally, if you want to use a custom SolrCore configuration the core
configuration needs to be copied to the
<code>indexing/config/{core-name}</code>. Default configuration - to start from
- can be downloaded from the [Stanbol
SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/)
and extracted to the <code>indexing/config/</code> folder. If the {core-name}
is different from the 'name' configured in the <code>indexing.properties</code>
than the '<code>solrConf</code>' parameter of the
'<code>indexingDestination</code>' MUST be set to
'<code>solrConf:{core-name}</code>'. After those configurations users can make
custom adaptations to the SolrCore configuration used for indexing.
+* `/indexing/config`: Folder containing the default configuration, including
the "indexing.properties" and "mappings.txt" files.
+* `/indexing/resources`: Folder with the source files used for indexing,
including the "rdfdata" folder to which you will need to copy the RDF files to
be indexed.
+* `/indexing/destination`: Folder used to write the data during the indexing
process.
+* `/indexing/dist`: Folder where you will find the `{name}.solrindex.zip` and
`org.apache.stanbol.data.site.{name}-{version}.jar` files needed to install
your index to the Apache Stanbol Entityhub.
+
+After the initialization you will need to provide the following configurations
in files located in the configuration folder
(`{indexing-working-dir}/indexing/config`):
+
+* Within the `indexing.properties` file you need to set the {name} of your
index by changing the value of the "name" property. In addition, you should
also provide a "description". At the end of the `indexing.properties` file you
can also specify the license and attribution for the data you index. The Apache
Stanbol Entityhub will ensure that this information is included with any entity
data returned for requests.
+* Optionally, if your data use namespaces that are not present in
[prefix.cc](http://prefix.cc) (or the server used for indexing does not have
internet connectivity), you can manually define the required prefixes by
creating/editing an `indexing/config/namespaceprefix.mappings` file (see the
example after this list). The syntax is `{prefix}\t{namespace}\n`, where
`{prefix}` may only contain `[0-9A-Za-z-_]` and `{namespace}` must end with '#'
or '/' for URLs and ':' for URNs.
+* Optionally, if the data you index use some uncommon namespaces, you will
need to add those to the `mappings.txt` file (here is an
[example](examples/anl-mappings.txt) including default and specific mappings
for one dataset).
+* Optionally, if you want to use a custom SolrCore configuration, the core
configuration needs to be copied to `indexing/config/{core-name}`. A default
configuration to start from can be downloaded from the [Stanbol
SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/)
and extracted to the `indexing/config/` folder. If the {core-name} is
different from the 'name' configured in the `indexing.properties`, then the
'`solrConf`' parameter of the '`indexingDestination`' MUST be set to
'`solrConf:{core-name}`'. Once this is configured, users can make custom
adaptations to the SolrCore configuration used for indexing.
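For illustration, a minimal sketch of a `namespaceprefix.mappings` file as
mentioned above; the `exvoc` prefix and the `http://example.org/vocab/`
namespace are placeholders, and each line maps one prefix to one namespace,
separated by a tab:

    exvoc	http://example.org/vocab/
    exterm	urn:example:terms: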
-Finally you will also need to copy your source files into the source directory
<code>{indexing-working-dir}/indexing/resources/rdfdata</code>. All files
within this directory will be indexed. THe indexing tool support most common
RDF serialization. You can also directly index compressed RDF files.
+Finally you will also need to copy your source files into the source directory
`{indexing-working-dir}/indexing/resources/rdfdata`. All files within this
directory will be indexed. The indexing tool supports the most common RDF
serializations. You can also directly index compressed RDF files.
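For example (the file name `myvocabulary.skos.rdf.gz` is only a placeholder;
any supported RDF serialization, optionally compressed, can be used):

    $ cp myvocabulary.skos.rdf.gz {indexing-working-dir}/indexing/resources/rdfdata/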
For more details about possible configurations, please consult the
[README](https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md).
@@ -116,26 +116,26 @@ Once all source files are in place, you
$ cd {indexing-working-dir}
$ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
-Depending on your hardware and on complexity and size of your sources, it may
take several hours to built the index. As a result, you will get an archive of
an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGI
bundle to work with the index in Stanbol. Both files will be located within the
<code>indexing/dist</code> folder.
+Depending on your hardware and on the complexity and size of your sources, it
may take several hours to build the index. As a result, you will get an archive
of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGi
bundle to work with the index in Stanbol. Both files will be located within the
`indexing/dist` folder.
_IMPORTANT NOTES:_
* The import of the RDF files to the Jena TDB triple store - used as the source
for the indexing - takes a lot of time. Because of that, imported data are
reused for multiple runs of the indexing tool. This has two important effects
users need to be aware of:
- 1. Already imported RDF files should be removed from the
<code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid to
re-import them on every run of the tool. NOTE: newer versions of the Entityhub
indexing tool might automatically move successfully imported RDF files to a
different folder.
- 2. If the RDF data change you will need to delete the Jena TDB store so
that those changes are reflected in the created index. To do this delete the
<code>{indexing-working-dir}/indexing/resources/tdb</code> folder
+ 1. Already imported RDF files should be removed from
`{indexing-working-dir}/indexing/resources/rdfdata` to avoid re-importing them
on every run of the tool. NOTE: newer versions of the Entityhub indexing tool
might automatically move successfully imported RDF files to a different folder.
+ 2. If the RDF data change, you will need to delete the Jena TDB store so
that those changes are reflected in the created index. To do this, delete the
`{indexing-working-dir}/indexing/resources/tdb` folder (see the cleanup sketch
after these notes).
-* Also the destination folder
<code>{indexing-working-dir}/indexing/destination</code> is NOT deleted between
multiple calls to index. This has the effect that Entities indexed by previous
indexing calls are not deleted. While this allows to index a dataset in
multiple steps - or even to combine data of multiple datasets in a single index
- this also means that you will need to delete the destination folder if the
RDF data you index have changed - especially if some Entities where deleted.
+* Also, the destination folder `{indexing-working-dir}/indexing/destination` is
NOT deleted between multiple calls to index. This has the effect that Entities
indexed by previous indexing calls are not deleted. While this allows you to
index a dataset in multiple steps - or even to combine the data of multiple
datasets in a single index - it also means that you will need to delete the
destination folder if the RDF data you index have changed - especially if some
Entities were deleted.
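A minimal cleanup sketch for the case described in the notes above -
re-indexing after the source RDF data have changed:

    # remove the cached Jena TDB import and the previous indexing destination
    $ rm -rf {indexing-working-dir}/indexing/resources/tdb
    $ rm -rf {indexing-working-dir}/indexing/destination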
### Step 3: Initialize the index within Apache Stanbol
We assume that you already have a running Apache Stanbol instance at
http://{stanbol-host} and that {stanbol-working-dir} is the working directory
of that instance on the local hard disk. To install the created index you need
to:
-* copy the "{name}.solrindex.zip" file to the
<code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run
the 0.9.0-incubating version the path is
<code>{stanbol-working-dir}/sling/datafiles</code>).
-* install the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>
to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab
of the Apache Felix web console at
</code>http://{stanbol-host}/system/console/bundles</code>
+* copy the "{name}.solrindex.zip" file to the
`{stanbol-working-dir}/stanbol/datafiles` directory (NOTE: if you run the
0.9.0-incubating version the path is `{stanbol-working-dir}/sling/datafiles`).
+* install the `org.apache.stanbol.data.site.{name}-{version}.jar` to the OSGi
environment of your Stanbol instance, e.g. by using the Bundles tab of the
Apache Felix web console at `http://{stanbol-host}/system/console/bundles`.
-You find both files in the <code>{indexing-working-dir}/indexing/dist/</code>
folder.
+You will find both files in the `{indexing-working-dir}/indexing/dist/` folder.
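A minimal sketch of the copy step, assuming a current (non-0.9.0-incubating)
instance and the placeholders used above:

    $ cp {indexing-working-dir}/indexing/dist/{name}.solrindex.zip {stanbol-working-dir}/stanbol/datafiles/

The bundle itself is then installed through the Felix web console (or any other
OSGi deployment mechanism) as described above.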
After the installation your data will be available at
@@ -151,13 +151,13 @@ This section covers how to configure the
Generally, there are two possible ways to recognize entities of your
vocabulary:
1. __Named Entity Linking__: This first uses Named Entity Recognition (NER)
to spot "named entities" in the text and then tries to link those named
entities with entities defined in your vocabulary. This approach is limited to
entities of the types person, organization and place. So if your vocabulary
contains entities of other types, they will not be recognized. In addition, it
also requires the availability of NER for the language(s) of the processed
documents.
-2. __Keyword Linking__: This uses the labels of entities in your vocabulary
for the recognition and linking process. Natural Language Processing (NLP)
techniques such as part-of-speach (POS) detection can be used to improve
performance and results but this works also without NLP support. As extraction
and linking is based on labels mentioned in the analyzed content this method
has no restrictions regarding the types of your entities.
+2. __Entity Linking__: This uses the labels of entities in your vocabulary for
the recognition and linking process. Natural Language Processing (NLP)
techniques such as part-of-speech (POS) detection can be used to improve
performance and results, but this also works without NLP support. As extraction
and linking are based on labels mentioned in the analyzed content, this method
has no restrictions regarding the types of your entities.
For more information about this, you might also have a look at the introduction
of the [multi-lingual](multilingual) usage scenario.
_TIP_: If you are unsure which to use, you can also start with configuring
both options to give them a try.
-Depending on if you want to use named entity linking or keyword linking the
configuration of the [enhancement chain](components/enhancer/chains) and the
[enhancement engine](components/enhancer/engines) making use of your vocabulary
will be different.
+Depending on whether you want to use _named entity linking_ or _entity
linking_, the configuration of the [enhancement chain](components/enhancer/chains)
and the [enhancement engine](components/enhancer/engines) making use of your
vocabulary will be different. The following two sub-sections provide more
information on that.
### Configuring Named Entity Linking
@@ -166,7 +166,7 @@ For the configuration of this engine you
1. The "name" of the enhancement engine. It is recommended to use
"{name}Linking" - where {name} is the name of the Entityhub Site (ReferenceSite
or ManagedSite).
2. The name of the referenced site holding your vocabulary. Here you have to
configure the {name}.
-3. Enable/disable persons, organizations and places and if enabled configure
the <code>rdf:type</code> used by your vocabulary for those type. If you do not
want to restrict the type, you can also leave the type field empty.
+3. Enable/disable persons, organizations and places and, if enabled, configure
the `rdf:type` used by your vocabulary for those types. If you do not want to
restrict the type, you can also leave the type field empty.
4. Define the property used to match against the named entities detected by
the NER engine(s) in use.
For more detailed information please see the documentation of the [Named
Entity Tagging
Engine](components/enhancer/engines/namedentitytaggingengine.html).
@@ -198,7 +198,7 @@ To use _Entity Linking_ with a custom Vo
* in case of the Entityhub Linking Engine the "Label Field" needs to be
set to the URI of the property holding the labels. You can only use a single
field. If you want to use the values of several fields, you need to adapt your
indexing configuration to copy the values of those fields into a single one
(e.g. by adding `skos:prefLabel > rdfs:label` and `skos:altLabel > rdfs:label`
to the `{indexing-working-dir}/indexing/config/mappings.txt` config).
* in case of the FST Linking engine you need to provide the [FST Tagging
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration).
If you store your labels in the `rdfs:label` field and you want to support all
languages present in your vocabulary, use `*;field=rdfs:label;generate=true`.
_NOTE_ that `generate=true` is required to allow the engine to (re)create FST
models at runtime.
4. The "Link ProperNouns only" option: If the custom Vocabulary contains Proper
Nouns (Named Entities), then this parameter should be activated. This option
causes the Entity Linking process to not make queries for common nouns, thereby
reducing the number of queries against the controlled vocabulary by ~70%.
However, this is not feasible if the vocabulary does contain Entities that are
common nouns in the language.
-5. The "Type Mappings" might be interesting for you if your vocabulary
contains custom types as those mappings can be used to map 'rdf:type's of
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text. See the [type mapping
syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax)
and the [usage scenario for the Apache Stanbol Enhancement
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support)
for details.
+5. The "Type Mappings" might be interesting for you if your vocabulary
contains custom types, as those mappings can be used to map the `rdf:type`s of
entities in your vocabulary to the `dc:type`s used for `fise:TextAnnotation`s -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text (see the illustrative sketch after this list). See
the [type mapping
syntax](components/enhancer/engines/entitylinking.html#type-mappings-syntax)
and the [usage scenario for the Apache Stanbol Enhancement
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support)
for details.
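For illustration only, a hypothetical type mapping, assuming a custom
vocabulary under an `ex:` namespace and the `{source-type} > {target-type}`
form described in the type mapping syntax documentation linked above (all
prefixes and types here are placeholders):

    ex:ResearchInstitute > dbp-ont:Organisation
    ex:Project > skos:Concept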
The following example shows an [enhancement
chain](components/enhancer/chains) using OpenNLP for NLP