Author: rwesten
Date: Mon Jun 2 08:02:59 2014
New Revision: 1599111
URL: http://svn.apache.org/r1599111
Log:
changed remaining keyword linking mentions to entity linking
Modified:
stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1599111&r1=1599110&r2=1599111&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Mon Jun 2
08:02:59 2014
@@ -67,7 +67,7 @@ Users of the Entityhub Indexing Tool wil
The indexing tool provides a default configuration for creating an [Apache
Solr](http://lucene.apache.org/solr/) index of RDF files (e.g. a SKOS export of
a thesaurus or a set of foaf files).
-To build the indexing tool from source - recommended - you will need to
checkout Apache Stanbol form SVN (or [download](../../downloads) a
source-release). Instructions for this can be found [here](tutorial.html).
However if you want to skip this you can also obtain a [binary
version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS
development server (search the sub-folders of the different versions for a file
named like
"<code>org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar</code>").
+To build the indexing tool from source - recommended - you will need to
check out Apache Stanbol from SVN (or [download](../../downloads) a
source release). Instructions for this can be found [here](tutorial.html).
However, if you want to skip this step, you can also obtain a [binary
version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS
development server (search the sub-folders of the different versions for a file
named like
"`org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar`").
In case you downloaded or "svn co" the source to {stanbol-source} and
successfully built the source as described in the [Tutorial](tutorial.html), you
still need to assemble the indexing tool.
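A minimal sketch of that assembly step, assuming the indexing tool lives in the
standard `entityhub/indexing/genericrdf` module of the Stanbol source tree:

    $ cd {stanbol-source}/entityhub/indexing/genericrdf
    $ mvn install
    $ mvn assembly:single

The resulting `*-jar-with-dependencies.jar` should then show up in the module's
`target/` folder.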
@@ -94,19 +94,19 @@ Initialize the tool with
This will create/initialize the default configuration for the Indexing Tool
including (relative to {indexing-working-dir}):
-* <code>/indexing/config</code>: Folder containing the default configuration
including the "indexing.properties" and "mappings.txt" file.
-* <code>/indexing/resources</code>: Folder with the source files used for
indexing including the "rdfdata" folder where you will need to copy the RDF
files to be indexed
-* <code>/indexing/destination</code>: Folder used to write the data during
the indexing process.
-* <code>/indexing/dist</code>: Folder where you will find the
<code>{name}.solrindex.zip</code> and
<code>org.apache.stanbol.data.site.{name}-{version}.jar</code> files needed to
install your index to the Apache Stanbol Entityhub.
-
-After the initialization you will need to provide the following configurations
in files located in the configuration folder
(<code>{indexing-working-dir}/indexing/config</code>)
-
-* Within the <code>indexing.properties</code> file you need to set the {name}
of your index by changing the value of the "name" property. In addition you
should also provide a "description". At the end of the indexing.properties file
you can also specify the license and attribution for the data you index. The
Apache Entityhub will ensure that those information will be included with any
entity data returned for requests.
-* Optionally, if your data do use namespaces that are not present in
[prefix.cc](http://prefix.cc) (or the server used for indexing does not have
internet connectivity) you can manually define required prefixes by
creating/using the a <code>indexing/config/namespaceprefix.mappings</code>
file. The syntax is '<code>'{prefix}\t{namespace}\n</code>' where
'<code>{prefix} ... [0..9A..Za..z-_]</code>' and '<code>{namespace} ... must
end with '#' or '/' for URLs and ':' for URNs</code>'.
-* Optionally, if the data you index do use some none common namespaces you
will need to add those to the <code>mapping.txt</code> file (here is an
[example](examples/anl-mappings.txt) including default and specific mappings
for one dataset)
-* Optionally, if you want to use a custom SolrCore configuration the core
configuration needs to be copied to the
<code>indexing/config/{core-name}</code>. Default configuration - to start from
- can be downloaded from the [Stanbol
SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/)
and extracted to the <code>indexing/config/</code> folder. If the {core-name}
is different from the 'name' configured in the <code>indexing.properties</code>
than the '<code>solrConf</code>' parameter of the
'<code>indexingDestination</code>' MUST be set to
'<code>solrConf:{core-name}</code>'. After those configurations users can make
custom adaptations to the SolrCore configuration used for indexing.
+* `/indexing/config`: Folder containing the default configuration, including
the "indexing.properties" and "mappings.txt" files.
+* `/indexing/resources`: Folder with the source files used for indexing,
including the "rdfdata" folder to which you will need to copy the RDF files to
be indexed.
+* `/indexing/destination`: Folder used to write the data during the indexing
process.
+* `/indexing/dist`: Folder where you will find the `{name}.solrindex.zip` and
`org.apache.stanbol.data.site.{name}-{version}.jar` files needed to install
your index to the Apache Stanbol Entityhub.
+
+After the initialization you will need to provide the following configurations
in files located in the configuration folder
(`{indexing-working-dir}/indexing/config`):
+
+* Within the `indexing.properties` file you need to set the {name} of your
index by changing the value of the "name" property. In addition, you should
also provide a "description". At the end of the `indexing.properties` file you
can also specify the license and attribution for the data you index. The Apache
Stanbol Entityhub will ensure that this information is included with any entity
data returned for requests.
+* Optionally, if your data use namespaces that are not present in
[prefix.cc](http://prefix.cc) (or the server used for indexing does not have
internet connectivity), you can manually define the required prefixes by
creating/editing an `indexing/config/namespaceprefix.mappings` file (see the
example after this list). The syntax is `{prefix}\t{namespace}\n`, where
`{prefix}` may only contain `[0-9A-Za-z-_]` and `{namespace}` must end with '#'
or '/' for URLs and ':' for URNs.
+* Optionally, if the data you index use some uncommon namespaces, you will
need to add those to the `mappings.txt` file (here is an
[example](examples/anl-mappings.txt) including default and specific mappings
for one dataset).
+* Optionally, if you want to use a custom SolrCore configuration, the core
configuration needs to be copied to `indexing/config/{core-name}`. A default
configuration to start from can be downloaded from the [Stanbol
SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/)
and extracted to the `indexing/config/` folder. If the {core-name} is
different from the 'name' configured in the `indexing.properties`, then the
'`solrConf`' parameter of the '`indexingDestination`' MUST be set to
'`solrConf:{core-name}`'. Once this is configured, users can make custom
adaptations to the SolrCore configuration used for indexing.
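For illustration, a minimal sketch of a `namespaceprefix.mappings` file as
mentioned above; the `exvoc` prefix and the `http://example.org/vocab/`
namespace are placeholders, and each line maps one prefix to one namespace,
separated by a tab:

    exvoc	http://example.org/vocab/
    exterm	urn:example:terms: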
-Finally you will also need to copy your source files into the source directory
<code>{indexing-working-dir}/indexing/resources/rdfdata</code>. All files
within this directory will be indexed. THe indexing tool support most common
RDF serialization. You can also directly index compressed RDF files.
+Finally you will also need to copy your source files into the source directory
`{indexing-working-dir}/indexing/resources/rdfdata`. All files within this
directory will be indexed. The indexing tool supports the most common RDF
serializations. You can also directly index compressed RDF files.
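For example (the file name `myvocabulary.skos.rdf.gz` is only a placeholder;
any supported RDF serialization, optionally compressed, can be used):

    $ cp myvocabulary.skos.rdf.gz {indexing-working-dir}/indexing/resources/rdfdata/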
For more details about possible configurations, please consult the
[README](https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md).
@@ -116,26 +116,26 @@ Once all source files are in place, you
$ cd {indexing-working-dir}
$ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
-Depending on your hardware and on complexity and size of your sources, it may
take several hours to built the index. As a result, you will get an archive of
an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGI
bundle to work with the index in Stanbol. Both files will be located within the
<code>indexing/dist</code> folder.
+Depending on your hardware and on the complexity and size of your sources, it
may take several hours to build the index. As a result, you will get an archive
of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGi
bundle to work with the index in Stanbol. Both files will be located within the
`indexing/dist` folder.
_IMPORTANT NOTES:_
* The import of the RDF files to the Jena TDB triple store - used as the source
for the indexing - takes a lot of time. Because of that, imported data are
reused for multiple runs of the indexing tool. This has two important effects
users need to be aware of:
- 1. Already imported RDF files should be removed from the
<code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid to
re-import them on every run of the tool. NOTE: newer versions of the Entityhub
indexing tool might automatically move successfully imported RDF files to a
different folder.
- 2. If the RDF data change you will need to delete the Jena TDB store so
that those changes are reflected in the created index. To do this delete the
<code>{indexing-working-dir}/indexing/resources/tdb</code> folder
+ 1. Already imported RDF files should be removed from
`{indexing-working-dir}/indexing/resources/rdfdata` to avoid re-importing them
on every run of the tool. NOTE: newer versions of the Entityhub indexing tool
might automatically move successfully imported RDF files to a different folder.
+ 2. If the RDF data change, you will need to delete the Jena TDB store so
that those changes are reflected in the created index. To do this, delete the
`{indexing-working-dir}/indexing/resources/tdb` folder (see the cleanup sketch
after these notes).
-* Also the destination folder
<code>{indexing-working-dir}/indexing/destination</code> is NOT deleted between
multiple calls to index. This has the effect that Entities indexed by previous
indexing calls are not deleted. While this allows to index a dataset in
multiple steps - or even to combine data of multiple datasets in a single index
- this also means that you will need to delete the destination folder if the
RDF data you index have changed - especially if some Entities where deleted.
+* Also, the destination folder `{indexing-working-dir}/indexing/destination` is
NOT deleted between multiple calls to index. This has the effect that Entities
indexed by previous indexing calls are not deleted. While this allows you to
index a dataset in multiple steps - or even to combine the data of multiple
datasets in a single index - it also means that you will need to delete the
destination folder if the RDF data you index have changed - especially if some
Entities were deleted.
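A minimal cleanup sketch for the case described in the notes above -
re-indexing after the source RDF data have changed:

    # remove the cached Jena TDB import and the previous indexing destination
    $ rm -rf {indexing-working-dir}/indexing/resources/tdb
    $ rm -rf {indexing-working-dir}/indexing/destination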
### Step 3: Initialize the index within Apache Stanbol
We assume that you already have a running Apache Stanbol instance at
http://{stanbol-host} and that {stanbol-working-dir} is the working directory
of that instance on the local hard disk. To install the created index you need
to:
-* copy the "{name}.solrindex.zip" file to the
<code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run
the 0.9.0-incubating version the path is
<code>{stanbol-working-dir}/sling/datafiles</code>).
-* install the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>
to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab
of the Apache Felix web console at
</code>http://{stanbol-host}/system/console/bundles</code>
+* copy the "{name}.solrindex.zip" file to the
`{stanbol-working-dir}/stanbol/datafiles` directory (NOTE: if you run the
0.9.0-incubating version the path is `{stanbol-working-dir}/sling/datafiles`).
+* install the `org.apache.stanbol.data.site.{name}-{version}.jar` to the OSGi
environment of your Stanbol instance, e.g. by using the Bundles tab of the
Apache Felix web console at `http://{stanbol-host}/system/console/bundles`.
-You find both files in the <code>{indexing-working-dir}/indexing/dist/</code>
folder.
+You will find both files in the `{indexing-working-dir}/indexing/dist/` folder.
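A minimal sketch of the copy step, assuming a current (non-0.9.0-incubating)
instance and the placeholders used above:

    $ cp {indexing-working-dir}/indexing/dist/{name}.solrindex.zip {stanbol-working-dir}/stanbol/datafiles/

The bundle itself is then installed through the Felix web console (or any other
OSGi deployment mechanism) as described above.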
After the installation your data will be available at
@@ -151,13 +151,13 @@ This section covers how to configure the
Generally, there are two possible ways to recognize entities of your
vocabulary:
1. __Named Entity Linking__: This first uses Named Entity Recognition (NER)
to spot "named entities" in the text and then tries to link those named
entities with entities defined in your vocabulary. This approach is limited to
entities of the types person, organization and place. So if your vocabulary
contains entities of other types, they will not be recognized. In addition, it
also requires the availability of NER for the language(s) of the processed
documents.
-2. __Keyword Linking__: This uses the labels of entities in your vocabulary
for the recognition and linking process. Natural Language Processing (NLP)
techniques such as part-of-speach (POS) detection can be used to improve
performance and results but this works also without NLP support. As extraction
and linking is based on labels mentioned in the analyzed content this method
has no restrictions regarding the types of your entities.
+2. __Entity Linking__: This uses the labels of entities in your vocabulary for
the recognition and linking process. Natural Language Processing (NLP)
techniques such as part-of-speech (POS) detection can be used to improve
performance and results, but this also works without NLP support. As extraction
and linking are based on labels mentioned in the analyzed content, this method
has no restrictions regarding the types of your entities.
For more information about this, you might also have a look at the introduction
of the [multi-lingual](multilingual) usage scenario.
_TIP_: If you are unsure which to use, you can also start with configuring
both options to give them a try.
-Depending on if you want to use named entity linking or keyword linking the
configuration of the [enhancement chain](components/enhancer/chains) and the
[enhancement engine](components/enhancer/engines) making use of your vocabulary
will be different.
+Depending on whether you want to use _named entity linking_ or _entity
linking_, the configuration of the [enhancement chain](components/enhancer/chains)
and the [enhancement engine](components/enhancer/engines) making use of your
vocabulary will be different. The following two sub-sections provide more
information on that.
### Configuring Named Entity Linking
@@ -166,7 +166,7 @@ For the configuration of this engine you
1. The "name" of the enhancement engine. It is recommended to use
"{name}Linking" - where {name} is the name of the Entityhub Site (ReferenceSite
or ManagedSite).
2. The name of the referenced site holding your vocabulary. Here you have to
configure the {name}.
-3. Enable/disable persons, organizations and places and if enabled configure
the <code>rdf:type</code> used by your vocabulary for those type. If you do not
want to restrict the type, you can also leave the type field empty.
+3. Enable/disable persons, organizations and places and, if enabled, configure
the `rdf:type` used by your vocabulary for those types. If you do not want to
restrict the type, you can also leave the type field empty.
4. Define the property used to match against the named entities detected by
the NER engine(s) in use.
For more detailed information please see the documentation of the [Named
Entity Tagging
Engine](components/enhancer/engines/namedentitytaggingengine.html).
@@ -198,7 +198,7 @@ To use _Entity Linking_ with a custom Vo
* in case of the Entityhub Linking Engine the "Label Field" needs to be
set to the URI of the property holding the labels. You can only use a single
field. If you want to use the values of several fields, you need to adapt your
indexing configuration to copy the values of those fields into a single one
(e.g. by adding `skos:prefLabel > rdfs:label` and `skos:altLabel > rdfs:label`
to the `{indexing-working-dir}/indexing/config/mappings.txt` config).
* in case of the FST Linking engine you need to provide the [FST Tagging
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration).
If you store your labels in the `rdfs:label` field and you want to support all
languages present in your vocabulary, use `*;field=rdfs:label;generate=true`.
_NOTE_ that `generate=true` is required to allow the engine to (re)create FST
models at runtime.
4. The "Link ProperNouns only" option: If the custom Vocabulary contains Proper
Nouns (Named Entities), then this parameter should be activated. This option
causes the Entity Linking process to not make queries for common nouns, thereby
reducing the number of queries against the controlled vocabulary by ~70%.
However, this is not feasible if the vocabulary does contain Entities that are
common nouns in the language.
-5. The "Type Mappings" might be interesting for you if your vocabulary
contains custom types as those mappings can be used to map 'rdf:type's of
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text. See the [type mapping
syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax)
and the [usage scenario for the Apache Stanbol Enhancement
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support)
for details.
+5. The "Type Mappings" might be interesting for you if your vocabulary
contains custom types, as those mappings can be used to map the `rdf:type`s of
entities in your vocabulary to the `dc:type`s used for `fise:TextAnnotation`s -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text (see the illustrative sketch after this list). See
the [type mapping
syntax](components/enhancer/engines/entitylinking.html#type-mappings-syntax)
and the [usage scenario for the Apache Stanbol Enhancement
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support)
for details.
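For illustration only, a hypothetical type mapping, assuming a custom
vocabulary under an `ex:` namespace and the `{source-type} > {target-type}`
form described in the type mapping syntax documentation linked above (all
prefixes and types here are placeholders):

    ex:ResearchInstitute > dbp-ont:Organisation
    ex:Project > skos:Concept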
The following example shows an [enhancement
chain](components/enhancer/chains) using OpenNLP for NLP