Re: InsideOut10 contributed engines

Rupert Westenthaler Wed, 12 Dec 2012 03:46:46 -0800

Hi all,

Just a short update on this.

## Freeling Integration

In the meantime I exchanged 30+ mails with David to get freeling [3]
installed on my Mac. After fixing several issues with library
versions, the build process and finally runtime problems, I now have a
working version on my Mac. I am currently familiarizing myself with the
freeling API while checking how best to align the freeling framework
with the Stanbol NLP processing module.

* Freeling Language Identification: This would only add detection
for "ca" and "gl" to what is currently supported by the langdetect
engine. In addition, the example for "sr" (Serbian) is classified as
"mk" (Macedonian).

* Freeling PoS Tagging: While the code performs all available analysis,
currently only Nouns are extracted and used. A full integration with
the Stanbol NLP processing will require considerable refactoring and
extension.

Based on my current understanding I would:

* create a {trunk}/enhancement-engines/freeling folder with the
following modules
* freeling-definitions: Module that defines all the constants and
mapping. e.g. the mappings for the POS Tags used by Freeling to the
Olia Ontology used by the Stanbol NLP processing. The reason for
having this in its own module is that Freeling is GPL and this module,
while representing a major part of the work, will not need to depend
on the Freeling API. This will therefore allow us to release those
things under an Apache License.
* freeling-service: Initialises Freeling framework and registers the
Freeling Analysis services with the OSGI ServiceAdmin. This module
will allow for configuring Freeling (such as providing the path to the
Freeling configuration and the native libraries) and also allow other
modules to lookup Freeling functionality by using @Reference
annotations and/or OSGI ServiceTracker.
* freeling-langid: Same as the contributed engine
* freeling-analyse: Does all the analysis steps (token, sentence, pos,
chunk, ner, lemma ...) and stores the results to the AnalyzedText.
Splitting up the different analysis steps is not possible as the JNI
wrapper does not allow to manually construct elements (e.g. a WordList
needed as input for the sentence detector). However the
LanguageConfiguration utility, part of the stanbol.enhancer.nlp module,
will be used to allow activating or deactivating analysis steps by
default and/or for specific languages (e.g. *;ner=false, en;ner=true),
and the AnalyzedText already supports merging of results provided by
different engines (e.g. if you add a Chunk that already exists, the
existing Chunk will be returned and annotations of different Engines
will be merged).
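
To illustrate the per-language switch syntax mentioned above (*;ner=false, en;ner=true), here is a minimal sketch in plain Java. It is NOT the Stanbol LanguageConfiguration utility. The class name LangConfigSketch, the method isEnabled and the rule that an unconfigured step counts as enabled are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the Stanbol LanguageConfiguration class. */
final class LangConfigSketch {

    /** language (or "*" for the default) -> parameter name -> value */
    private final Map<String, Map<String, String>> params = new HashMap<>();

    /** Each entry has the form "lang;key=value;key=value", e.g. "*;ner=false". */
    LangConfigSketch(String... entries) {
        for (String entry : entries) {
            String[] parts = entry.trim().split(";");
            String lang = parts[0].trim();
            Map<String, String> p = params.computeIfAbsent(lang, k -> new HashMap<>());
            for (int i = 1; i < parts.length; i++) {
                String[] kv = parts[i].split("=", 2);
                if (kv.length == 2) {
                    p.put(kv[0].trim(), kv[1].trim());
                }
            }
        }
    }

    /**
     * A language-specific setting wins over the "*" default.
     * Assumption: a step that is not configured at all is enabled.
     */
    boolean isEnabled(String lang, String step) {
        String value = lookup(lang, step);
        if (value == null) {
            value = lookup("*", step);
        }
        return value == null || Boolean.parseBoolean(value);
    }

    private String lookup(String lang, String step) {
        Map<String, String> p = params.get(lang);
        return p == null ? null : p.get(step);
    }

    public static void main(String[] args) {
        LangConfigSketch cfg = new LangConfigSketch("*;ner=false", "en;ner=true");
        System.out.println("en ner: " + cfg.isEnabled("en", "ner"));
        System.out.println("de ner: " + cfg.isEnabled("de", "ner"));
        System.out.println("de pos: " + cfg.isEnabled("de", "pos"));
    }
}
```

With such a lookup a single freeling-analyse engine could skip individual steps per language when writing results to the AnalyzedText, even though the steps themselves cannot be split into separate engines.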

### License considerations:

Freeling is licensed under GPLv3, a license that is NOT compatible with
Apache. This is also true for the Java API (a JNI wrapper over the
native code) and all modules that depend on this API - basically the
freeling-** modules.

For Freeling itself this is not a big issue, as Freeling needs to be
downloaded, compiled and installed separately on the machine running
the Stanbol freeling engines anyway. The main hurdle is the Java API
we need to link against. Because of that, at least freeling-service,
freeling-langid and freeling-analyse will not be releasable by
Apache Stanbol.

As I do not have any experience in dealing with situations like
this, comments would be very welcome.

## TextAnnotations New Model

We need to define an Issue for switching to this model. My suggestion
is that we make the change immediately after the next 0.10.0 release.

### Some explanation about the new fise:TextAnnotation model

See also the documentation at the end of the fise:TextAnnotation section at [2]

This will change fise:TextAnnotations to adopt

* fise:selection-prefix: some words/characters before the selected section.
* fise:selection-suffix: some words/characters after the selected section.

In addition it will introduce

* fise:selection-head: the first few words/characters of the selected
section within the text.
* fise:selection-tail: the last few words/characters of the selected
section. To be used together with fise:selection-head.

Those two properties are alternatives to fise:selected-text and
intended to be used in cases where EnhancementEngines want to select
whole sentences, paragraphs or even sections of the text. The main
intention is to avoid repeating long parts of the content as a literal
in the RDF graph. Word and Phrase level annotations will not be
affected by this.
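
The choice between fise:selected-text and the head/tail pair can be sketched as follows. This is a minimal illustration, not Stanbol code: the class SelectionSketch, the method describe and both constants are hypothetical, and the lengths are counted in characters purely as an assumption (the description above only says "some words/characters").

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of choosing the selection properties; not Stanbol code. */
final class SelectionSketch {

    /** Assumed number of characters used for prefix, suffix, head and tail. */
    static final int CONTEXT = 10;
    /** Assumed threshold above which the selected text is not repeated in full. */
    static final int MAX_SELECTED = 40;

    /** Returns property name -> literal value for the selection [start, end). */
    static Map<String, String> describe(String text, int start, int end) {
        if (start < 0 || end > text.length() || start >= end) {
            throw new IllegalArgumentException("invalid selection");
        }
        Map<String, String> props = new LinkedHashMap<>();
        // Characters directly before and after the selection.
        props.put("fise:selection-prefix",
                text.substring(Math.max(0, start - CONTEXT), start));
        props.put("fise:selection-suffix",
                text.substring(end, Math.min(text.length(), end + CONTEXT)));
        if (end - start <= MAX_SELECTED) {
            // Word and phrase level: keep the full selected text.
            props.put("fise:selected-text", text.substring(start, end));
        } else {
            // Sentence or paragraph level: only the first and last characters.
            props.put("fise:selection-head", text.substring(start, start + CONTEXT));
            props.put("fise:selection-tail", text.substring(end - CONTEXT, end));
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(describe("Hello Paris today", 6, 11));
    }
}
```

A client would locate a long selection by finding the head after the prefix and the tail before the suffix, so the full sentence or paragraph never has to appear as a literal in the graph.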

The fise:selection-context will still be supported, but its semantics
will change to describe the part of the content that was used
as context for the annotation. Its use for identifying the correct
location of the annotation within the text will be discouraged after
this change.

### Contributed Engine

I strongly suggest accepting this engine, as it provides a good
solution in case users want to use Engines that still use the current
model with client side code that is written for the new model.

## Freebase Entity Recognition

I have not yet had time to look into this in more detail. One thing I
would like to check is whether it is feasible to implement the
EntitySearcher [5] used by the new EntityLinkingEngine, because if so
the integration should be really straightforward. In addition, future
enhancements to the EntityLinking process would automatically also
apply to this engine.

David, could you have a look at the EntitySearcher [5] interface? You
can also use the EntityhubLinkingEngine (docs: [6], source: [7]) as an
example of how to use the generic EntityLinkingEngine with a specific
EntitySearcher implementation.

## Schema.org Refactorer

I have had no time to look at this yet.

# Next Steps

* I plan to create JIRA issues for the tasks as described above. I
will mark them as replacing STANBOL-807 [4]. Sorry, I need to create new
issues as STANBOL-807 is too broad in scope.
* AFAIK we do need code contributions uploaded as archives to JIRA. So
if nobody replies to this claiming otherwise, I will ask David to
formally contribute the modules to those Issues.
* Resolve license issues with the GPL licensed Freeling

I plan to work on the Freeling stuff first, mainly because I expect
this work to be a very welcome opportunity to validate the Stanbol NLP
processing API.

Thanks for contributing this to the Stanbol Community

best
Rupert

On Thu, Nov 15, 2012 at 8:20 AM, David Riccitelli <[email protected]> wrote:

> [2]
> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
> [3] http://nlp.lsi.upc.edu/freeling/
> [4] https://issues.apache.org/jira/browse/STANBOL-807

[5]
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#entitysearcher
[6]
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[7]
http://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/entityhublinking/src/main/java/org/apache/stanbol/enhancer/engines/entityhublinking/EntityhubLinkingEngine.java
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
