Hi,

Rupert Westenthaler wrote:
Hi Walter, all

I have already had a look at the htmlextractor and I think it is a nice
addition to Stanbol!

I would be interested in an Engine that not only extracts embedded
knowledge, but also keeps the link to the actual position within the
parsed Content. In more detail, I would like to link the extracted
knowledge with a fise:Enhancement (e.g. a fise:TextAnnotation) that
selects the annotated part of the content.

This would not only make the extracted knowledge available in the
metadata of the ContentItem, but also allow EnhancementEngines to
process that information in the same way as if it had been
extracted by another engine (e.g. linking an RDFa annotation about a
Person or Place in the same way as a Person or Place detected by an NER
engine).

I think that could be done.


Jukka Zitting's presentation "Content extraction with Apache Tika" [1]
at ApacheCon included a nice example of how to extract the text of
a link. I think this is a good starting point for such a feature.

Generally, I think it would be better to add RDFa and Microdata support
directly to Tika instead of implementing custom solutions within
Stanbol. WDYT?

Tika is currently not suitable for RDFa extraction and the like, because its HTML parser (TagSoup) throws away all the namespace declarations that are needed to build the RDF.
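
To illustrate why those declarations matter, here is a minimal sketch that uses only the JDK's namespace-aware SAX parser on well-formed XHTML. It does not use Tika or TagSoup, and the class and method names (RdfaPrefixDemo, expandProperties) are made up for this example. It expands the prefixed names found in RDFa "property" attributes into full URIs, which only works while the xmlns declarations are still present; the second input mimics a document whose declarations have been dropped.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Stanbol or Tika.
class RdfaPrefixDemo {

    /** Records prefix declarations and expands "prefix:local" values of RDFa property attributes. */
    static class PrefixAwareHandler extends DefaultHandler {
        // Simplification: prefixes are never removed again, so element scoping is ignored.
        final Map<String, String> prefixes = new HashMap<>();
        // Maps the raw attribute value to the expanded URI, or to null if the prefix is unknown.
        final Map<String, String> expanded = new HashMap<>();

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            // Called for every xmlns declaration, before the startElement event of its element.
            prefixes.put(prefix, uri);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String property = atts.getValue("", "property");
            if (property == null) {
                return;
            }
            int colon = property.indexOf(':');
            if (colon < 0) {
                return;
            }
            String namespace = prefixes.get(property.substring(0, colon));
            expanded.put(property, namespace == null ? null : namespace + property.substring(colon + 1));
        }
    }

    /** Parses well-formed XHTML and returns the expanded property names. */
    static Map<String, String> expandProperties(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PrefixAwareHandler handler = new PrefixAwareHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.expanded;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String withDeclaration =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:foaf=\"http://xmlns.com/foaf/0.1/\">"
            + "<body><span property=\"foaf:name\">Alice</span></body></html>";
        // Same markup, but without any xmlns declarations.
        String withoutDeclaration =
            "<html><body><span property=\"foaf:name\">Alice</span></body></html>";

        System.out.println(expandProperties(withDeclaration));
        System.out.println(expandProperties(withoutDeclaration));
    }
}
```

In the second case the prefix "foaf" cannot be resolved, so there is no full property URI from which an RDF triple could be built.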

Best regards,

Walter

--
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: [email protected]
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------
