On Fri, Nov 9, 2012 at 7:30 AM, Walter Kasper <[email protected]> wrote:
> Hi,
>
> Rupert Westenthaler wrote:
>
>> Hi Walter, all
>>
>> I had already a look at the htmlextractor and I think it is a nice
>> addition to Stanbol!
>>
>> I would be interested in an Engine that does not only extract embedded
>> knowledge, but also keeps the link to the actual position within the
>> parsed Content. In more detail I would like to link the extracted
>> knowledge with an fise:Enhancement (e.g. a fise:TextAnnotation) that
>> selects the annotated part of the content.
>>
>> This would not only allow to have the extracted knowledge in the
>> metadata of the ContentItem, but also allow EnhancementEngines to
>> process those information in the same way as if they would be
>> extracted by an other engine (e.g. linking an RDFa annotation about an
>> Person, Place in the same way as an Person, Place detected by an NER
>> engine).
>>
>
> I think that could be done.
>
>> Jukka Zitting presentation "Content extraction with Apache Tika" [1]
>> at the ApacheCon included a nice example on how to extract the text of
>> an Link. I think this is a nice starting point for such an feature.
>>
>> Generally I think it would be better to add RDFa, Micro Data support
>> to directly to Tika instead of implementing custom solutions within
>> Stanbol. WDYT?
>>
>
> Tika currently is not suitable for RDFa extraction etc. because its HTML
> parser (TagSoup) throws away all namespace declarations needed for the RDF.
>

You might want to consider any23 [1], another Apache project, which can
extract RDFa and other semantic markup from HTML.
There are also some independent RDFa parsers you can use in Java, such as [2].

Steph.

[1] http://any23.apache.org/extractors.html
[2] https://github.com/niklasl/clj-rdfa-jena

> Best regards,
>
> Walter
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.: +49-681-85775-5300
> Fax: +49-681-85775-5338
> Email: [email protected]
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>

--
Steph.
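
P.S. To illustrate why the namespace declarations matter for RDFa, here is a
minimal sketch in Python (standard library only). It is not how Tika, TagSoup,
or any23 work internally: the class and function names are made up for this
example, the sample markup is invented, prefix scoping is ignored, and the
keep_namespaces flag only simulates a parser that drops the declarations.
An RDFa property such as foaf:name can be expanded to a full IRI only if the
xmlns:foaf declaration survives parsing.

```python
from html.parser import HTMLParser


class RdfaPropertyCollector(HTMLParser):
    """Hypothetical helper: collects xmlns:* prefixes and RDFa `property` CURIEs."""

    def __init__(self, keep_namespaces=True):
        super().__init__()
        # keep_namespaces=False simulates a parser that discards declarations.
        self.keep_namespaces = keep_namespaces
        self.prefixes = {}    # prefix -> namespace IRI (flat, no scoping)
        self.properties = []  # raw CURIEs found in `property` attributes

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("xmlns:"):
                if self.keep_namespaces:
                    self.prefixes[name[len("xmlns:"):]] = value
            elif name == "property" and value:
                self.properties.append(value)


def expand_curie(curie, prefixes):
    """Return the full IRI for a CURIE, or None if its prefix is unknown."""
    prefix, _, local = curie.partition(":")
    base = prefixes.get(prefix)
    return base + local if base is not None else None


def extract_property_iris(html, keep_namespaces=True):
    """Parse the markup and expand every collected property CURIE."""
    collector = RdfaPropertyCollector(keep_namespaces)
    collector.feed(html)
    collector.close()
    return [expand_curie(c, collector.prefixes) for c in collector.properties]


# Invented sample markup, for illustration only.
SAMPLE = (
    '<div xmlns:foaf="http://xmlns.com/foaf/0.1/" about="#me">'
    '<span property="foaf:name">Alice</span></div>'
)

if __name__ == "__main__":
    print(extract_property_iris(SAMPLE))
    print(extract_property_iris(SAMPLE, keep_namespaces=False))
```

With the declarations kept, the property resolves to a FOAF IRI; with them
dropped, the prefix cannot be resolved and the sketch returns None for it.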
