Re: How to extract entities from documents using Microdata?

Rüdiger Kurz Sun, 17 Mar 2013 05:09:12 -0700

Hi Rupert,

many thanks to you as well ...

Overall, I think I first need to start experimenting with the Stanbol web interface. Anyway, I have added some comments and questions below ...


On 17.03.2013 11:12, Rupert Westenthaler wrote:

Hi Rüdiger

On Sat, Mar 16, 2013 at 4:56 PM, Rüdiger Kurz <[email protected]> wrote:

Hi Walter,

thanks for the quick reply. Are the extracted entities from the
htmlextractor enhancement engine automatically stored into the entity hub?


There is no such component that stores Entities present in the
ContentItem to the Entityhub, as this is a very uncommon use case.
Typically entities extracted from parsed content are considered as
suggestions.  So one would consider an additional user interaction
(e.g. accepting & storing an Entity) before adding them to the
Entityhub.

However if this is your use case it should be simple to add such an
Enhancement engine.

Since my approach starts from HTML that is already annotated, I thought the idea made sense. Maybe I'm wrong because I'm not that deep into Stanbol. Please let me know whether my idea makes any kind of sense.

What I want to achieve is an index that stores the extracted entities
and also the document itself, with references to the entities related to this
document. It would be great if that could be done by configuration only.


If the htmlextractor engine adds extracted Entities to the metadata
of the ContentItem you can access them with the LDPath configuration
of the Contenthub.

I don't have any experience with LDPath, but it sounds like it would be easy to do what you wrote, and it would be worth spending some time experimenting with it. Is there a good starting point for working with LDPath together with Stanbol?

Maybe someone could lend me a hand with building the right enhancement chain
as a first step.


If you just want the Enhancer to extract Microdata you can create a
chain that only contains the htmlextractor engine.

Sounds straightforward to me!

What I have in mind is building a Solr search UI offering entity-based
autosuggestion, including a spellchecker and faceted search.
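For experimenting with such a chain over HTTP, a minimal sketch follows. It assumes a Stanbol instance at http://localhost:8080 and a chain that has been configured under the hypothetical name "microdata" containing only the htmlextractor engine; neither of these comes from the thread. It only builds the request for the Enhancer's chain endpoint and does not send it.

```python
import urllib.request


def build_enhancer_request(base_url, chain, html):
    """Build (but do not send) a POST request for a named enhancement chain.

    base_url and chain are assumptions for illustration, e.g.
    "http://localhost:8080" and "microdata".
    """
    url = f"{base_url.rstrip('/')}/enhancer/chain/{chain}"
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),  # the HTML document to be analysed
        headers={
            "Content-Type": "text/html; charset=UTF-8",
            "Accept": "text/turtle",  # ask for the enhancement results as RDF
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send the document to the running server and return the enhancement results as RDF.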


For Entity based autosuggestion you might want to use the Entityhub.
The rest should be possible by using the Contenthub.

As I said before, I urgently have to start some experiments ...


Thanks again.

On 16.03.2013 16:29, Walter Kasper wrote:

Dear Rüdiger,

The htmlextractor enhancement engine provides a microdata extractor that
should work well for schema.org annotations. Just test it with your data.
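Before testing against Stanbol it can help to check what the pages actually carry. The following is a rough, standalone sketch using only the Python standard library; it is not the htmlextractor engine and makes no claim about how that engine works. It is deliberately simplified: only text-valued or content/href-valued properties, no nested items, no itemref. The class name MicrodataSketch is made up for this example.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collect Microdata items as {"type": ..., "properties": {...}}.

    Simplified on purpose: every itemprop is attached to the most
    recently opened itemscope; nesting and itemref are ignored.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._prop = None  # itemprop whose text value is still pending

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype"), "properties": {}})
        if "itemprop" in attrs and self.items:
            self._prop = attrs["itemprop"]
            # Elements such as <meta> or <a> carry the value in an attribute.
            value = attrs.get("content") or attrs.get("href")
            if value is not None:
                self.items[-1]["properties"][self._prop] = value
                self._prop = None

    def handle_data(self, data):
        # The first non-blank text after an itemprop is taken as its value.
        if self._prop and data.strip():
            self.items[-1]["properties"][self._prop] = data.strip()
            self._prop = None
```

Feeding a page with parser.feed(html) and then inspecting parser.items gives a quick view of the itemtype and itemprop values that an extractor could be expected to find.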

Best regards,

Walter

Rüdiger Kurz wrote:


Hello Stanbolers,

I want to extract and then store entities from HTML documents that
use Microdata annotations based on the type hierarchy of schema.org
as the ontology. I appreciate any kind of approach, including the use of VIE.

Many thanks in advance
Rüdiger


--
Rüdiger Kurz

-------------------

Alkacon Software GmbH - The OpenCms Experts
Rüdiger Kurz
An der Wachsfabrik 13
50996 Koeln, DE

Tel: +49 (0)2236 3826-16
Fax: +49 (0)2236 3826-20
Email: [email protected]

http://www.alkacon.com
http://www.opencms.org

Managing Director: Alexander Kandzior, Amtsgericht Köln, HRB 54613
