Hi,

Most engines in Stanbol can only handle plain text. To support other
formats we use the Tika engine, which converts binary formats such as PDF
into plain text.

I do not know what happens with HTML content right now in the Tika engine.

We have discussed in the past that Stanbol should be able to accept
RDFa-annotated HTML, strip off the HTML tags, enhance the text, re-add the
HTML tags, and add the new enhancements as RDFa while preserving the
existing RDFa. The existing RDFa could also serve as important input for
some engines, since it is metadata that already exists and that Stanbol
could use. But such a cool feature would require a new engine.
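
To make the strip / enhance / re-add idea a bit more concrete, here is a
rough Python sketch. It is not Stanbol code and does not use any Stanbol
API. The names strip_tags and wrap_mention are made up for illustration,
the regex tag matcher is a naive stand-in for a real HTML parser, and the
"resource" attribute on the span is just one possible way to write the
enhancement as RDFa. It only shows how plain-text offsets could be mapped
back into the original markup so that existing attributes stay untouched.

```python
import re

# Naive tag matcher (assumption: good enough for a sketch; a real engine
# would use a proper HTML parser and handle entities, comments, scripts).
TAG = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Return (plain_text, segments).

    Each segment is (plain_start, source_start, length) and records where
    a run of text sits in the plain text and in the original HTML.
    """
    parts, segments = [], []
    plain_pos = src_pos = 0
    for tag in TAG.finditer(html):
        chunk = html[src_pos:tag.start()]
        if chunk:
            segments.append((plain_pos, src_pos, len(chunk)))
            parts.append(chunk)
            plain_pos += len(chunk)
        src_pos = tag.end()
    tail = html[src_pos:]
    if tail:
        segments.append((plain_pos, src_pos, len(tail)))
        parts.append(tail)
    return "".join(parts), segments


def wrap_mention(html, segments, start, end, uri):
    """Wrap the plain-text range [start, end) in a span carrying the URI.

    Only handles mentions that lie inside a single text segment; a mention
    that crosses a tag boundary raises ValueError.
    """
    for plain_start, src_start, length in segments:
        if plain_start <= start and end <= plain_start + length:
            a = src_start + (start - plain_start)
            b = src_start + (end - plain_start)
            return (html[:a] + '<span resource="' + uri + '">'
                    + html[a:b] + "</span>" + html[b:])
    raise ValueError("mention crosses a tag boundary")
```

For example, strip_tags would hand the text "Paris is nice" to the engines
for the input '<p property="dc:title">Paris is <b>nice</b></p>', and a
mention found at offsets 0 to 5 could then be written back with
wrap_mention without touching the existing property attribute.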

Best,
 - Fabian


2012/12/11 David Riccitelli <[email protected]>

> Hello,
>
> Does Stanbol currently support the analysis of the content of a URL?
>
> If yes, how does this work according to the different content types, e.g.:
>  1. for text/plain does it fetch and analyse the whole text?
>  2. for text/html does it fetch and analyse only the TITLE and the BODY
> (stripped of the HTML tags)?
>  3. are other content types supported?
>
> Thanks,
> David
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network
> <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>
> ********************************************************************************
>



-- 
Fabian
http://twitter.com/fctwitt
