Hi,

Most engines in Stanbol can only handle plain text. To support other formats we use the Tika engine, which converts binary formats like PDF into plain text.
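For the plain-text case, a client would fetch the content itself and then send the text to the enhancer over HTTP. Below is a minimal sketch of building such a request. It assumes a Stanbol launcher running locally with the enhancer at http://localhost:8080/enhancer and Turtle as the response format. The helper name `build_enhancer_request` is made up for illustration, and the request is only constructed, not sent.

```python
import urllib.request

# Assumption: a local Stanbol launcher exposing the enhancer at this URL.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain text.

    Hypothetical helper: the content type tells the server that the
    body is plain text, the kind of input most engines can handle.
    """
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",  # assumed response serialization
        },
        method="POST",
    )


request = build_enhancer_request("Paris is the capital of France.")
# To actually send it, a running server is needed:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

A binary format such as PDF would be posted with its own content type instead, so that a conversion step like the Tika engine can turn it into plain text first.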
I do not know how the Tika engine currently handles HTML content. We have discussed in the past that Stanbol should be able to receive RDFa-annotated HTML, strip off the HTML tags, enhance the text, re-add the HTML tags, and add the new enhancements as RDFa while preserving the existing RDFa. The existing RDFa could perhaps also serve as important input for some engines, since it is a case where metadata already exists that Stanbol could use. But such a cool feature would require a new engine.

Best,
- Fabian

2012/12/11 David Riccitelli <[email protected]>

> Hello,
>
> Does Stanbol currently support the analysis of the content of a URL?
>
> If yes, how does this work according to the different content types, e.g.:
> 1. for text/plain does it fetch and analyse the whole text?
> 2. for text/html does it fetch and analyse only the TITLE and the BODY
>    (stripped of the HTML tags)?
> 3. are other content types supported?
>
> Thanks,
> David

--
Fabian
http://twitter.com/fctwitt
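The strip, enhance, re-add idea described in the reply could be sketched roughly as follows. This is only an illustration, not existing Stanbol code: the function names `strip_tags` and `annotate` are hypothetical, the regex-based tag matching is deliberately naive (a real engine would use a proper HTML parser), and the enhancement step is faked by looking up a known word. The key point is keeping a map from each plain-text character back to its position in the HTML, so an enhancement found in the text can be written back as RDFa without touching the existing markup.

```python
import re

# Naive tag pattern, good enough for a sketch only.
TAG = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Return (text, offsets); offsets[i] is the index in html of text[i]."""
    chars, offsets = [], []
    pos = 0
    for match in TAG.finditer(html):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = match.end()
    # Copy any text after the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets


def annotate(html, offsets, start, end, about):
    """Wrap text[start:end] in a span with an RDFa `about` attribute.

    Assumes the annotated range does not cross a tag boundary.
    """
    h_start = offsets[start]
    h_end = offsets[end - 1] + 1
    return (html[:h_start] + '<span about="' + about + '">'
            + html[h_start:h_end] + "</span>" + html[h_end:])


html = "<p>Visit <b>Paris</b> today.</p>"
text, offsets = strip_tags(html)
# Stand-in for an engine that found the entity "Paris" in the plain text.
start = text.index("Paris")
result = annotate(html, offsets, start, start + len("Paris"),
                  "http://dbpedia.org/resource/Paris")
print(result)
```

Existing RDFa attributes survive because the original HTML string is never rebuilt from the plain text. Only the new span is spliced in.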
