Re: analysis of a URL

Rupert Westenthaler Tue, 11 Dec 2012 05:01:26 -0800

Hi,

With the Multipart ContentItem API it is already possible to pass
existing metadata to the Stanbol Enhancer. There is even an example of
how to do this in the documentation of the RESTful API of the
Enhancer [1].


STANBOL-660 is very specific, as it only asks to support passing the
language by setting the "Content-Language" header of an enhancement
request. One option to implement that would be to (1) forward
all/selected header fields as EnhancementProperties and (2) implement an
EnhancementEngine that takes the "Content-Language" header and
converts it to a fise:TextAnnotation defining the language of the
content (as defined by STANBOL-613).
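
To make the two steps a bit more concrete, here is a rough sketch in
plain Python. It does not use the Stanbol API at all: the function
names, the "urn:enhancement:" identifiers and the tuple-based triples
are made up for illustration, and the property choices (dc:language
plus dc:type dc:LinguisticSystem on a fise:TextAnnotation) are my
reading of STANBOL-613, so please treat them as assumptions. A real
implementation would be a Java EnhancementEngine.

```python
# Illustration only: a stand-in for the proposed engine logic, NOT the Stanbol API.
# All function names and the urn: scheme below are hypothetical.
import uuid

FISE = "http://fise.iks-project.eu/ontology/"
DC = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_content_language(header_value):
    """Return the language tags listed in a Content-Language header value."""
    if not header_value:
        return []
    # The header may list several comma-separated tags, e.g. "de-AT, en".
    return [tag.strip() for tag in header_value.split(",") if tag.strip()]


def language_annotation(content_item_uri, language):
    """Build (subject, predicate, object) triples for one language annotation."""
    annotation = "urn:enhancement:" + str(uuid.uuid4())
    return [
        (annotation, RDF_TYPE, FISE + "TextAnnotation"),
        (annotation, FISE + "extracted-from", content_item_uri),
        # Assumed shape of a language annotation (my reading of STANBOL-613).
        (annotation, DC + "type", DC + "LinguisticSystem"),
        (annotation, DC + "language", language),
    ]


def annotations_from_headers(content_item_uri, headers):
    """Step (1): take the forwarded headers. Step (2): convert Content-Language."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    triples = []
    for tag in parse_content_language(normalised.get("content-language")):
        triples.extend(language_annotation(content_item_uri, tag))
    return triples
```

Calling annotations_from_headers("urn:content-item:1",
{"Content-Language": "de"}) yields four triples describing a single
language annotation; a request without the header yields none, so
other engines would fall back to language detection as usual.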

@David: Passing URLs instead of content to the Stanbol Enhancer is
supported in the Java API by using ContentReferences [2]
to construct a ContentItem. AFAIK it is not possible to use this
feature via the RESTful API of the Enhancer. However, it is used by the
Contenthub. If you have a strong use case for passing URLs instead of
content to the Enhancer, please feel free to open a JIRA issue
for that.

best
Rupert


[1] http://stanbol.apache.org/docs/trunk/components/enhancer/enhancerrest.html#example-4-parse-existing-free-text-annotations
[2] http://stanbol.apache.org/docs/trunk/components/enhancer/contentitemfactory.html#ContentReference

On Tue, Dec 11, 2012 at 1:40 PM, Fabian Christ
<[email protected]> wrote:
> Hi,
>
> I just came across STANBOL-660, which requires sending the language of the
> content as additional information to Stanbol.
>
> We may need to define a standard way to pass existing metadata to
> Stanbol.
>
> [1] https://issues.apache.org/jira/browse/STANBOL-660
>
>
> 2012/12/11 Fabian Christ <[email protected]>
>
>> Hi,
>>
>> Most engines in Stanbol can only handle plain text. To support other
>> formats we use the Tika engine, which converts binary formats like PDF into
>> plain text.
>>
>> I do not know what happens with HTML content right now in the Tika engine.
>>
>> We had discussions in the past that Stanbol should be able to receive
>> RDFa-annotated HTML, strip off the HTML tags, enhance the text, re-add the
>> HTML tags and add the new enhancements as RDFa while preserving the existing
>> RDFa. Maybe the existing RDFa could also be used as an important input for
>> some engines. This is a case where some metadata already exists that could
>> be used by Stanbol. But such a cool feature would require a new engine.
>>
>> Best,
>>  - Fabian
>>
>>
>> 2012/12/11 David Riccitelli <[email protected]>
>>
>>> Hello,
>>>
>>> Does Stanbol currently support the analysis of the content of a URL?
>>>
>>> If yes, how does this work according to the different content types, e.g.:
>>>  1. for text/plain does it fetch and analyse the whole text?
>>>  2. for text/html does it fetch and analyse only the TITLE and the BODY
>>> (stripped of the HTML tags)?
>>>  3. are other content types supported?
>>>
>>> Thanks,
>>> David
>>>
>>> --
>>> David Riccitelli
>>>
>>>
>>
>>
>>
>> --
>> Fabian
>> http://twitter.com/fctwitt
>>
>
>
>
> --
> Fabian
> http://twitter.com/fctwitt



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
