Hi Rafa,

The documents in scope are HTML output of some application, with a repeating
structure. Think of multiple product catalog pages from different vendors that
you are trying to index.
One can define parts of these pages with XPath or CSS selectors and assign a
meaning to the string extracted from there. An example is a product catalog
page where company names and technology terms can be found and interpreted as
maker and specification.

The approach I intend to follow with the above example is:
1. mark a part of the source page with XPath or CSS selector expressions
2. define attributes (for example "engine_type") and source them with the
result of the XPath query
3. use Stanbol to resolve the "engine_type" string to an entity in the
entityhub
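To make steps 1 and 2 concrete, here is a minimal sketch in Python. The page markup, the selector paths, and the attribute names ("maker", "engine_type") are all made up for illustration; it only shows the idea of mapping selectors to named attributes:

```python
# Steps 1-2: map selector expressions to attribute names and extract strings.
# ElementTree's limited XPath subset is enough for this illustration.
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <div class="product">
    <span class="maker">Acme Motors</span>
    <span class="spec">V8 turbo</span>
  </div>
</body></html>
"""

# attribute name -> XPath-style selector (both invented for this example)
SELECTORS = {
    "maker": ".//span[@class='maker']",
    "engine_type": ".//span[@class='spec']",
}

def extract(page, selectors):
    """Return a record of attribute -> extracted string."""
    root = ET.fromstring(page)
    record = {}
    for attr, path in selectors.items():
        node = root.find(path)
        if node is not None and node.text:
            record[attr] = node.text.strip()
    return record

print(extract(PAGE, SELECTORS))
# {'maker': 'Acme Motors', 'engine_type': 'V8 turbo'}
```

In a real scraper the selector table would of course come from configuration per vendor, not be hard-coded.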

As for 3., I would use Stanbol. My original question is whether creating a
wrapper for an existing scraper engine and implementing 1. and 2. within
Stanbol in this way is an option I should consider, and if so, what the best
architecture would be.
I hope that makes more sense.
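For step 3, a lookup against the entityhub could go through its REST interface, for example the /entityhub/sites/find resource with a "name" parameter. The host, port, and limit below are assumptions (a default local Stanbol launcher), and this sketch only builds the request URL rather than performing the call:

```python
# Step 3 sketch: build an entityhub "find" request for an extracted string.
# Base URL is an assumption (default local Stanbol launcher); adjust as needed.
from urllib.parse import urlencode

STANBOL_BASE = "http://localhost:8080"

def entityhub_find_url(label, limit=5):
    """URL for looking up a label against all referenced sites."""
    query = urlencode({"name": label, "limit": limit})
    return f"{STANBOL_BASE}/entityhub/sites/find?{query}"

print(entityhub_find_url("V8 turbo"))
# http://localhost:8080/entityhub/sites/find?name=V8+turbo&limit=5
```

The response would then contain candidate entities whose URIs can be attached to the record built in steps 1 and 2.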

Thanks,
Barna
> Hi Barnabas,
>
> Could you provide more details about how your documents are structured and
> what would be the format of the strings extracted using the scraper?
>
> Regards,
> Rafa
>
> On 25/11/13 10:37, Barnabas Szasz wrote:
>> Dear Stanbol community,
>>
>> I have a use case where I need to extract metadata from structures (mostly
>> HTML or XML), where the position determines the meaning. Since entity
>> recognition is part of the task (so the extracted strings should be
>> resolved against vocabularies) I am considering Stanbol for this job.
>> Now the question is where a scraper (like scraperwiki.com) would fit in
>> such an architecture? Shall I implement a wrapper for the scraper as an
>> enhancer? In this case, if an engine adds an annotation to the document in
>> the chain, would the next engine in the chain be able to do entity
>> recognition on the annotation?
>> Or would you recommend a different approach?
>>
>> Thanks,
>> Barna

