Hi Barnabas,

In my opinion, having a scraper as a pre-processing component in an Enhancement Chain would be great. I suppose that, in such a chain, you would be creating a ContentItem for analysis with different parts, which AFAIK is not currently supported in the Enhancer. You would also need to adapt the current engines to work with "Multi-Content" ContentItems.

Anyway, considering your use case, it seems that you need to annotate rather than enrich the different parts of your documents. According to your example, the extracted attributes would consist of single names rather than running text, so in that case it may be better to use the EntityHub API directly to look up the entities for categorizing the attributes.
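
A direct lookup of this kind could be sketched as follows. This is only a minimal illustration in Python, assuming a Stanbol instance at http://localhost:8080 and the Entityhub RESTful "find" service that accepts "name" and "limit" form parameters; the base URL, the site name "dbpedia" and the helper names build_find_request and find_entities are assumptions made for the example, not part of your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Stanbol launcher listening on the default port.
STANBOL_URL = "http://localhost:8080"


def build_find_request(name, site=None, limit=5, base_url=STANBOL_URL):
    """Build a POST request for the Entityhub label lookup ("find") service.

    If `site` is given, only that referenced site is queried;
    otherwise the query goes to all referenced sites.
    """
    path = f"/entityhub/site/{site}/find" if site else "/entityhub/sites/find"
    body = urlencode({"name": name, "limit": limit}).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


def find_entities(name, **kwargs):
    """Send the lookup and return the parsed JSON response (needs a running server)."""
    with urlopen(build_find_request(name, **kwargs)) as response:
        return json.load(response)
```

Calling find_entities("some extracted string", site="dbpedia") would then return the candidate entities for one extracted attribute value, provided the server and the site exist.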

Hope that helps.

Regards,
Rafa

On 25/11/13 16:32, Barnabas Szasz wrote:

Hi Rafa,

the documents in scope are the HTML output of some application, with a 
repeating structure. Think of multiple product catalogs from different 
vendors that you are trying to index.
One can define parts of these pages with XPath or CSS selectors and assign 
a meaning to the string extracted from there. An example is a product catalog 
page where one can find company names and technology terms as maker and 
specification.

The approach I intend to follow with the above example is:
1. mark a part of the source page with XPath or CSS selector expressions
2. define attributes (for example “engine_type”) and populate them with the 
result of the XPath query
3. use Stanbol to resolve the “engine_type” string to an entity in the 
Entityhub
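
Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, which supports a limited XPath subset and requires well-formed markup; real HTML would need a more tolerant parser. The page fragment, the selector table and the function name extract_attributes are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a vendor catalog page.
PAGE = """
<html><body>
  <div class="product">
    <span class="maker">Acme Motors</span>
    <span class="engine">V8 diesel</span>
  </div>
</body></html>
"""

# Step 1 and 2: attribute name -> XPath expression marking its source.
SELECTORS = {
    "maker": ".//span[@class='maker']",
    "engine_type": ".//span[@class='engine']",
}


def extract_attributes(page, selectors):
    """Return {attribute: extracted string} for every selector that matches."""
    root = ET.fromstring(page.strip())
    result = {}
    for name, xpath in selectors.items():
        node = root.find(xpath)  # first matching element, or None
        if node is not None and node.text:
            result[name] = node.text.strip()
    return result


attributes = extract_attributes(PAGE, SELECTORS)
```

Each value in attributes is then a short string that step 3 would try to resolve to an entity.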

For step 3 I would use Stanbol. My original question is whether creating a 
wrapper for an existing scraper engine, and so implementing steps 1 and 2 
within Stanbol, is an option I should consider, and if so, what the best 
architecture would be.
I hope this makes more sense.

Thanks,
Barna

Hi Barnabas,

Could you provide more details about how your documents are structured and what the format of the strings extracted by the scraper would be?

Regards,
Rafa

On 25/11/13 10:37, Barnabas Szasz wrote:
> Dear Stanbol community,
>
> I have a use case where I need to extract metadata from structures (mostly HTML or XML), where the position determines the meaning. Since entity recognition is part of the task (so the extracted strings should be resolved against vocabularies), I am considering Stanbol for this job.
> Now the question is where a scraper (like scraperwiki.com) would fit in such an architecture. Should I implement a wrapper for the scraper as an enhancer? In this case, if an engine adds an annotation to the document in the chain, would the next engine in the chain be able to do entity recognition on that annotation?
> Or would you recommend a different approach?
>
> Thanks,
> Barna



