Re: PDF Description Extraction For Linked data

Rafa Haro Thu, 29 May 2014 00:23:24 -0700

Hi Maatari,

On 29/05/14 02:27, Maatari Daniel Okouya wrote:

Rafa,
Many thanks for your elaborate answer.
It seems to me from your answer that I did not completely grasp the concepts behind Stanbol. Its primary purpose is semantically annotating the content of a file for the purpose of semantic search. Although one could divert from that by reusing the enhancing infrastructure to get the generated description and apply some SPARQL rules to get the description in the desired format, it is not geared toward linked data out of the box. What I mean is generating a description that you could publish as is, which is what I was looking for. As you say, the best match here is the description returned by the Topic Annotation Engine and maybe a few things extracted by Tika.

Well, the primary purpose or use case doesn't necessarily have to be Semantic Search. I would say that Stanbol helps with the task of extracting semantic metadata from content (semantic lifting). It is true that the most common form of metadata extraction is Entity Linking, and there is a reason for that: Stanbol was born as a tool for Content Management Systems, where companies are expected to manage domain vocabularies that can be used to enrich the enterprise content. Anyway, the enhancer has been modularized around extraction engines, so you can perfectly well implement an engine for your use case and take advantage of the Stanbol APIs to express your extracted metadata as RDF.

I mean, I still need to read a bit, but this is what I get for now from your explanation and my readings.
Am I close?

I think so :-). Cheers

Rafa

Best,
-M-
--
Maatari Daniel Okouya
Sent with Airmail
On 28 May 2014 at 13:46:00, Rafa Haro wrote:
Hi Maatari,

On 27/05/14 21:05, Maatari Daniel Okouya wrote:
> Hi,
>
> Completing my previous question, I think it would be better for me to give the bigger picture of what I’m trying to achieve.
>
>
> I have been charged with helping to disseminate the publication content of my organisation. Most of it is in PDF.
>
> Therefore, I need a process to produce a meaningful RDF description of our content that links as much as possible to the LOD cloud and LOV (Linked Open Vocabularies). Hence I need to use common core vocabularies as much as I can, i.e. Dublin Core, schema.org, BIBO, FOAF, etc., and reference entities from DBpedia, for instance.
>
> Searching around the web for how to automatically generate these descriptions, which would include creator, publisher, primaryTopic, subject, thematic, etc., it seems to me that Apache Stanbol is the best match.
With Stanbol you can enrich your content with your own vocabularies or
datasets from the LOD cloud, as long as you first import them as a site.
Let's say that the "out of the box" enrichment process consists of linking
pieces of text (like entities'/concepts' names/labels) with entities
within your datasets.
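
To make the idea of linking labels to entities concrete, here is a toy sketch in Python. It is not how Stanbol implements Entity Linking; the label dictionary and the function name are made up for illustration, and the dictionary only stands in for a vocabulary imported as a site.

```python
import re

# Hypothetical label -> entity URI mapping, standing in for a vocabulary
# imported as a site. A real dataset would have many labels per entity.
LABELS = {
    "climate change": "http://dbpedia.org/resource/Climate_change",
    "geneva": "http://dbpedia.org/resource/Geneva",
}


def link_entities(text, labels=LABELS):
    """Return (matched text, start offset, entity URI) for each label found."""
    matches = []
    for label, uri in labels.items():
        # Whole-word, case-insensitive match of the label in the text.
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.group(0), m.start(), uri))
    # Report mentions in the order they appear in the text.
    return sorted(matches, key=lambda item: item[1])
```

Calling `link_entities("A report on climate change, presented in Geneva.")` yields one mention per label, each with its offset and the URI it was linked to. A real engine would also deal with tokenization, ambiguity and ranking, which this sketch ignores.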
>
> So that’s it: in the first place I would like to automatically generate some rich description of my PDF publications. Not that rich, though. We are not yet planning on providing semantic search. It will probably come in the future.
I would say that what you need is not related to Entity Linking for now.
The closest resource that you can use in Stanbol for categorizing your
content in that way is the Topic Annotation Engine, which is able to
classify your content according to a pre-trained model using a certain
set of categories. Those categories should correspond to concepts from a
Stanbol site. Please note that things like primaryTopic, subject,
thematic... usually cannot be extracted without first training a
model with already annotated content. There are, of course,
unsupervised alternatives like Latent Semantic Analysis or Latent
Dirichlet Allocation that can be used to extract main terms as topics
for your content, but currently there is no support for those in Stanbol.
>
> However, for now I’m interested in providing some bibliographic data and stating the main topics of the publication, i.e. what it talks about, generally speaking.
If the PDFs have correct metadata, you can use Tika for extracting it.
Probably someone on the list can correct me but, as far as I know, the
current Tika engine in Stanbol is used to extract the content in order to
enrich it later, but it does not map the extracted metadata to RDF. I'm not
100% sure about this but, anyway, implementing it shouldn't be complex.
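
As a rough idea of what such a mapping could look like, here is a minimal Python sketch that turns already-extracted metadata into N-Triples using Dublin Core element properties. The input keys (`title`, `author`, `created`), the function names and the example document URI are assumptions for illustration only; they are not what Tika or Stanbol actually emit.

```python
DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from metadata keys (hypothetical names) to Dublin Core
# element properties. Adjust the keys to whatever your extractor returns.
FIELD_TO_PROPERTY = {
    "title": DC + "title",
    "author": DC + "creator",
    "created": DC + "date",
}


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def metadata_to_ntriples(doc_uri, metadata):
    """Serialize the known metadata fields as N-Triples about doc_uri."""
    lines = []
    for field, prop in FIELD_TO_PROPERTY.items():
        value = metadata.get(field)
        if value:  # skip missing or empty fields
            lines.append('<%s> <%s> "%s" .' % (doc_uri, prop, escape_literal(value)))
    return "\n".join(lines)
```

For example, `metadata_to_ntriples("http://example.org/doc/1", {"title": "Annual Report", "author": "Jane Doe"})` produces one triple per field, with the document URI you chose as the subject.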
>
> I will then deploy those descriptions in a SPARQL endpoint, use a frontend like Pubby, and do some content negotiation to redirect toward my PDF when requested. This also means that my descriptions need to have specific URLs that I provide them with.
In the 0.12 branch of Stanbol, there is a component called ContentHub
which is able to automatically store the content metadata as RDF along
with the enhancements, and which also provides a SPARQL endpoint. If you
are planning to store huge volumes of data, then probably the best idea is
to take the RDF response of the enhancer and store it in your own triple
store.
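
A minimal Python sketch of that second option: post plain text to the enhancer over HTTP and keep the RDF that comes back. It assumes a local Stanbol launcher at `http://localhost:8080` with the enhancer exposed at `/enhancer` and Turtle available as a response format; check both against your own installation. The function names are made up for illustration.

```python
import urllib.request


def build_enhancer_request(text, base_url="http://localhost:8080"):
    """Build a POST request that sends plain text to the enhancer endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/enhancer",  # assumed endpoint path
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",  # ask for the enhancements as Turtle
        },
        method="POST",
    )


def fetch_enhancements(text, base_url="http://localhost:8080"):
    """Send the text to a running Stanbol instance and return the RDF body."""
    with urllib.request.urlopen(build_enhancer_request(text, base_url)) as resp:
        return resp.read().decode("utf-8")
```

The string returned by `fetch_enhancements` can then be loaded into whatever triple store you use, for example through its bulk-load or HTTP upload facility.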
>
>
> Can anyone give me some pointers? Is it possible to do that with Stanbol and, if yes, how should I go about it? How do I configure the enhancer for that?
>
>
> Many thanks,
>
> -M-
>
>
> --
> Maatari Daniel Okouya
> Sent with Airmail
