Re: PDF Description Extraction For Linked data

Rafa Haro Wed, 28 May 2014 04:46:19 -0700

Hi Maatari,

On 27/05/14 21:05, Maatari Daniel Okouya wrote:

Hi ,


To complete my previous question, I think it would be better for me to give
the bigger picture of what I'm trying to achieve.


I have been charged with helping to disseminate the content of my
organisation's publications. Most of them are in PDF.

Therefore, I need a process to produce a meaningful RDF description of our
content that links as much as possible to the LOD cloud and LOV (Linked Open
Vocabularies). Hence I need to use common core vocabularies as much as I can,
i.e. Dublin Core, schema.org, BIBO, FOAF, etc., and reference entities from
DBpedia, for instance.

I searched the web for how to automatically generate these descriptions,
which would include creator, publisher, primaryTopic, subject, thematic, etc.
It seems to me that Apache Stanbol is the best match.

With Stanbol you can enrich your content with your own vocabularies or with datasets from the LOD cloud, as long as you first import them as a site. The "out of the box" enrichment process consists of linking pieces of text (such as the names or labels of entities and concepts) with entities within your datasets.


So that's it: in the first place I would like to automatically generate a
description of my PDF publications, though not a very rich one. We are not
yet planning on providing semantic search. It will probably come in the future.

I would say that what you need is not related to Entity Linking for now. The closest resource that you can use in Stanbol for categorizing your content in that way is the Topic Annotation Engine, which is able to classify your content according to a pre-trained model using a certain set of categories. Those categories should correspond to concepts from a Stanbol site. Please note that things like primaryTopic, subject, thematic, etc. usually cannot be extracted without first training a model on already annotated content. There are, of course, unsupervised alternatives like Latent Semantic Analysis or Latent Dirichlet Allocation that can be used to extract main terms as topics for your content, but currently there is no support for those in Stanbol.


However, for now I'm interested in providing some bibliographic data and
stating the main topics of the publication, i.e. what it talks about,
generally speaking.

If the PDFs have correct metadata, you can use Tika to extract it. Someone on the list can probably correct me but, as far as I know, the current Tika engine in Stanbol is used to extract the content so that it can be enriched later, and it does not map the extracted metadata to RDF. I'm not 100% sure about this but, in any case, implementing it shouldn't be complex.
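
To give an idea of how small the metadata-to-RDF mapping could be, here is a minimal Python sketch. It assumes the metadata has already been extracted into a plain dictionary; the keys ("title", "creator", "publisher", "date") and the function names are hypothetical and are not the keys Tika or Stanbol actually emit, so they would have to be adapted. It writes N-Triples using the Dublin Core elements namespace.

```python
# Sketch only: map an already-extracted metadata dict to N-Triples.
# The dictionary keys below are illustrative assumptions, not Tika's real keys.
DC = "http://purl.org/dc/elements/1.1/"

KEY_TO_PROPERTY = {
    "title": DC + "title",
    "creator": DC + "creator",
    "publisher": DC + "publisher",
    "date": DC + "date",
}


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def metadata_to_ntriples(doc_uri, metadata):
    """Return one N-Triples line per known, non-empty metadata field."""
    lines = []
    for key, prop in KEY_TO_PROPERTY.items():
        value = metadata.get(key)
        if not value:
            continue  # skip missing or empty fields
        lines.append('<%s> <%s> "%s" .' % (doc_uri, prop, escape_literal(value)))
    return "\n".join(lines)


if __name__ == "__main__":
    example = {"title": "Annual Report", "creator": "Jane Doe"}
    print(metadata_to_ntriples("http://example.org/publication/1", example))
```

The document URI is passed in by the caller, so each description can be given whatever URL you decide to publish it under.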


I will then deploy those descriptions in a SPARQL endpoint, use a frontend
like Pubby, and do some content negotiation to redirect to my PDF when
requested. This also means that my descriptions need to have specific URLs
that I assign to them.

In the 0.12 branch of Stanbol, there is a component called ContentHub which is able to automatically store the content metadata as RDF along with the enhancements, and which also provides a SPARQL endpoint. If you are planning to store huge volumes of data, the best idea is probably to take the RDF response of the enhancer and store it in your own triple store.
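
As a rough illustration of taking the RDF response of the enhancer, the Python sketch below builds a plain-text POST request asking for Turtle back. It assumes a Stanbol launcher running locally with the enhancer at http://localhost:8080/enhancer; that URL, the chosen RDF format, and the function names are assumptions to adapt to your own installation.

```python
import urllib.request

# Assumption: a local Stanbol launcher; change this to match your deployment.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=ENHANCER_URL, rdf_format="text/turtle"):
    """Build (but do not send) a POST request carrying the text to enhance."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": rdf_format,  # serialization wanted for the enhancements
        },
        method="POST",
    )


def enhance(text, url=ENHANCER_URL):
    """Send the text and return the RDF response as a string.

    This needs a running server, so it is not called at import time.
    """
    with urllib.request.urlopen(build_enhancer_request(text, url)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    request = build_enhancer_request("Paris is the capital of France.")
    print(request.get_method(), request.full_url)
```

The string returned by `enhance` could then be loaded into whichever triple store you choose; that loading step depends on the store and is left out here.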



Can anyone give me some pointers? Is it possible to do that with Stanbol? If
yes, how should I go about it? How do I configure the enhancer for that?


Many thanks,

-M-


--
Maatari Daniel Okouya
Sent with Airmail
