Many thanks, got it.
Best,
-M-

--
Maatari Daniel Okouya
Sent with Airmail

On 28 May 2014 at 06:36:52, Rupert Westenthaler (rupert.westentha...@gmail.com) wrote:

Hi Maatari,

On Tue, May 27, 2014 at 2:53 PM, Maatari Daniel Okouya <okouy...@yahoo.fr> wrote:
> Hi, thanks for your answer.
>
> I mean Topic Annotation.
>

Currently the only available Topic Classification engine in Stanbol is
the one described by [1]. As Stanbol does not ship with pre-trained
models (e.g. for IPTC or similar thesauri) you will need to train your
own models. [1] also provides an introduction on how to do that.

This year I am the mentor of a GSoC (Google Summer of Code) project that
is about defining a clear Topic Classification API [2] [3] and two
additional implementations of such engines.

> Ultimately what I would like to have is something like: { PDFuri
> foaf:primaryTopic London . } as a triple in the returned RDF.
>
> But for now, I don’t concern myself with using FOAF.
>

Topic Engines will always use fise:TopicAnnotation to describe
extracted topics. If you just want "{PDF-uri} foaf:primaryTopic
{topic-uri}" you can easily get this by taking the topics referenced
by fise:TopicAnnotation and linking them using foaf:primaryTopic
directly to the ContentItem.

> I just want to have the main topics of the PDF. I don’t necessarily want to
> extract all the entities etc.
>
> So maybe in terms of the annotations generated I would say having
> neither fise:EntityAnnotation nor fise:TextAnnotation but simply
> fise:TopicAnnotation
>

No problem, just configure an Enhancement Chain with:

* the tika engine: to extract plain text from the PDFs
* the langdetect engine: to detect the language (as an alternative you can
also pass the language by setting the Content-Language HTTP header in
requests)
* the topic engine configured with the model you trained.
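
A minimal sketch of the foaf:primaryTopic step described above, in Python. It assumes the enhancement results were already parsed into (subject, predicate, object) string triples, that the fise namespace is http://fise.iks-project.eu/ontology/, and that each fise:TopicAnnotation points to its topic via fise:entity-reference and to the ContentItem via fise:extracted-from, as in the enhancement structure documentation. Please verify these names against your Stanbol version. The function name and the example URIs are hypothetical.

```python
# Assumed namespaces and property names; check them against your Stanbol output.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FISE = "http://fise.iks-project.eu/ontology/"
FOAF_PRIMARY_TOPIC = "http://xmlns.com/foaf/0.1/primaryTopic"


def primary_topic_triples(triples):
    """Derive (content_item, foaf:primaryTopic, topic) triples.

    `triples` is an iterable of (subject, predicate, object) strings
    taken from the enhancement results (hypothetical input format).
    """
    triples = list(triples)

    # Subjects typed as fise:TopicAnnotation; all other enhancements are ignored.
    annotations = {s for s, p, o in triples
                   if p == RDF_TYPE and o == FISE + "TopicAnnotation"}

    topics = {}   # annotation -> set of topic URIs
    sources = {}  # annotation -> set of ContentItem URIs
    for s, p, o in triples:
        if s not in annotations:
            continue
        if p == FISE + "entity-reference":
            topics.setdefault(s, set()).add(o)
        elif p == FISE + "extracted-from":
            sources.setdefault(s, set()).add(o)

    # Link every topic directly to the ContentItem it was extracted from.
    result = set()
    for annotation in annotations:
        for item in sources.get(annotation, ()):
            for topic in topics.get(annotation, ()):
                result.add((item, FOAF_PRIMARY_TOPIC, topic))
    return sorted(result)


if __name__ == "__main__":
    # Hypothetical enhancement results for one PDF.
    example = [
        ("urn:enhancement:1", RDF_TYPE, FISE + "TopicAnnotation"),
        ("urn:enhancement:1", FISE + "entity-reference", "http://example.org/topic/London"),
        ("urn:enhancement:1", FISE + "extracted-from", "urn:content-item:pdf-1"),
        ("urn:enhancement:2", RDF_TYPE, FISE + "TextAnnotation"),
        ("urn:enhancement:2", FISE + "extracted-from", "urn:content-item:pdf-1"),
    ]
    for triple in primary_topic_triples(example):
        print(triple)
```

A topic engine may return several topics per document. If only one main topic is wanted, the fise:confidence value of each annotation could be used to keep the best one; that selection is not part of this sketch.
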

best
Rupert

[1] http://www.iks-project.eu/sites/default/files/Topic-Classification.pdf
[2] http://furkankamaci.com/gsoc-2014-acceptance-apache-stanbol/
[3] https://issues.apache.org/jira/browse/STANBOL-1294

>
> --
> Maatari Daniel Okouya
> Sent with Airmail
>
> On 27 May 2014 at 13:08:38, Rupert Westenthaler
> (rupert.westentha...@gmail.com) wrote:
>
> On Tue, May 27, 2014 at 12:49 PM, Maatari Daniel Okouya
> <okouy...@yahoo.fr> wrote:
>> Hi,
>>
>> I have just started to use Apache Stanbol. I’m still playing around with it
>> to figure out everything that is out there. However, I’m puzzled by one
>> thing. I would like to configure it such that upon uploading a text or a PDF
>> document, an RDF containing only the topic of the PDF shall be returned.
>>
>
> What do you mean by "topic"? In case of PDF files the Tika Engine [1]
> can extract metadata. Such metadata are directly added to the URI of
> the contentItem and do not use FISE.
>
>> I’m scratching my head but I don’t see how to do so. What is the engine that
>> is supposed to produce <<Fise:Annotation>>
>>
>
> All Stanbol Engines do generate FISE enhancements
> (fise:TextAnnotation, fise:EntityAnnotation and fise:TopicAnnotation)
>
> When you look at the list of engines [2]
>
> * Language Detection engines create a fise:TextAnnotation describing
> the language of the document (?la dc:type dc:LinguisticSystem; ?la
> dc:language ?lang)
> * Named Entity Recognition (NER) Engines create fise:TextAnnotations
> for Entities recognized by the NLP framework.
> * Linking / Suggestions create fise:EntityAnnotation for Entities
> found in the text. They might also add fise:TextAnnotation to mark the
> exact mention of such entities in the text.
> * Topic Classification engines use fise:TopicAnnotation to describe
> assigned topics. They also use a fise:TextAnnotation to mark the part
> of the text the topic is assigned to
>
>> as described in
>> http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html
>>
>
> Yep, this page describes the annotations as created by the EnhancementEngines.
>
> Without knowing what you mean by " ... only the topic of the pdf ..."
> I cannot recommend suitable Stanbol configurations to you.
>
> best
> Rupert
>
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/tikaengine
> [2] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list
>
>> I would appreciate it if someone could provide me with some pointers.
>>
>> Many thanks,
>>
>> Maatary
>>
>> --
>> Maatari Daniel Okouya
>> Sent with Airmail

--
| Rupert Westenthaler rupert.westentha...@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/