Hi Juha,

As these are interesting questions, I am also forwarding the mail to the kim-discussion mailing list. My answers are inline.

On Jan 24, 2011, at 7:46 PM, <[email protected]> <[email protected]> wrote:

> A couple of other questions:
>
> In section 3 of the same document, first example-box - why the label of the
> alias "wkb:Robot_R2D2_1" is "R2D2" i.e. exactly the same as the label for
> main Class "wkb:Robot_R2D2" ? I thought that the whole idea of having aliases
> is to have also different labels for them. And that would allow the Gazetteer
> to collect those different labels and annotate accordingly? Maybe label and
> hasMainAlias together is redundant and only hasAlias would be enough? Is
> this duplication required or just a typo? The text under bullet-point "Source
> (generatedBy)" implies that entities must have at least one alias to be
> included in the dictionary. Does that mean that just having a label is not
> enough?
>

We are touching two separate concepts here. It is generally advisable for all classes to have labels; this is good RDF design practice. The label stores the human-readable form of the class.

As for the gazetteer, two different models are used: labels and aliases.
- Labels have the advantage of being much lighter and simpler. Labels are sufficient in most cases to express your knowledge. This is the preferred model.
- Aliases, on the other hand, are much heavier, as each alias is a separate instance itself. This model is used if there is a need to store some metadata for the labels, for example for multilingual support.

Labels are used for visualization, so they are recommended. Whether the instances will have aliases on top of that depends on the model of the gazetteer that will be used.

> In section 3, under the first example-box, you have the notation
> "wkb:Robot_R2D2.1" - shouldn't the period be replaced by underscore?

This is a naming convention we use. The local name of the instance's URI is formed from the name of its class and the main label of the instance. The instance may have multiple aliases, whose URIs are formed from the name of the instance with .number appended at the end. This particular URI means that this is the first alias of the R2D2 instance of class Robot. As the documentation page says: "The URIs of the labels, like wkb:Robot_R2D2.1, don't need to be in that exact format, ending in .<number>. They only need to be unique." This also holds for the URIs of the instances.

>
> The notation under the next example box, where you are referring to
> http://www.ontotext.com/kim/2006/05/wkb#Robot_T.1 is confusing to me. Is
> this just a way of stating that "Robot" is a trusted entity? If that is the
> case, where should this statement appear?
>
> The box after this URI seems to be somewhat in contradiction with the Case
> Study (DBpedia in KIM). Should it have another statement declaring rdf:type
> //proton#Trusted or something?
>
> This small part just before section 4 is overall a bit unclear to me and I am
> wondering if it is missing something... I mean, "generatedBy" seems to serve
> a different purpose than the property "Trusted". It might be helpful to spell
> these out and maybe explain how exactly each one is being used by the
> Gazetteer.

Maybe the documentation is a little bit confusing. The only requirement for an entity to be marked as trusted is that the entity is generated by a source which is of type protons:Trusted. In the example:

wkb:Robot_R2D2 protons:generatedBy wkb:Gazetteer .

If you look at the RDF for wkb:Gazetteer you will notice:

<http://www.ontotext.com/kim/2006/05/wkb#Gazetteer> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://proton.semanticweb.org/2006/05/protons#Trusted> .

There are some other trusted sources, defined at the top of wkb.nt. You can also define your own trusted sources.

>
> Now I have to go home to mind my son but I will be back tomorrow with some
> more questions. (the case-study is not entirely clear to me)
>
> Cheers,
>
> Juha
>

Hope this helps.

Your feedback is valuable to us and helps us improve the documentation.

all the best

Philip Alexiev
Software Engineer, KIM team

> From: Philip Alexiev @ Ontotext [mailto:[email protected]]
> Sent: Monday, January 24, 2011 4:34 PM
> To: JUNTTILA Juha (SANCO)
> Cc: [email protected]; [email protected]
> Subject: Re: Update
>
> Hi Juha,
>
> There seems to be a mistake in the documentation. Thank you for pointing it
> out, and excuse us if it caused any difficulties. The Robot class, as
> described in the RDF, will be a subclass of Object. I will correct this
> now.
>
> Greetings,
> Philip
>
> On Jan 24, 2011, at 5:29 PM, <[email protected]>
> <[email protected]> wrote:
>
>> Hi Philip,
>>
>> I am studying the first document on your list below and trying to understand
>> it before I go ahead banging my head on a brick-wall. One quick question:
>> How come the introduction says that the Class "Robot" extends Class
>> "Device". I cannot get that from the example... (maybe I am missing
>> something or just being illiterate) It seems to me that Robot is extending
>> Class "Object" - or maybe even "Entity" i.e. the top class/entity of Proton
>> top-module. Is there some kind of a trick there?
>>
>> Best regards,
>>
>> Juha
>>
>> From: Philip Alexiev @ Ontotext [mailto:[email protected]]
>> Sent: Thursday, January 20, 2011 8:40 AM
>> To: JUNTTILA Juha (SANCO)
>> Cc: borislav popov; [email protected]
>> Subject: Re: Update
>>
>> Hi Juha,
>>
>> Extending the information extraction of KIM is a common task and is
>> documented in several places. If you plan to put some effort into it, you
>> should definitely go through these sources:
>> - http://ontotext.com/kim/doc/KimDocs-3.0-EN/ExtendInformationExtraction.html
>> - http://ontotext.com/kim/doc/KimDocs-3.0-EN/CaseStudy-IntegrationDbPedia.html
>>
>> Of course we will provide additional help and explanations.
>>
>> I will write some more comments inline
>>
>>
>> On Jan 18, 2011, at 2:45 PM, <[email protected]>
>> <[email protected]> wrote:
>>
>>> Thanks a million Borislav,
>>>
>>> That was a quick and enlightening reply. Maybe we can start with very small
>>> and modest steps. If I could just add one simple extension (entity) to the
>>> knowledge base, get (at least) one of my own documents processed and see
>>> that it has been added to the repository + my new entity has been
>>> recognised. After that I would have a much better understanding of how to
>>> proceed and how much work it implies. Do you think that might work?
>>>
>>> I have added some questions inline.
>>>
>>> -----Original Message-----
>>> From: borislav popov [mailto:[email protected]]
>>> Sent: Tuesday, January 18, 2011 10:47 AM
>>> To: JUNTTILA Juha (SANCO)
>>> Cc: philip; Georgi D. Georgiev
>>> Subject: Re: Update
>>>
>>> Hi Juha,
>>> Philip will probably also answer or add, but let me take a shot:
>>> I'll insert replies inline - seems more appropriate in this case:
>>>
>>> On Jan 18, 2011, at 12:05 PM, <[email protected]>
>>> <[email protected]> wrote:
>>>
>>> > Good morning Philip,
>>> >
>>> > My goal is to set up a system for faceted semantic search of
>>> > documents. It seems that KIM would be exactly the tool I have been
>>> > looking for. The problem is that the concept/entities we are mainly
>>> > interested in are quite different from what has been implemented in
>>> > KIM. For example, we are very rarely interested in people and the
>>> > GATE pipeline annotating organisations is not able to recognize the
>>> > organizations we are interested in.
>>> >
>>>
>>> right.
>>>
>>> > My primary interest for the last year or so has been to customize
>>> > KIM for our purposes and if you remember, we were trying to start
>>> > with a pilot on food safety legislation. We didn't really get this
>>> > moving on, mainly because I had difficulties in installing KIM. Now
>>> > when that problem has been solved and I have KIM up and running we
>>> > might finally make some progress.
>>> >
>>> > Initially I thought (naively) that it is as simple as replacing the
>>> > ie.gapp by my own processing pipeline. But after examining the
>>> > application it seems to me that it is a bit more than that. Am I
>>> > right? The problem is that I would need to:
>>> > 1. extend the Proton ontologies with our domain specific ontology
>>> > 2. add my own gazetteers which would link to the domain ontology
>>> > 3. add/modify the processing pipelines to better recognise entities
>>> > that are relevant to our domain
>>> >
>>>
>>> You are right. Especially in the case in which you are interested in
>>> other entities, we recommend to shape known entities (e.g. known
>>> organisations) you are interested in, as a knowledge base extension.
>>> basically saying things like:
>>> EC is an Org
>>> EC has label European Commission
>>> EC has label EC
>>> EC is active in BG, BE, DE, whatever.
>>> whatever additional facts are interesting for your use case.
>>>
>>> Yes, in the first place we could add for example three organisations:
>>> Commission, Member State and Competent Authority. That would be simple
>>> enough and all we need for annotating the legal texts (these are by far the
>>> most common organisations that appear in our legislation). This
>>> can be further extended. I suppose the labels will automatically feed into
>>> gazetteer lists?
>>>
>>> At this point of time, I am not quite sure how to do these extensions. I
>>> mean, it is obvious that these are RDF statements that need to be somehow
>>> linked to Proton but I would need some advice on how to do it...
>>>
>> The sources I gave show in detail exactly how this is done.
>>
>>> after this it will be quite straightforward to get these things
>>> recognized and the search working. Fine tuning like - unknown entity
>>> recognition or ambiguity handling and so on - need to be addressed but
>>> later.
>>>
>>> When deciding what you put in the knowledge base and what needs to be
>>> annotated please consider:
>>> - the queries your users expect
>>> I will come back to this later but
>>> basically this is about finding mentions of the main entities (see below)
>>> appearing close to each other (faceted search)
>>> - the structured data you already have access to.
>>> What kind of structured
>>> data do you mean?
>> Data like - lists of authorities, lists of foods etc. You probably don't
>> have them as RDF, but as plain text lists.
>>>
>>> i wouldn't go directly into modeling it as RDF, but keep it in tables
>>> until it is crystal clear what i have in my hands and then
>>> semi-automatically transform this into RDF.
>>>
>>> I am not quite sure what you mean by keeping it in tables. Any chances of
>>> using Protege for modelling? Can I do the extensions myself or should I
>>> rely on your help?
>> Keeping it in plain tables (or lists) in a textual document for example.
>> Tables are easy to modify and give you the whole picture at a glance. That
>> way you can model your data design with less effort. Unfortunately ontology
>> exploring/modeling software is not quite mature yet and using it to build
>> the ontology from scratch may be cumbersome. That is why it is generally a
>> better idea to clear the concept and then start putting data into RDF.
>>>
>>> > My perception is that IE.gapp is not entirely de-coupled from the
>>> > rest of KIM and it would take a considerable amount of work to get
>>> > things right. Is that right?
>>> >
>>>
>>> this is true - nowadays mainly in the final parts of the pipeline -
>>> where you are inserting what you found as entities, facts and
>>> relationships between the entities and documents as RDF statements in
>>> the index/semantic repository.
>>> This can be avoided - basically you can get/create any GATE compliant
>>> pipeline and wrap it inside. In fact it does not even need to be GATE
>>> compliant - but wrapped as such.
>>>
>>> Does this mean that once I have the KB extended and a simple pipeline
>>> wrapped in, I could see something new in KIM Web UI?
>>>
>>> I will definitely need a little bit more guidance how to do this wrapping...
>>>
>> This means that you can choose among different approaches. You can base your
>> IE on the existing KIM pipeline. Or you can create your pipeline from
>> scratch. You can even create it with GATE and run it in KIM (you will
>> probably need to add some resources/plugins to KIM). This is possible because
>> the information extraction we use is based on GATE. You are even able to run
>> the pipeline KIM uses with GATE, by running KIM/bin/kim gate .
>>
>>>
>>> > Another thing is that Populator requires an XML file for each text -
>>> > this is an extra burden and it would be nice to get around that
>>> > requirement. Is that absolutely necessary? There is another "gapp"
>>> > file in same directory with IE.gapp, whose name kind of hinted that
>>> > there might be another way...
>>>
>>> There is another way of course. First, if you do not need the metadata
>>> in the XML, there is no need to create these. This is just one way. In many
>>> cases we are not using these XMLs.
>>> Alternatively you can obtain a stub client code from us to be able to
>>> get your docs from wherever they are - e.g. the Web and pass them with
>>> metadata you decide to the KIM Server API for annotation and indexing.
>>> You can do this via the Java API or through a Web service we can
>>> provide. If the metadata is embedded in the document, e.g. HTML meta
>>> tags in web pages - you can add specific handling inside of the
>>> corresponding file format wrapper in GATE / KIM.
>>>
>>> As regards legal texts, there is definitely some metadata that would be
>>> useful to extract. The thing is that I definitely don't want to do it
>>> manually...
>>> See for example, the following:
>>> http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32004R0882R(01):EN:HTML
>>> At least the meta-tags containing "DC.content", "DC.source" and
>>> "DC.description" would probably be useful.
>>> >
>>> > I chose legal texts as the first example for the specific reason
>>> > that I wouldn't need to start with tweaking a JAPE grammar to
>>> > recognize organisations i.e. Competent Authorities. The main
>>> > entities of interest in legal texts for faceted search are:
>>> > - hazards: chemical, biological and physical agents causing health
>>> > risks;
>>> > - activities: milling, heating, farming, rearing, slaughtering etc.;
>>> > - operators: animal keepers, feed producers, importers, wholesalers,
>>> > caterers etc;
>>> > - requirements: labelling, record keeping, own-controls, sampling,
>>> > monitoring etc.;
>>> > - production stages: primary production, processing, transport,
>>> > storage etc.; and
>>> > - commodities: meat, milk, vegetables, fish, eggs, cereals, nuts etc.
>>> >
>>> > These could be regarded as super-classes, which then have a number
>>> > of sub-classes each and would be linked by various properties.
>>>
>>> this is an ultimately exciting picture you are drawing. For all of
>>> these you can get entities and their synonyms, although it looks huge.
>>>
>>> I have already done some work on this and collected vocabulary for
>>> gazetteer-lists. Is the current KIM version using external gazetteer lists
>>> or extracting the terms from rdf:labels?
>>>
>>>
>>>
>> The gazetteer fills its lists from the semantic repository and it
>> automatically puts semantic metadata into the annotations. Aliases or labels
>> (depending on the model) are used for recognition. Generally it is a much
>> better idea to integrate with the existing ontology and model your data as
>> RDF. The whole model of KIM is based on that, although different approaches
>> are not impossible.
>>> >
>>> > My ontology is still in the process of being developed and it would
>>> > be essential to have a trial before further development.
>>> >
>>> > The practical problem I am trying to solve, is that food safety
>>> > legislation is very extensive and it is difficult for anybody to
>>> > have a good grip of everything contained therein. It is also
>>> > changing continuously and keeping yourself up-to-date with all
>>> > amendments is a herculean task. Therefore, it would be very useful
>>> > to have a search engine that would give you a glance at e.g. all
>>> > Articles containing requirements on sampling at farm level -
>>> > possibly narrowed down by the type of hazard (e.g. microbiological
>>> > sampling). This is just an example but hopefully gives you some idea.
>>> >
>>> > For future I have more challenging problems to tackle but this could
>>> > be a start. My dream is to get something simple up and running
>>> > first, demonstrate the value of it to my colleagues and then
>>> > possibly get some more resources to do further development. I am
>>> > convinced that this is the way to go and semantic technologies are
>>> > essential for us if we want to do things better and more efficiently
>>> > in the future.
>>>
>>> This is challenging enough Juha.
>>> Best results are achieved if we make a joint team involving you/your
>>> team and people from our team. We understand the challenges you are
>>> facing and also that you need to prove something is worthy before
>>> getting resources for it.
>>> We are willing to support you and guide you outside the frame of a
>>> contract - but of course with limited resources, as a lot of things
>>> are going on.
>>> If you think you can find any resources for a contractual frame around
>>> this work - we will be able to dedicate resources and help you
>>> intensely.
>>> If you decide you need to discuss the possibilities - we can talk via
>>> skype or phone.
>>> all the best
>>> borislav
>>>
>>> I understand your resource constraints and I really appreciate the help I
>>> have got so far. Let's see if we can work out something simple and then we
>>> will have a better view where to go. And yes, we can definitely talk via
>>> skype whenever needed.
>>> >
>>> > Cheers,
>>> >
>>> > Juha
>>> >
>>> > -----Original Message-----
>>> > From: Philip Alexiev @ Ontotext [mailto:[email protected]]
>>> > Sent: Tuesday, January 18, 2011 8:55 AM
>>> > To: JUNTTILA Juha (SANCO)
>>> > Cc: [email protected]
>>> > Subject: Re: Update
>>> >
>>> > Hello Juha
>>> >
>>> > We are available and willing to help. We are not familiar with what
>>> > exactly your goals are, so if you give more information it will be
>>> > useful.
>>> >
>>> > all the best
>>> >
>>> > Philip Alexiev
>>> > Software Engineer, KIM team
>>> >
>>> >
>>> > On Jan 17, 2011, at 7:01 PM, <[email protected]> wrote:
>>> >
>>> >> Hello,
>>> >>
>>> >> I have now a couple of successful installations of the latest version
>>> >> of KIM and I would have some questions. Are you guys still
>>> >> available for consultation? It seems that tweaking the system to
>>> >> suit my needs is not quite as straightforward as I initially
>>> >> thought. And I'm afraid it takes me a long time if I try to do it
>>> >> on my own... (it seems to work fine with the example set of texts
>>> >> and I find it a really fascinating application and could be of
>>> >> great interest for us)
>>> >>
>>> >> Best regards,
>>> >>
>>> >> Juha
>>> >>
>>> >> -----Original Message-----
>>> >> From: Philip Alexiev [mailto:[email protected]]
>>> >> Sent: Wednesday, June 16, 2010 9:39 PM
>>> >> To: JUNTTILA Juha (SANCO)
>>> >> Subject: Re: Update
>>> >>
>>> >> Please make an archive with KIM and Tomcat on your machine and
>>> >> provide a way for me to get it. Also provide information about your
>>> >> operating system and browser.
>>> >>
>>> >> This is the first time I have encountered such a problem.
>>> >>
>>> >> Greetings,
>>> >> Philip
>>> >>
>>> >> On 06/16/2010 06:04 PM, [email protected] wrote:
>>> >>> Actually it does - now I have kim-web-ui visible again. It seems
>>> >>> that the problem was that I deleted the contents of
>>> >>> $TOMCAT_HOME/webapps/KIM but not the folder itself. After deleting
>>> >>> the folder, Tomcat finds KIM again. But KIM still seems to be empty
>>> >>> of content even if I try to populate it with toolPopulate. And it
>>> >>> still gives the message "ready with errors on page".
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Philip Alexiev [mailto:[email protected]]
>>> >>> Sent: Wednesday, June 16, 2010 3:51 PM
>>> >>> To: JUNTTILA Juha (SANCO)
>>> >>> Subject: Re: Update
>>> >>>
>>> >>> Does restarting KIM ($KIM_HOME/bin/stopKIM, $KIM_HOME/bin/startKIM)
>>> >>> and restarting tomcat help?
>>> >>>
>>> >>>
>>> >>> On 06/16/2010 05:45 PM, [email protected] wrote:
>>> >>>
>>> >>>> OK - I followed your instructions and emptied all of the three
>>> >>>> folders. It seems now that KIM is not responding when I
>>> >>>> enter http://localhost:8080/KIM/
>>> >>>> into my browser. The browser finds sesame-web-ui and Tomcat but
>>> >>>> not KIM. Should I give up and wait until next week?
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> Philip Alexiev<[email protected]>
>>> >> Software Engineer
>>> >> Ontotext AD
>>> >>
>>> >
>>>
>>>
>>
>

_______________________________________________
Kim-discussion mailing list
[email protected]
http://ontotext.com/mailman/listinfo/kim-discussion
