I just found out that, according to http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md, the min-score can actually be set to 0, in which case all entities will be indexed :). So I'll give that a go (hopefully my dbpedia index won't become gigantic in size).
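For anyone following along, here is a minimal sketch of how a label file in the format shown in the quoted message below could be generated and compressed before dropping it into indexing/resources/rdfdata. It is not the exact script I used. The function names, the input pairs, and the output file name are hypothetical; the namespace and predicate are copied from the example triples in the thread.

```python
import bz2

# Predicate and namespace as they appear in the example triples in this thread.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
YAGO_NS = "http://dbpedia.org/class/yago/"


def escape_literal(text):
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def label_triple(class_id, label, lang="en"):
    """Build one N-Triples line: <class uri> <rdfs:label> "label"@lang ."""
    return '<{}{}> <{}> "{}"@{} .'.format(
        YAGO_NS, class_id, RDFS_LABEL, escape_literal(label), lang)


def write_labels(pairs, path):
    """Write (class_id, label) pairs to a bz2-compressed N-Triples file."""
    with bz2.open(path, "wt", encoding="utf-8") as out:
        for class_id, label in pairs:
            out.write(label_triple(class_id, label) + "\n")
```

Calling write_labels([("Floret111669786", "floret")], "yago_class_labels.nt.bz2") would write a single triple of the same shape as the examples quoted below. Each triple is kept on one line, which is what N-Triples requires and which also avoids URIs being split across lines.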

2014-05-25 16:58 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:

> Hi Rupert.
>
> I'm replying to your suggestions on integrating the yago class labels in
> the dbpedia index in this thread since it's a lot shorter than the other
> one.
>
> For clarity, your suggestions were :
>
> "1. The indexing tool does support LDPath. That means you can import
> all the required RDF files and use LDPath to append the labels of the Yago
> Types directly to the dbpedia entities. This would prevent additional
> lookups to retrieve the types, but also increase the size of the index a
> lot.
> 2. You could also index the Yago Types and use an additional Entityhub
> lookup to retrieve them. In this case you should first collect all types
> referenced by Entities in the processed text and in a second step retrieve
> the labels. While this means additional lookups it will only load the
> labels for a type once. In addition you could use a cache for types.
> 3. Your engine could use LDPath to retrieve the types. This would require to
> index the data like with option (2) and use a LDPath statement similar to
> (1). It would be the slowest solution (as it requires an additional lookup
> for every extracted entity) but require the least code."
>
> It seems that the best solution would be no 2, so I took that path. But
> I'm having some issues with building the dbpedia index with the yago class
> labels.
>
> I managed to create an .nt file from the data files on the yago site which
> contains the yago class labels. The file has this format :
> <http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
> <http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
> <http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .
>
> I compressed this to a .bz2 archive and put it in the
> indexing/resources/rdfdata folder with the rest of them.
>
> After running the indexer I got my dbpedia index but it seems the yago
> class labels are not present in the index. The first clue was that they
> were missing from the indexing/destination/indexed-entities-ids archive.
> Second confirmation came when I tried to retrieve a yago class label by
> calling site.getEntity(yago_class_uri) and the return was null. I should
> mention that the same call works if I want to get a
> http://dbpedia.org/resource/[id] entity.
>
> From what I saw, the indexing process indexes entities only if they are in
> the incoming_links.txt file and only if their score is higher than 2 so I
> guess that's the point where the yago classes were not inserted. From
> looking at the code, the min-score parameter from the minincoming.config
> file cannot be set to 0, or something that would ignore the
> incoming_links.txt ranking and just index everything. So, in this
> situation, is there a solution for getting these yago classes as entities
> in the index?
>
> I'd like to mention that the indexing process did correctly read the
> yago_class_labels.nt file and started to index the entities into Jena.
>
> Thanks,
> Cristian
>
> 2014-05-07 14:54 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>
>> Hi Rupert,
>>
>> Ok, I'll resend this mail in this thread. Again, out of habit I sent it
>> in the gigantic "Named entities coreference" thread instead.
>>
>> So, I managed to create a dbpedia index with the yago class information
>> but looking into the yago_types.nt file which assigns yago classes to
>> dbpedia entities I realized that there are no yago class labels present, I
>> just have the class uri like : <http://dbpedia/..something../President1829302/.
>> I also need the class labels so that I can compare them to the noun
>> token's string from the text.
>>
>> I can get the labels from one of the yago downloads here :
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt.
>> I'll need another yago download file to map the yago wordnet classes to
>> dbpedia uris. That could be done via a script maybe.
>>
>> Once I have the dbpedia_yago_class_uri -> label file is it possible to
>> integrate this data in the dbpedia index and later be able to query the
>> labels from the 'dbpedia' Site? How would that work in the dbpedia indexing
>> process? What should I change in the mappings.txt file? At first glance it
>> seems that the indexing is done based on the incoming_links.txt entity
>> scoring and in my case I don't want to include triples involving the actual
>> entity but triples involving a property of the entity (its yago class).
>>
>> Other than that, I saw that someone will be working on integrating YAGO
>> as part of GSoC 2014. So maybe waiting for that is an option too but I
>> don't know what the extent of the integration will be.
>>
>> Thanks,
>> Cristi
>>
>> 2014-04-30 12:04 GMT+03:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>>
>>> On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
>>> <cristian.petro...@gmail.com> wrote:
>>> > Hi All,
>>> >
>>> > I'm currently working on
>>> > https://issues.apache.org/jira/browse/STANBOL-1279.
>>> >
>>> > I am using the SiteManager to get a Site with referenceId = "dbpedia" and
>>> > am querying data related to some NERs (querying by NER label and type).
>>> > This works and I do get results from the dbpedia index.
>>> >
>>> > What I want to do is this :
>>> >
>>> > 1. I want to be able to store and get yago class types in the dbpedia data.
>>> > This data is stored in the yago-types.nt file from the dbpedia 3.9
>>> > downloads. Is it possible to create a new dbpedia index with the 3.9 files
>>> > using this script
>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>> > ?
>>>
>>> yep. Just make sure you change
>>>
>>> DBPEDIA=http://downloads.dbpedia.org/3.8
>>>
>>> to dbpedia 3.9
>>>
>>> BTW: you can also remove
>>>
>>> #corrects encoding and recompress using gz
>>> bzcat ${filename}.bz2 \
>>> | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>>> | gzip -c > ${filename}.gz
>>> rm -f ${filename}.bz2
>>>
>>> as this is no longer necessary.
>>>
>>> > 2. I want to access some specific dbpedia properties such as
>>> > dbpedia-owl:locationCity and others. These are already present in the
>>> > mappingbased_properties_en.nt
>>> > file which is in the fetch_data_en_int.sh script but are not in the
>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
>>> > file.
>>> > Should I include them there and do a dbpedia index rebuild?
>>>
>>> Exactly. If the size of the created SolrIndex is an issue I recommend
>>> also that you remove properties you do not need.
>>>
>>> > I've already described this in the "Named entity coref resolution based on
>>> > dbpedia" mail thread but I thought of creating a new mail for visibility
>>> > and for not clogging the other thread.
>>>
>>> The old thread is anyways already much too long. Please make sure that
>>> important points and decisions of that thread are also reflected in
>>> the description of STANBOL-1279
>>>
>>> best
>>> Rupert
>>>
>>> > Thanks,
>>> > Cristian
>>>
>>> --
>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11 ++43-699-11108907
>>> | A-5500 Bischofshofen
>>> | REDLINK.CO..........................................................................
>>> | http://redlink.co/