I just found out that, according to
http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md,
the min-score can actually be set to 0, in which case all entities will be
indexed :).
So I'll give that a go (hopefully my dbpedia index won't become gigantic
in size).
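
As a sanity check on what I expect from that setting, here is a rough model of the filter in Python. It is only a sketch under my own assumptions, not the Stanbol code: select_entities and the sample data are made up, entities missing from incoming_links.txt are treated as having score 0, and the comparison is assumed to be inclusive.

```python
from typing import Dict, Iterable, List


def select_entities(
    entity_uris: Iterable[str],
    incoming_links: Dict[str, int],
    min_score: int,
) -> List[str]:
    """Return the URIs whose incoming-link count reaches min_score.

    Assumption: a URI with no entry in incoming_links counts as score 0,
    and the comparison is inclusive (score >= min_score).
    """
    return [
        uri
        for uri in entity_uris
        if incoming_links.get(uri, 0) >= min_score
    ]


# Hypothetical data: one DBpedia resource with incoming links and one
# YAGO class that never appears in the incoming-links ranking.
links = {"http://dbpedia.org/resource/Paris": 5000}
uris = [
    "http://dbpedia.org/resource/Paris",
    "http://dbpedia.org/class/yago/Floret111669786",
]

kept_default = select_entities(uris, links, min_score=2)  # YAGO class dropped
kept_all = select_entities(uris, links, min_score=0)      # everything kept
```

If the real indexer behaves like this model, a min-score of 0 lets the yago classes through even though they have no incoming links.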


2014-05-25 16:58 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:

> Hi Rupert.
>
> I'm replying to your suggestions on integrating the yago class labels in
> the dbpedia index in this thread since it's a lot shorter than the other
> one.
>
> For clarity, your suggestions were:
>
> "1. The indexing tool does support LDPath. That means you can import
> all the required RDF files and use LDPath to append the labels of the Yago
> Types directly to the dbpedia entities. This would prevent additional
> lookups to retrieve the types, but also increase the size of the index a
> lot.
> 2. You could also index the Yago Types and use an additional Entityhub
> lookup to retrieve them. In this case you should first collect all types
> referenced by Entities in the processed text and in a second step retrieve
> the labels. While this means additional lookups it will only load the
> labels for a type once. In addition you could use a cache for types.
> 3. Your engine could use LDPath to retrieve the types. This would require
> indexing the data like with option (2) and using an LDPath statement
> similar to (1). It would be the slowest solution (as it requires an
> additional lookup for every extracted entity) but require the least code."
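
For option 2, the two steps could look roughly like this. This is a Python sketch of the idea only: collect_types, load_labels and the lookup callable are hypothetical stand-ins for the real Entityhub lookup, which is not shown here, and the example entity URIs are made up.

```python
from typing import Callable, Dict, Iterable, Optional, Set

FLORET = "http://dbpedia.org/class/yago/Floret111669786"
VARIETAL = "http://dbpedia.org/class/yago/Varietal107900225"


def collect_types(entity_types: Dict[str, Iterable[str]]) -> Set[str]:
    """Step 1: gather every distinct type URI referenced by the entities."""
    distinct: Set[str] = set()
    for types in entity_types.values():
        distinct.update(types)
    return distinct


def load_labels(
    type_uris: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    cache: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Step 2: look up each type once, reusing results already in the cache."""
    uris = list(type_uris)
    for uri in uris:
        if uri not in cache:
            cache[uri] = lookup(uri)  # one lookup per distinct, uncached type
    return {uri: cache[uri] for uri in uris}


# Stand-in for the real lookup; it records each call so the caching is visible.
lookup_calls = []
SAMPLE_LABELS = {FLORET: "floret", VARIETAL: "varietal"}


def fake_lookup(uri: str) -> Optional[str]:
    lookup_calls.append(uri)
    return SAMPLE_LABELS.get(uri)


# Hypothetical entities extracted from a text, with their yago types.
entity_types = {
    "http://dbpedia.org/resource/ExampleA": [FLORET],
    "http://dbpedia.org/resource/ExampleB": [FLORET, VARIETAL],
}
cache: Dict[str, Optional[str]] = {}
labels = load_labels(collect_types(entity_types), fake_lookup, cache)
```

Two entities share the Floret type here, yet the lookup runs only once per distinct type, and a later call with the same cache triggers no further lookups.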
>
> It seems that the best solution would be no 2, so I took that path. But
> I'm having some issues with building the dbpedia index with the yago class
> labels.
>
> I managed to create an .nt file from the data files on the yago site which
> contains the yago class labels. The file has this format:
> <http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
> <http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
> <http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .
>
> I compressed this to a .bz2 archive and put it in the
> indexing/resources/rdfdata folder with the rest of them.
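
For reference, a file in that format can be generated along these lines. This is a minimal Python sketch, not necessarily how the file above was produced; the helper names are made up, and the output file name in the note below is an assumption.

```python
import bz2
from typing import Iterable, Tuple

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def label_triple(class_uri: str, label: str, lang: str = "en") -> str:
    """Format one rdfs:label statement as a single N-Triples line."""
    return f'<{class_uri}> <{RDFS_LABEL}> "{escape_literal(label)}"@{lang} .'


def build_labels_archive(labels: Iterable[Tuple[str, str]]) -> bytes:
    """Serialize (class URI, label) pairs to N-Triples and bzip2-compress them."""
    lines = [label_triple(uri, label) for uri, label in labels]
    return bz2.compress(("\n".join(lines) + "\n").encode("utf-8"))


archive = build_labels_archive(
    [("http://dbpedia.org/class/yago/Floret111669786", "floret")]
)
```

The resulting bytes would then be written to a file such as yago_class_labels.nt.bz2 in the indexing/resources/rdfdata folder. Note that each statement has to stay on a single line.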
>
> After running the indexer I got my dbpedia index but it seems the yago
> class labels are not present in the index. The first clue was that they
> were missing from the indexing/destination/indexed-entities-ids archive.
> Second confirmation came when I tried to retrieve a yago class label by
> calling site.getEntity(yago_class_uri) and the return was null. I should
> mention that the same call works if I want to get a
> http://dbpedia.org/resource/[id] entity.
>
> From what I saw, the indexing process indexes entities only if they are in
> the incoming_links.txt file and only if their score is higher than 2 so I
> guess that's the point where the yago classes were not inserted. From
> looking at the code, the min-score parameter from the minincoming.config
> file cannot be set to 0, or something that would ignore the
> incoming_links.txt ranking and just index everything. So, in this
> situation, is there a solution for getting these yago classes as entities
> in the index?
>
> I'd like to mention that the indexing process did correctly read the
> yago_class_labels.nt file and started to index the entities into Jena.
>
> Thanks,
> Cristian
>
>
>
> 2014-05-07 14:54 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>
>> Hi Rupert,
>>
>> Ok, I'll resend this mail in this thread. Again, out of habit I sent it
>> in the gigantic "Named entities coreference" thread instead.
>>
>> So, I managed to create a dbpedia index with the yago class information
>> but looking into the yago_types.nt file which assigns yago classes to
>> dbpedia entities I realized that there are no yago class labels present; I
>> just have the class URI, like
>> <http://dbpedia/..something../President1829302/>. I also need the class
>> labels so that I can compare them to the noun token's string from the text.
>>
>> I can get the labels from one of the yago downloads here :
>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt.
>> I'll need another yago download file to map the yago wordnet classes to
>> dbpedia uris. That could be done via a script maybe.
>>
>> Once I have the dbpedia_yago_class_uri -> label file is it possible to
>> integrate this data in the dbpedia index and later be able to query the
>> labels from the 'dbpedia' Site? How would that work in the dbpedia indexing
>> process? What should I change in the mappings.txt file? At first glance it
>> seems that the indexing is done based on the incoming_links.txt entity
>> scoring and in my case I don't want to include triples involving the actual
>> entity but triples involving a property of the entity (its yago class).
>>
>> Other than that, I saw that someone will be working on integrating YAGO
>> as part of GSoC 2014. So maybe waiting for that is an option too, but I
>> don't know what the extent of the integration will be.
>>
>> Thanks,
>> Cristi
>>
>>
>> 2014-04-30 12:04 GMT+03:00 Rupert Westenthaler <
>> rupert.westentha...@gmail.com>:
>>
>> On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
>>> <cristian.petro...@gmail.com> wrote:
>>> > Hi All,
>>> >
>>> > I'm currently working on
>>> https://issues.apache.org/jira/browse/STANBOL-1279.
>>> >
>>> > I am using the SiteManager to get a Site with referenceId = "dbpedia"
>>> and
>>> > am querying data related to some NERs (querying by NER label and type).
>>> > This works and I do get results from the dbpedia index.
>>> >
>>> > What I want to do is this :
>>> >
>>> > 1. I want to be able to store and get yago class types in the dbpedia
>>> data.
>>> > This data is stored in the yago-types.nt file from the dbpedia 3.9
>>> > downloads. Is it possible to create a new dbpedia index with the 3.9
>>> files
>>> > using this script
>>> >
>>> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>> > ?
>>>
>>> yep. Just make sure you change
>>>
>>>     DBPEDIA=http://downloads.dbpedia.org/3.8
>>>
>>> to dbpedia 3.9
>>>
>>> BTW: you can also remove
>>>
>>>         #corrects encoding and recompress using gz
>>>         bzcat ${filename}.bz2 \
>>>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>>>             | gzip -c > ${filename}.gz
>>>         rm -f ${filename}.bz2
>>>
>>> as this is no longer necessary.
>>>
>>> >
>>> > 2. I want to access some specific dbpedia properties such as
>>> > dbpedia-owl:locationCity and others. These are already present in the
>>> > mappingbased_properties_en.nt
>>> > file which is in the fetch_data_en_int.sh script but are not in the
>>> >
>>> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
>>> > file.
>>> > Should I include them there and do a dbpedia index rebuild?
>>>
>>> Exactly. If the size of the created SolrIndex is an issue I recommend
>>> also that you remove properties you do not need.
>>>
>>> >
>>> > I've already described this in the "Named entity coref resolution
>>> based on
>>> > dbpedia" mail thread but I thought of creating a new mail for
>>> visibility
>>> > and for not clogging the other thread.
>>>
>>> The old thread is already much too long anyway. Please make sure that
>>> important points and decisions of that thread are also reflected in
>>> the description of STANBOL-1279
>>>
>>> best
>>> Rupert
>>>
>>> >
>>> > Thanks,
>>> > Cristian
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11                              ++43-699-11108907
>>> | A-5500 Bischofshofen
>>> | REDLINK.CO..........................................................................
>>> | http://redlink.co/
>>>
>>
>>
>
