Hi Andrea,
A follow-up:
(1) Sharing your indexes:
This would be great! I talked with a colleague of mine. Most likely we
will add an FTP upload folder to the dev.iks-project.eu server. For
that we will need to add more HDD space to this virtual host, which
might take some more time to accomplish. I will notify you as soon as
we are ready.
(2) dbp-ont:surfaceForm
I recommended that you copy the labels of redirected pages to the
"dbp-ont:surfaceForm" field. In the meantime I ran some tests with an
index built like that. The results were really bad, so I must revoke
this recommendation!
The reason is that the scoring algorithm of Solr is affected by the
multi-valued "dbp-ont:surfaceForm" field. E.g. for dbpedia:Paris you
have ~35 "dbp-ont:surfaceForm" values, of which only ~15 contain
"Paris". So if you now make a query for Paris in this field
(((@en/dbp\-ont\:surfaceForm/:"paris")))
you will notice that dbpedia:Paris is not within the top 10 search
results. Instead, Entities like "Paris Barclay" are listed, because they
have only a single value for "dbp-ont:surfaceForm" and therefore the
match for "Paris" is scored as much more relevant.
This means that the current index layout, where URIs of redirected
pages are represented as their own Entities within the index, is much
better suited for entity extraction.
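
To make the effect concrete, here is a rough back-of-the-envelope sketch.
It assumes that Solr uses Lucene's classic TF-IDF similarity, where the
term frequency factor is sqrt(freq) and the field length norm is
1/sqrt(numTerms). The token counts (about 2 tokens per surface form) are
assumptions for illustration only, and the helper name field_score is
hypothetical. Real scores will differ, among other reasons because the
norm is stored with reduced precision.

```shell
#!/usr/bin/env bash
# Classic Lucene TF-IDF: tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms).
# The idf and query factors are the same for both documents and omitted.
# usage: field_score <term frequency> <number of terms in the field>
field_score() {
  LC_ALL=C awk -v freq="$1" -v terms="$2" \
    'BEGIN { printf "%.3f\n", sqrt(freq) / sqrt(terms) }'
}

# ASSUMPTION: ~35 surface forms with ~2 tokens each, 15 containing "paris"
field_score 15 70   # dbpedia:Paris
# ASSUMPTION: a single surface form "Paris Barclay" with 2 tokens
field_score 1 2     # dbpedia:Paris_Barclay
```

Under these assumptions the single-valued Entity gets the higher factor
(0.707 versus 0.463), even though "paris" occurs 15 times for
dbpedia:Paris.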
On Mon, Nov 5, 2012 at 10:59 AM, Andrea Di Menna <[email protected]> wrote:
> Hi Rupert,
> I would be more than happy to share the indexes.
> I have also created one including redirects by forcibly inserting
> redirecting entities into the incoming_links.txt file.
Do you have a script for creating such an incoming_links.txt file?
It would be very useful for properly creating indexes that include
Entities of redirected pages.
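
As a possible starting point, a minimal sketch of such a script follows.
It is not the script Andrea used. It assumes that each line of
incoming_links.txt has the form "<count> <local-name>" (as written by
`uniq -c`) and that redirects are N-Triples statements using the
wikiPageRedirects property. Both formats and the function name
add_redirects are assumptions and should be checked against the real
files.

```shell
#!/usr/bin/env bash
# Sketch: print the ranking file followed by one extra line per redirect,
# giving each redirect the same count as the entity it redirects to.
# ASSUMPTIONS (not confirmed in this thread):
#  - ranking lines look like "<count> <local-name>"
#  - redirect lines look like: <source> <...wikiPageRedirects> <target> .
# usage: add_redirects <incoming_links file> <uncompressed redirects file>
add_redirects() {
  local links=$1 redirects=$2
  cat "$links"
  awk -v prefix='http://dbpedia.org/resource/' '
    # strip the angle brackets and the resource prefix from a URI
    function localname(uri) {
      gsub(/^<|>$/, "", uri)
      if (index(uri, prefix) == 1) uri = substr(uri, length(prefix) + 1)
      return uri
    }
    # first file: remember the count of every ranked entity
    NR == FNR { rank[$2] = $1; next }
    # second file: redirect statements
    $2 ~ /wikiPageRedirects>$/ {
      src = localname($1); dst = localname($3)
      # only add redirects whose target is ranked and that are not listed yet
      if ((dst in rank) && !(src in rank)) print rank[dst], src
    }
  ' "$links" "$redirects"
}
```

For example, `add_redirects incoming_links.txt redirects_en.nt > incoming_links_with_redirects.txt`
(the output file name is hypothetical) would write the combined list.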
best
Rupert
> Redirects have been assigned the same entity rank as the entities they
> redirect to.
>
> Please let me know how and where to store those indexes.
>
> Cheers
>
> 2012/11/3 Rupert Westenthaler <[email protected]>
>
>> Hi,
>>
>> I have started to play around with indexing dbpedia 3.8 myself as well
>> and I can confirm that one has to preprocess nearly all files. Because
>> of that I have written a shell script that downloads, processes
>> and re-compresses the RDF files:
>>
>> # DBPEDIA must be set to the base URL of the DBpedia 3.8 downloads
>> : "${DBPEDIA:?set DBPEDIA to the DBpedia download base URL}"
>>
>> # array syntax is ({item-1} {item-2} ... {item-n})
>> # names need to include the language path segment!
>> files=(dbpedia_3.8.owl \
>>     en/labels_en.nt \
>>     {all-the-other-files-you-need} \
>>     )
>>
>> for i in "${files[@]}"
>> do
>>     filename=$(basename "$i")
>>     # skip files that were already downloaded and cleaned
>>     if [ ! -f "${filename}.gz" ]
>>     then
>>         url="${DBPEDIA}/${i}.bz2"
>>         wget -c "${url}"
>>         echo "cleaning $filename ..."
>>         # clean possible encoding errors (escape stray backslashes)
>>         # and recompress using gz, because gz is faster
>>         bzcat "${filename}.bz2" \
>>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>>             | gzip -c > "${filename}.gz"
>>         rm -f "${filename}.bz2"
>>     fi
>> done
>>
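
To illustrate what the sed expression in the script above does, here is
a small sketch that applies it to a few made-up sample strings. The
function name clean_escapes and the sample inputs are only for
illustration.

```shell
#!/usr/bin/env bash
# clean_escapes: the sed expression from the script above, reading stdin.
#  1. every pair of backslashes becomes \u005c\u005c
#  2. a backslash not followed by 'u' or '"' becomes \u005c
clean_escapes() {
  sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
}

# sample strings (made up): a double backslash, a stray backslash,
# and escaped quotes that are left untouched
printf '%s\n' 'a\\b' 'a\xb' 'say \"hi\"' | clean_escapes
```

This prints a\u005c\u005cb, a\u005cxb and the unchanged say \"hi\". Note
that, as written, the expression also rewrites escapes such as \n or \t.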
>> > the SolrIndex zip file is about 3.5GB.
>> > I am using a min-score=2 in minincoming.properties
>> > I think the 3.7 index file from the IKS project downloads site was
>> > created with min-score=10.
>>
>> The dbpedia 3.7 index was built by ogrisel, but I think you are right.
>> 3.5GByte for all entities with >=2 incoming links (should be about
>> 4 million entities) sounds reasonable. If you want to share your index
>> with the Stanbol community I am sure we can find a server to host it.
>>
>>
>> Note about languages:
>>
>> While it is easy to include labels, comments and abstracts of additional
>> languages, it is not so easy to add proper Solr field definitions for
>> those languages. There is a great wiki page that provides all the
>> necessary links [1], but I still find it very hard to add configurations
>> for languages I do not understand. So if someone can help with that I
>> am happy to improve the Solr schemas used by the Entityhub (and the
>> Entityhub Indexing tool)!
>>
>>
>> Upgrading the default DBpedia index:
>>
>> After the ApacheCon I will work on replacing the default dbpedia index
>> used with the Stanbol launchers with a dbpedia 3.8 based version (the
>> current one is still based on 3.6). This will need some time because I
>> expect that I will need to adapt a lot of unit/integration tests
>> affected by data changes.
>>
>> [1] http://wiki.apache.org/solr/LanguageAnalysis
>>
>> >
>> > I have indexed english resources and labels from other languages, as this
>> > is what I currently need.
>> >
>> > Cheers
>> > Andrea
>> >
>> > 2012/11/2 harish suvarna <[email protected]>
>> >
>> >> Andrea,
>> >> Thanks for the update. I was also trying to create the Chinese and
>> >> English dbpedia3.8 indexes, but ran out of hardware power.
>> >> What is the size of the dbpedia.solr.index.zip file? It used to be
>> >> 1.9 GB (zip file). But I guess that contained labels from all languages.
>> >>
>> >> Did you index English only?
>> >>
>> >> -harish
>> >>
>> >> On Fri, Nov 2, 2012 at 9:40 AM, Andrea Di Menna <[email protected]> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I have created an EntityHub Solr index from dbpedia 3.8 using the
>> >> > default settings for the dbpedia indexing tool.
>> >> > The index was created successfully.
>> >> >
>> >> > Now that I am working on it I am noticing that wikipedia redirects are
>> >> > completely missing from the EntityHub.
>> >> >
>> >> > I have used the fetch_prepare.sh tool to download data from DBpedia,
>> >> > and among the resources there is also redirects_en.nt.bz2
>> >> > There is a rule in the mappings.txt file to map
>> >> > dbp-ont:wikiPageRedirects to rdfs:seeAlso.
>> >> >
>> >> > From what I can see, the problem seems to be that the indexing tool
>> >> > only takes into account the resources listed in the
>> >> > incoming_links.txt file.
>> >> > This file is built from page_links_en.nt.bz2 and ranks entities on the
>> >> > basis of their incoming links.
>> >> > Page redirects will never have incoming links and hence will not be
>> >> > listed in incoming_links.txt.
>> >> >
>> >> > Is my understanding correct or am I missing anything?
>> >> > Should I forcibly insert page redirect entities in the incoming_links
>> >> > file to get them included in the Solr index?
>> >> >
>> >> > Thank you very much for your time
>> >> >
>> >> > --
>> >> > Andrea Di Menna
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > This e-mail is only intended for the person(s) to whom it is addressed
>> >> > and may contain CONFIDENTIAL information. Any opinions or views are
>> >> > personal to the writer and do not represent those of INQ Mobile Limited,
>> >> > Hutchison Whampoa Limited or its group companies. If you are not the
>> >> > intended recipient, you are hereby notified that any use, retention,
>> >> > disclosure, copying, printing, forwarding or dissemination of this
>> >> > communication is strictly prohibited. If you have received this
>> >> > communication in error, please erase all copies of the message and its
>> >> > attachments and notify the sender immediately. INQ Mobile Limited is a
>> >> > company registered in the British Virgin Islands. www.inqmobile.com.
>> >> >
>> >> >
>> >>
>> >>
>> >> --
>> >> Thanks
>> >> Harish
>> >>
>> >
>>
>>
>>
>> --
>> | Rupert Westenthaler [email protected]
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>
>
> --
> Andrea Di Menna
> INQ - Engineering
> +393925803119
> skype: ninniux
> inqmobile.com
> INQ¹ – Winner of the 2009 Best Handset
>
>
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen