I just noticed that the example files contain URIs, but most DBpedia languages use IRIs now. For example, http://es.dbpedia.org/resource/Gab%C3%B3n should be http://es.dbpedia.org/resource/Gabón. You can use org.dbpedia.util.text.uri.UriDecoder to achieve this. Just do

uri = UriDecoder.decode(uri)

or something like this. I'm not 100% sure, but I think it should work. You should have a look at the IRIs resulting from this transformation and make sure that they are valid IRIs. The simplest way to check this is probably calling new java.net.URI(uri), which will throw an exception if the IRI is invalid.

JC
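A minimal sketch of the decode-and-validate step suggested above, assuming UriDecoder.decode takes and returns a String (the exact signature should be checked in org.dbpedia.util.text.uri.UriDecoder); the name toValidIri is just a placeholder:

    import java.net.{URI, URISyntaxException}
    import org.dbpedia.util.text.uri.UriDecoder

    // Decode a percent-encoded URI into an IRI and keep it only if it still parses.
    // Assumes UriDecoder.decode(String): String, as suggested above.
    def toValidIri(uri: String): Option[String] = {
      val iri = UriDecoder.decode(uri) // e.g. ...Gab%C3%B3n -> ...Gabón
      try {
        new URI(iri) // throws URISyntaxException if the result is not a valid URI/IRI
        Some(iri)
      } catch {
        case _: URISyntaxException => None
      }
    }

Invalid results could also simply be logged and kept in their percent-encoded form instead of being dropped.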
On 24 July 2013 19:58, Hady elsahar <[email protected]> wrote:
> Sending again because the files were too large to upload on the mailing list.
>
> That's great, I implemented it here:
> https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala
>
> With some quick profiling I found that 96% of the time is spent on hard disk
> access, so any optimization for scaling would need to target the disk access method.
>
> Some output files for 1K Wikidata items are attached:
> http://db.tt/9jekUq51
> http://db.tt/SE3u0AKD
> http://db.tt/GLoBMRua
>
> Thanks,
> Regards
>
> On Wed, Jul 24, 2013 at 9:10 AM, Dimitris Kontokostas <[email protected]> wrote:
>>
>> On Wed, Jul 24, 2013 at 9:42 AM, Hady elsahar <[email protected]> wrote:
>>>
>>> Hello Jona,
>>>
>>> 1- When I asked Dimitris whether Wikidata items are sorted, I meant whether
>>> we can rely on them coming in blocks; maybe I didn't make that clear enough.
>>> If we can rely on that, the problem is solved, but can we? They needn't be
>>> sorted so that Q1 comes before Q2, just grouped in blocks.
>>
>> Yup, sorted is different from blocks :) Even a manual sequence of items
>> (i.e. http://www.wikidata.org/wiki/Special:EntityData/Q1.nt, Q2, ...) will
>> *not* be sorted because of the interwiki links, but it will be in blocks.
>>
>>> 2- Converting Wikidata dumps into files is very clever, that's a very useful
>>> thing to know, thanks. And given that we use memory-mapped files and random
>>> access files to read lines directly instead of parsing from the beginning,
>>> the conversion should be fast.
>>>
>>> Most of the title codes are unique, so I prefer using a number of bits to
>>> represent the whole URI and accessing the URI in the main file by line
>>> number. What would be the problem with that?
>>>
>>> So, to wrap things up:
>>>
>>> Can we rely on Wikidata items coming in chunks? That makes the problem trivial.
>>
>> I also think we can depend on that.
>> For building the interlanguage links this becomes trivial, as Jona said, but
>> I still think that we will need to build an index later.
>> It's not 100% clear yet what we will need to store then, so let's leave it
>> and deal with it when the time comes.
>>
>> Cheers,
>> Dimitris
>>
>>> If not:
>>>
>>> Use bits to represent title codes and languages either way, do the
>>> processing, then replace the bits with the actual DBpedia URIs by looking up
>>> the title codes in the LL files.
>>>
>>> Thanks,
>>> Regards
>>>
>>> On Mon, Jul 22, 2013 at 8:16 PM, Jona Christopher Sahnwaldt <[email protected]> wrote:
>>>>
>>>> Hi Hady, all,
>>>>
>>>> some ideas...
>>>>
>>>> I think we can assume that the entries for the Wikidata items come in
>>>> blocks: first all triples for Qx, then all triples for Qy, and so on. In
>>>> general, it may not be trivial to tell where one block ends and the next
>>>> one begins, because the URI for Qx may not appear in all triples - there
>>>> may be triples about blank nodes or similar stuff that is only indirectly
>>>> connected to Qx. But I think it won't be very hard either.
>>>>
>>>> The current task becomes easy if you rely on the data coming in blocks:
>>>> keep collecting inter-language links for Qx. Stop when you encounter the
>>>> first triple for Qy. Then just take the data collected for Qx and generate
>>>> new triples. Pseudo-Turtle as follows:
>>>>
>>>> One file with Wikidata subject URIs:
>>>>
>>>> wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
>>>> wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
>>>> wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
>>>> ...
>>>>
>>>> And then one file for each language:
>>>>
>>>> xxwiki-same-as.ttl:
>>>>
>>>> <http://xx.dbpedia.org/resource/Xx> sameAs wikidata:Qx
>>>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://yy.dbpedia.org/resource/Yy>
>>>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://zz.dbpedia.org/resource/Zz>
>>>>
>>>> yywiki-same-as.ttl:
>>>>
>>>> <http://yy.dbpedia.org/resource/Yy> sameAs wikidata:Qx
>>>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://xx.dbpedia.org/resource/Xx>
>>>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://zz.dbpedia.org/resource/Zz>
>>>>
>>>> and so on.
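A minimal sketch of the block-wise pass described above, assuming the master file has already been parsed into (wikidataUri, dbpediaUri) pairs in file order; processBlocks and writeTriple are hypothetical placeholders, and routing each output line to the right per-language file is left out:

    // Collect links while the subject stays the same, then emit the quadratic
    // set of sameAs triples for that block, as in the pseudo-Turtle above.
    def processBlocks(pairs: Iterator[(String, String)],
                      writeTriple: (String, String, String) => Unit): Unit = {
      var current: String = null
      val links = scala.collection.mutable.ArrayBuffer[String]()

      def flush(): Unit = if (current != null) {
        for (uri <- links) writeTriple(current, "sameAs", uri)   // wikidata:Qx sameAs <...Xx>
        for (a <- links) {
          writeTriple(a, "sameAs", current)                      // <...Xx> sameAs wikidata:Qx
          for (b <- links if b != a) writeTriple(a, "sameAs", b) // <...Xx> sameAs <...Yy>
        }
        links.clear()
      }

      for ((subject, obj) <- pairs) {
        if (subject != current) { flush(); current = subject }   // a new block starts here
        links += obj
      }
      flush()                                                    // don't forget the last block
    }

Only one block is held in memory at a time, which is why this scales even though each block produces a quadratic number of triples.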
>>>> The number of triples produced from each block is still quadratic in the
>>>> number of IL links in each block, but that's not a problem. The total
>>>> number of generated triples T (and thus a lower bound for the time
>>>> complexity) is O(L*LI), where L is the number of languages and LI is the
>>>> total number of IL links. LI is O(L*W), where W is the total number of
>>>> Wikidata items, so T is O(L^2*W). L is constant and small - 100-300.
>>>>
>>>> But if you really need to load millions of URIs for hundreds of languages
>>>> into RAM, that will probably be possible with some bit-twiddling. For
>>>> DBpedia 3.8, I had to load all IL links into RAM before I could process
>>>> them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
>>>> so. Here's the description from
>>>> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>>>>
>>>> Algorithm:
>>>>
>>>> Each URI is a combination of language code and title string. There are
>>>> only ~12 million unique title strings in the top ~100 languages, so we
>>>> save space by building an index of title strings and using 27 bits (enough
>>>> for ~130 million titles) of the index number instead of the title string.
>>>> We use 10 bits (enough for 1024 languages) of the language index instead
>>>> of the language code. Taken together, these 37 bits fit into a Long. The
>>>> lowest 27 bits are the title code, the next 10 bits are the language code.
>>>> -1 is used as the null value.
>>>>
>>>> A link from a page to another page is represented by a Long value which
>>>> contains the concatenation of the values for the page URIs: the upper 27
>>>> bits contain the title code for the 'from' URI, the next lower 10 bits
>>>> contain the language code for the 'to' URI, and the lowest 27 bits contain
>>>> the title code for the 'to' URI. All links for the 'from' language are
>>>> stored in one array. To find an inverse link, we swap the highest and
>>>> lowest 27 bits, replace the middle 10 bits by the 'from' language, and
>>>> search the array for the 'to' language for the result. To speed up this
>>>> search, we sort the array and use binary search.
>>>>
>>>> TODO: it's a waste of space to store each character of each title
>>>> separately. Maybe a trie could reduce space requirements.
>>>>
>>>> Cheers,
>>>> JC
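A sketch of the bit layout described above; this is only an illustration of the 27/10/27-bit packing, not the actual code from ProcessInterLanguageLinks.scala:

    // 10-bit language index, 27-bit title index, packed into one Long; -1 is the null value.
    val TitleBits = 27
    val TitleMask: Long = (1L << TitleBits) - 1
    val LangBits = 10

    // page URI: language code in bits 27-36, title code in bits 0-26
    def page(lang: Int, title: Int): Long =
      (lang.toLong << TitleBits) | title

    // link: 'from' title in bits 37-63, 'to' language in bits 27-36, 'to' title in bits 0-26
    def link(fromTitle: Int, toLang: Int, toTitle: Int): Long =
      (fromTitle.toLong << (LangBits + TitleBits)) | (toLang.toLong << TitleBits) | toTitle

    // inverse link: swap the two 27-bit title codes and put the 'from' language in the middle
    def inverseLink(l: Long, fromLang: Int): Long = {
      val fromTitle = l >>> (LangBits + TitleBits)
      val toTitle = l & TitleMask
      (toTitle << (LangBits + TitleBits)) | (fromLang.toLong << TitleBits) | fromTitle
    }

    // The inverse is then looked up in the (sorted) link array of the 'to' language:
    //   java.util.Arrays.binarySearch(linksOfToLanguage, inverseLink(l, fromLang)) >= 0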
>>>> On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <[email protected]> wrote:
>>>>>
>>>>> Hi Hady,
>>>>>
>>>>> Could you make an estimate of the total amount of memory that you need
>>>>> for every 1M Wikidata entries? This will give us a better overview. You
>>>>> are free to make assumptions about the average data that you will need
>>>>> (URI size, number of languages, ...).
>>>>>
>>>>> I'd also take a look at memory-mapped files as an alternative. I haven't
>>>>> used them with Java/Scala, but from a little searching around there is
>>>>> native support, which makes them a good candidate.
>>>>>
>>>>> Cheers,
>>>>> Dimitris
>>>>>
>>>>> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>>>>>>
>>>>>> After some playing around and a couple of consultations on Stack
>>>>>> Overflow (here and here), the bottleneck is indexing the triples file
>>>>>> for fast access instead of going through the file line by line.
>>>>>>
>>>>>> Available alternatives:
>>>>>>
>>>>>> 1- keeping only an in-memory index of the line of each subject + using
>>>>>> fixed-length triple lines and a random access file to jump to specific
>>>>>> lines quickly
>>>>>> 2- using a key-value store like Redis, or something like SQLite
>>>>>> 3- sorting the file with merge sort, and then we don't need an index at all
>>>>>> 4- using MapReduce
>>>>>>
>>>>>> I am implementing the first one and testing its reliability on large
>>>>>> data. It seems like a hack, but I guess it's suitable because it is
>>>>>> portable and doesn't require installing any libraries or infrastructure.
>>>>>>
>>>>>> What do you think is the best way to go? Any other suggestions? I have
>>>>>> always faced such problems and solved them with hacks and workarounds,
>>>>>> but I have always wondered what the state of the art is for dealing with
>>>>>> them, if there is a standard. How do you tackle such things in DBpedia?
>>>>>>
>>>>>> Thanks,
>>>>>> Regards
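A rough sketch of alternative 1 above, as a variation that records byte offsets directly instead of padding to fixed-length lines; subjectOf is a hypothetical helper that extracts the subject from an N-Triples line, and a memory-mapped MappedByteBuffer (java.nio) could replace the seek-and-read, as Dimitris suggests:

    import java.io.RandomAccessFile
    import scala.collection.mutable

    // Pass 1: remember the byte offset where each subject's block of lines starts.
    // Assumes the triples of one subject sit on consecutive lines.
    // Note: RandomAccessFile.readLine does not decode UTF-8, so this is only
    // safe for ASCII / percent-encoded URIs.
    def buildIndex(path: String, subjectOf: String => String): mutable.LinkedHashMap[String, Long] = {
      val index = mutable.LinkedHashMap[String, Long]()
      val raf = new RandomAccessFile(path, "r")
      try {
        var offset = raf.getFilePointer
        var line = raf.readLine()
        while (line != null) {
          val s = subjectOf(line)
          if (!index.contains(s)) index(s) = offset // first line of this subject's block
          offset = raf.getFilePointer
          line = raf.readLine()
        }
      } finally raf.close()
      index
    }

    // Lookup: seek to the recorded offset and read lines until the subject changes.
    def readBlock(path: String, start: Long, subject: String,
                  subjectOf: String => String): List[String] = {
      val raf = new RandomAccessFile(path, "r")
      try {
        raf.seek(start)
        Iterator.continually(raf.readLine())
          .takeWhile(l => l != null && subjectOf(l) == subject)
          .toList
      } finally raf.close()
    }

Note that the index still keeps one (subject, offset) pair in memory per Wikidata item, which is exactly what the memory-estimate question above is about.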
>>>>>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Hady,
>>>>>>>
>>>>>>> You could reuse a lot of already defined utility functions for file
>>>>>>> handling & triple parsing, but you are not so familiar with the
>>>>>>> framework yet, so that will come in time.
>>>>>>> See inline for your questions.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> Hoping that everyone is enjoying the summer.
>>>>>>>>
>>>>>>>> I've written a Scala script to generate the language-specific LL link
>>>>>>>> files to be uploaded, as mentioned by JC here.
>>>>>>>>
>>>>>>>> Option 0 in the script extracts the master LL file;
>>>>>>>> option 1 extracts the language-specific links files.
>>>>>>>>
>>>>>>>> The first iteration of the code has complexity O(n^2), where n is the
>>>>>>>> number of lines in the master LL file. It seems quite dumb and would
>>>>>>>> take a lot of time when run on the big dump. There are a lot of easy
>>>>>>>> ways to optimize this, but I had some questions:
>>>>>>>>
>>>>>>>> 1- Can we rely on the triples in the RDF dump being in order? I.e.,
>>>>>>>> for example, all Q1000 entity triples come one after another and we
>>>>>>>> don't need to parse the rest of the file for related triples.
>>>>>>>
>>>>>>> In general no. If you need them that way, you can add a "sort" step in
>>>>>>> the process pipeline.
>>>>>>>
>>>>>>>> 2- For that task, which is better to optimize, memory or time? Loading
>>>>>>>> the file into a HashMap would improve speed a lot, but it may take
>>>>>>>> some memory.
>>>>>>>
>>>>>>> We'd prefer time, but it always depends. A few extra GB of memory
>>>>>>> should be acceptable, but if you want to load a map with all Wikidata
>>>>>>> entries, that will not scale well.
>>>>>>>
>>>>>>>> 3- Just out of curiosity and to set expectations: the language links
>>>>>>>> extraction process for Wikipedia, how much time does it take, and do
>>>>>>>> we dedicate a special server to it, or is it just a small process that
>>>>>>>> doesn't need one?
>>>>>>>
>>>>>>> It's a small task compared to the Wikipedia extraction. At the scale of
>>>>>>> only the language chapters it's around 15-30 minutes. But the initial
>>>>>>> ILL dump is created with the extraction process, so it's not directly
>>>>>>> comparable.
>>>>>>>
>>>>>>> Best,
>>>>>>> Dimitris
>>>>>>>
>>>>>>>> 4- Any suggestions would be great.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Regards
