Sending again because the files were too large to upload to the mailing list.

That's great,
I implemented it here:
https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala

From some quick profiling I found that 96% of the time is spent on hard
disk access, so any optimization for scaling would need to target the hard
disk access method.
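
Just as an illustration of where that time goes, here is a minimal sketch of
reading the dump through a large buffered (and optionally gzipped) stream,
which is one common way to turn many small disk reads into a few big
sequential ones; the file name is only a placeholder:

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

object BufferedDumpRead {
  def main(args: Array[String]): Unit = {
    // decompress on the fly and read through a 1 MB buffer, so the disk is
    // hit in large sequential chunks instead of one access per line
    val reader = new BufferedReader(
      new InputStreamReader(
        new GZIPInputStream(new FileInputStream("wikidata-sample.nt.gz")),
        "UTF-8"),
      1 << 20)
    try {
      var count = 0L
      var line = reader.readLine()
      while (line != null) {
        count += 1
        line = reader.readLine()
      }
      println(s"read $count lines")
    } finally {
      reader.close()
    }
  }
}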

Here are some output files for 1K Wikidata items:
http://db.tt/9jekUq51
http://db.tt/SE3u0AKD
http://db.tt/GLoBMRua





thanks
Regards

On Wed, Jul 24, 2013 at 9:10 AM, Dimitris Kontokostas <
[email protected]> wrote:

>
>
>
>> On Wed, Jul 24, 2013 at 9:42 AM, Hady elsahar <[email protected]> wrote:
>
>> Hello Jona ,
>>
>> 1- I asked Dimitris whether the Wikidata items are sorted or not; what I
>> meant was whether we can rely on them coming in blocks. Maybe I didn't make
>> that clear enough. If we can rely on that, the problem is solved, but can
>> we? They needn't be sorted so that Q1 comes before Q2, just grouped in blocks.
>>
>
> Yup, sorted is different from blocks :) Even a manual sequence of items (
> i.e. http://www.wikidata.org/wiki/Special:EntityData/Q1.nt, Q2, ...) will
> *not* be sorted because of the interwiki links, but it will be in blocks
>
>
>> 2- Converting Wikidata dumps into files is very clever; that's a very useful
>> thing to know, thanks. And given that we use memory-mapped files and random
>> access files to read lines directly instead of parsing the file from the
>> beginning, the conversion should be fast.
>>
>>    - Most of the title codes are unique, so I prefer using a number of bits to
>>    represent the whole URI and then looking the URI up in the main file by line
>>    number. What would be the problem with that?
>>
>>
>> So, to wrap things up:
>>
>>
>>    - Can we rely on Wikidata items coming in blocks? That makes the
>>    problem trivial.
>>
>> I also think we can rely on that.
> For building the interlanguage links this becomes trivial, as Jona said, but
> I still think that we will need to build an index later.
> It's not 100% clear right now what we will need to store then, so let's leave
> it and deal with it when the time comes.
>
> Cheers,
> Dimitris
>
>>
>>    - if not:
>>       - use bits to represent the title codes and languages either way
>>       - do the processing
>>       - replace the bits with the actual DBpedia URIs by searching the LL files
>>       by title code
>>
>> thanks
>> Regards
>>
>>
>>
>>
>> On Mon, Jul 22, 2013 at 8:16 PM, Jona Christopher Sahnwaldt <
>> [email protected]> wrote:
>>
>>> Hi Hady, all,
>>>
>>> some ideas...
>>>
>>> I think we can assume that the entries for the Wikidata items come in
>>> blocks: first all triples for Qx, then all triples for Qy, and so on. In
>>> general, it may not be trivial to tell where one block ends and the next
>>> one begins because the URI for Qx may not appear in all triples - there may
>>> be triples about blank nodes or similar stuff that is only indirectly
>>> connected to Qx. But I think it won't be very hard either.
>>>
>>> The current task becomes easy if you rely on the data coming in blocks:
>>> keep collecting inter-language links for Qx. Stop when you encounter the
>>> first triple for Qy. Then just take the data collected for Qx and generate
>>> new triples. Pseudo-Turtle as follows:
>>>
>>> One file with Wikidata subject URIs:
>>>
>>> wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
>>> wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
>>> wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
>>> ...
>>>
>>> And then one file for each language:
>>>
>>> xxwiki-same-as.ttl:
>>>
>>> <http://xx.dbpedia.org/resource/Xx> sameAs wikidata:Qx .
>>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://yy.dbpedia.org/resource/Yy> .
>>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://zz.dbpedia.org/resource/Zz> .
>>>
>>> yywiki-same-as.ttl:
>>>
>>> <http://yy.dbpedia.org/resource/Yy> sameAs wikidata:Qx .
>>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://xx.dbpedia.org/resource/Xx> .
>>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://zz.dbpedia.org/resource/Zz> .
>>>
>>> and so on.
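>>>
>>> A minimal Scala sketch of that block-based pass (the names and the
>>> assumption that the input has already been reduced to (subject, object)
>>> sameAs pairs are mine, not the framework's):
>>>
>>> object BlockLinker {
>>>
>>>   // Emit wikidata:Qx links plus the pairwise links of one block, in the
>>>   // pseudo-Turtle style above. Here we just print; in practice this
>>>   // would write to the per-language files.
>>>   def processBlock(subject: String, targets: Seq[String]): Unit = {
>>>     for (t <- targets) println(s"<$subject> sameAs <$t> .")
>>>     for (a <- targets; b <- targets if a != b)
>>>       println(s"<$a> sameAs <$b> .")
>>>   }
>>>
>>>   // Single pass over the dump, relying on the one-block-per-subject property.
>>>   def run(pairs: Iterator[(String, String)]): Unit = {
>>>     var current: String = null
>>>     val targets = scala.collection.mutable.ArrayBuffer[String]()
>>>     for ((subj, obj) <- pairs) {
>>>       if (subj != current) {                 // first triple of a new block
>>>         if (current != null) processBlock(current, targets.toList)
>>>         current = subj
>>>         targets.clear()
>>>       }
>>>       targets += obj
>>>     }
>>>     if (current != null) processBlock(current, targets.toList)  // last block
>>>   }
>>> }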
>>>
>>> The number of triples produced from each block is still quadratic in the
>>> number of IL links in each block, but that's not a problem. The total
>>> number of generated triples T (and thus a lower bound for time complexity)
>>> is O(L*LI), where L is the number of languages and LI is the total number
>>> of IL links. LI is O(L*W), where W is the total number of Wikidata items,
>>> so T is O(L^2*W). L is constant and small - 100-300.
>>>
>>> But if you really need to load millions of URIs for hundreds of
>>> languages into RAM, that will probably be possible with some bit-twiddling.
>>> For DBpedia 3.8, I had to load all IL links into RAM before I could process
>>> them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
>>> so. Here's the description from
>>> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>>>
>>> Algorithm:
>>>
>>>   Each URI is a combination of language code and title string. There are
>>>   only ~12 million unique title strings in the top ~100 languages, so we
>>>   save space by building an index of title strings and using 27 bits
>>>   (enough for ~130 million titles) of the index number instead of the
>>>   title string. We use 10 bits (enough for 1024 languages) of the language
>>>   index instead of the language code. Taken together, these 37 bits fit
>>>   into a Long. The lowest 27 bits are the title code, the next 10 bits are
>>>   the language code. -1 is used as the null value.
>>>
>>>   A link from a page to another page is represented by a Long value which
>>>   contains the concatenation of the values for the page URIs: the upper 27
>>>   bits contain the title code for the 'from' URI, the next lower 10 bits
>>>   contain the language code for the 'to' URI, and the lowest 27 bits
>>>   contain the title code for the 'to' URI. All links for the 'from'
>>>   language are stored in one array. To find an inverse link, we swap the
>>>   highest and lowest 27 bits, replace the middle 10 bits by the 'from'
>>>   language, and search the array of the 'to' language for the result. To
>>>   speed up this search, we sort the array and use binary search.
>>>
>>>   TODO: it's a waste of space to store each character of each title
>>>   separately. Maybe a trie could reduce space requirements.
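>>>
>>> For illustration, here is a small sketch of that bit layout in Scala (the
>>> constant and method names are mine, not the ones used in
>>> ProcessInterLanguageLinks.scala):
>>>
>>> object LinkBits {
>>>   val TitleBits = 27                      // enough for ~130 million titles
>>>   val LangBits  = 10                      // enough for 1024 languages
>>>   val TitleMask = (1L << TitleBits) - 1
>>>
>>>   // language index in the upper 10 bits, title index in the lowest 27 bits
>>>   def uriCode(lang: Int, title: Int): Long =
>>>     (lang.toLong << TitleBits) | title.toLong
>>>
>>>   // link = fromTitle (27 bits) | toLang (10 bits) | toTitle (27 bits)
>>>   def linkCode(fromTitle: Int, toLang: Int, toTitle: Int): Long =
>>>     (fromTitle.toLong << (LangBits + TitleBits)) |
>>>       (toLang.toLong << TitleBits) |
>>>       toTitle.toLong
>>>
>>>   // swap the highest and lowest 27 bits and replace the middle 10 bits by
>>>   // the 'from' language; binary-search the 'to' language's sorted array
>>>   // for this value to check whether the inverse link exists
>>>   def inverseCandidate(link: Long, fromLang: Int): Long = {
>>>     val fromTitle = link >>> (LangBits + TitleBits)
>>>     val toTitle   = link & TitleMask
>>>     (toTitle << (LangBits + TitleBits)) |
>>>       (fromLang.toLong << TitleBits) |
>>>       fromTitle
>>>   }
>>> }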
>>>
>>> Cheers,
>>> JC
>>> On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <
>>> [email protected]> wrote:
>>>
>>>> Hi Hady,
>>>>
>>>> Could you make an estimate of the total amount of memory that you need
>>>> for every 1M Wikidata entries? This will give us a better overview.
>>>> You are free to make assumptions about the average data that you will need
>>>> (URI size, number of languages, ...).
>>>>
>>>> I'd also take a look at memory-mapped files as an alternative.
>>>> I haven't used them with Java/Scala, but from searching around a little
>>>> there is native support, which makes them a good candidate.
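>>>>
>>>> For what it's worth, here is a quick sketch of memory-mapping a file from
>>>> Scala via java.nio (the file name is a placeholder, and chunking for files
>>>> larger than 2 GB is left out):
>>>>
>>>> import java.io.RandomAccessFile
>>>> import java.nio.channels.FileChannel
>>>>
>>>> object MappedFileDemo {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val raf = new RandomAccessFile("triples.nt", "r")   // placeholder name
>>>>     try {
>>>>       val channel = raf.getChannel
>>>>       // a single MappedByteBuffer is limited to 2 GB, so map at most that
>>>>       val size = math.min(channel.size, Int.MaxValue.toLong)
>>>>       // the OS pages the mapped region in lazily, no explicit read() calls
>>>>       val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size)
>>>>       println(s"mapped $size bytes, first byte: ${buffer.get(0)}")
>>>>     } finally {
>>>>       raf.close()
>>>>     }
>>>>   }
>>>> }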
>>>>
>>>> Cheers,
>>>> Dimitris
>>>>
>>>>
>>>>> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>>>>
>>>>> After some playing around and a couple of consultations on Stack Overflow
>>>>> (here
>>>>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>>>>> and here
>>>>> <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>),
>>>>> the bottleneck is indexing the triples file for fast access instead of
>>>>> going through the file line by line.
>>>>>
>>>>> Available alternatives:
>>>>>
>>>>> 1- indexing only the line offsets of each subject in memory + using
>>>>> fixed-length triple lines and a RandomAccessFile to jump to specific
>>>>> lines quickly
>>>>> 2- using a key-value store like Redis, or something like SQLite
>>>>> 3- sorting the file with merge sort, so that we don't need an index at all
>>>>> 4- using MapReduce
>>>>>
>>>>> I am implementing the first one and testing its reliability on large
>>>>> data. It seems like a hack, but I think it is suitable because it is
>>>>> portable and doesn't require installing any libraries or infrastructure.
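>>>>>
>>>>> A rough sketch of that first alternative, assuming the triples file has
>>>>> been rewritten with fixed-length (padded) lines so that line i starts at
>>>>> byte i * RecordLength; the record length and file name are made-up values:
>>>>>
>>>>> import java.io.RandomAccessFile
>>>>>
>>>>> object FixedLineAccess {
>>>>>   val RecordLength = 256        // assumed fixed line length, incl. newline
>>>>>
>>>>>   // jump straight to a line number instead of scanning from the start
>>>>>   def readLine(raf: RandomAccessFile, lineNo: Long): String = {
>>>>>     raf.seek(lineNo * RecordLength)
>>>>>     val bytes = new Array[Byte](RecordLength)
>>>>>     raf.readFully(bytes)
>>>>>     new String(bytes, "UTF-8").trim       // drop the padding
>>>>>   }
>>>>>
>>>>>   def main(args: Array[String]): Unit = {
>>>>>     val raf = new RandomAccessFile("master-ll-fixed.nt", "r")  // placeholder
>>>>>     try println(readLine(raf, 42)) finally raf.close()
>>>>>   }
>>>>> }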
>>>>>
>>>>>
>>>>>    - What do you think is the best approach to go with? Any other
>>>>>    suggestions?
>>>>>    - I have often faced such problems and solved them with hacks and
>>>>>    workarounds, but I always wondered what the state of the art is for dealing
>>>>>    with such problems, and whether there is a standard for that. How do you
>>>>>    tackle such things in DBpedia?
>>>>>
>>>>>
>>>>> thanks
>>>>> Regards
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Hady,
>>>>>>
>>>>>> You could reuse a lot of already defined utility functions for file
>>>>>> & triple parsing, but you are not so familiar with the framework yet, so
>>>>>> that will come in time.
>>>>>> See inline for your questions.
>>>>>>
>>>>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> Hoping that everyone is enjoying the summer.
>>>>>>>
>>>>>>> I've written a Scala script
>>>>>>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>>>>>> to do the task of generating the language-specific LL link files to be
>>>>>>> uploaded, as mentioned by JC here
>>>>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>.
>>>>>>>
>>>>>>> Option 0 in the script extracts the master LL file.
>>>>>>> Option 1 extracts the language-specific links files.
>>>>>>>
>>>>>>> The first iteration of the code has complexity O(n^2), where n is
>>>>>>> the number of lines in the master LL file. That seems quite naive and
>>>>>>> would take a lot of time when run on the big dump. There are plenty of
>>>>>>> easy ways to optimize this, but I had some questions:
>>>>>>>
>>>>>>> 1- Can we rely on the triples in the RDF dump being in order?
>>>>>>> i.e., for example, all Q1000 entity triples come one after another, so we
>>>>>>> don't need to parse the rest of the file for related triples.
>>>>>>>
>>>>>>
>>>>>> In general, no. If you need them that way, you can add a "sort" step to
>>>>>> the processing pipeline.
>>>>>>
>>>>>>
>>>>>>> 2- In this task, which is better to optimize, memory or time? Loading
>>>>>>> the file into a HashMap would speed things up a lot, but it may take
>>>>>>> quite a bit of memory.
>>>>>>>
>>>>>>
>>>>>> We'd prefer time, but it always depends. A few extra GB of memory
>>>>>> should be acceptable, but if you want to load a map with all Wikidata
>>>>>> entries, that will not scale well.
>>>>>>
>>>>>>
>>>>>>> 3- Just out of curiosity, and to set expectations: how long does the
>>>>>>> language links extraction process for Wikipedia take, and do we
>>>>>>> dedicate a special server to it? Or is it just a small process that
>>>>>>> doesn't need one?
>>>>>>>
>>>>>>
>>>>>> It's a small task compared to the Wikipedia extraction. At the scale
>>>>>> of just the language chapters it takes around 15-30 minutes. But the
>>>>>> initial ILL dump is created by the extraction process, so it's not
>>>>>> directly comparable.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Dimitris
>>>>>>
>>>>>>
>>>>>>> 4- Any suggestions would be great.
>>>>>>>
>>>>>>> thanks
>>>>>>> Regards
>>>>>>>
>>>>>>> -------------------------------------------------
>>>>>>> Hady El-Sahar
>>>>>>> Research Assistant
>>>>>>> Center of Informatics Sciences | Nile 
>>>>>>> University<http://nileuniversity.edu.eg/>
>>>>>>>
>>>>>>> email : [email protected]
>>>>>>> Phone : +2-01220887311
>>>>>>> http://hadyelsahar.me/
>>>>>>>
>>>>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dimitris Kontokostas
>>>>>> Department of Computer Science, University of Leipzig
>>>>>> Research Group: http://aksw.org
>>>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -------------------------------------------------
>>>>> Hady El-Sahar
>>>>> Research Assistant
>>>>> Center of Informatics Sciences | Nile 
>>>>> University<http://nileuniversity.edu.eg/>
>>>>>
>>>>> email : [email protected]
>>>>> Phone : +2-01220887311
>>>>> http://hadyelsahar.me/
>>>>>
>>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dimitris Kontokostas
>>>> Department of Computer Science, University of Leipzig
>>>> Research Group: http://aksw.org
>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile 
>> University<http://nileuniversity.edu.eg/>
>>
>> email : [email protected]
>> Phone : +2-01220887311
>> http://hadyelsahar.me/
>>
>> <http://www.linkedin.com/in/hadyelsahar>
>>
>>
>>
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>



-- 
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>

email : [email protected]
Phone : +2-01220887311
http://hadyelsahar.me/

<http://www.linkedin.com/in/hadyelsahar>