As a rough guideline: To process all IL links in the top ~100 languages
(probably 100-200 million links, I don't remember), the JVM needed 6-8 GB.
And -XX:+UseConcMarkSweepGC really helped.
On Jul 22, 2013 8:16 PM, "Jona Christopher Sahnwaldt" <[email protected]>
wrote:
> Hi Hady, all,
>
> some ideas...
>
> I think we can assume that the entries for the Wikidata items come in
> blocks: first all triples for Qx, then all triples for Qy, and so on. In
> general, it may not be trivial to tell where one block ends and the next
> one begins because the URI for Qx may not appear in all triples - there may
> be triples about blank nodes or similar stuff that is only indirectly
> connected to Qx. But I think it won't be very hard either.
>
> The current task becomes easy if you rely on the data coming in blocks:
> keep collecting inter-language links for Qx, stop when you encounter the
> first triple for Qy, then take the data collected for Qx and generate the
> new triples. Pseudo-Turtle as follows:
>
> One file with Wikidata subject URIs:
>
> wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
> wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
> wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
> ...
>
> And then one file for each language:
>
> xxwiki-same-as.ttl:
>
> <http://xx.dbpedia.org/resource/Xx> sameAs wikidata:Qx .
> <http://xx.dbpedia.org/resource/Xx> sameAs <http://yy.dbpedia.org/resource/Yy> .
> <http://xx.dbpedia.org/resource/Xx> sameAs <http://zz.dbpedia.org/resource/Zz> .
>
> yywiki-same-as.ttl:
>
> <http://yy.dbpedia.org/resource/Yy> sameAs wikidata:Qx .
> <http://yy.dbpedia.org/resource/Yy> sameAs <http://xx.dbpedia.org/resource/Xx> .
> <http://yy.dbpedia.org/resource/Yy> sameAs <http://zz.dbpedia.org/resource/Zz> .
>
> and so on.
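>
> Here's a rough Scala sketch of that block-wise pass, just to make it
> concrete (the line parsing and the output handling are simplified
> assumptions, not the real script):
>
> import scala.io.Source
>
> object BlockwiseSameAs {
>
>   // Assumed line format: "wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> ."
>   def parseLine(line: String): Option[(String, String)] = {
>     val fields = line.trim.split("\\s+")
>     if (fields.length >= 3 && fields(1) == "sameAs") Some((fields(0), fields(2)))
>     else None
>   }
>
>   // Emit all pairwise sameAs triples for one block; in practice each triple
>   // would be routed to the file of its subject's language.
>   def emit(uris: Seq[String]): Unit =
>     for (from <- uris; to <- uris if from != to)
>       println(s"$from sameAs $to .")
>
>   def main(args: Array[String]): Unit = {
>     var currentItem: String = null
>     var block = Vector.empty[String]
>     for (line <- Source.fromFile(args(0)).getLines(); parsed <- parseLine(line)) {
>       val (item, uri) = parsed
>       if (item != currentItem) {
>         // first triple of the next item: flush the block collected so far
>         if (currentItem != null) emit(currentItem +: block)
>         currentItem = item
>         block = Vector.empty
>       }
>       block :+= uri
>     }
>     if (currentItem != null) emit(currentItem +: block)
>   }
> }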
>
> The number of triples produced from each block is still quadratic in the
> number of IL links in each block, but that's not a problem. The total
> number of generated triples T (and thus a lower bound for time complexity)
> is O(L*LI), where L is the number of languages and LI is the total number
> of IL links. LI is O(L*W), where W is the total number of Wikidata items,
> so T is O(L^2*W). L is constant and small - 100-300.
>
> But if you really need to load millions of URIs for hundreds of languages
> into RAM, that will probably be possible with some bit-twiddling. For
> DBpedia 3.8, I had to load all IL links into RAM before I could process
> them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
> so. Here's the description from
> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>
> Algorithm:
>
> Each URI is a combination of language code and title string. There are
> only ~12 million
> unique title strings in the top ~100 languages, so we save space by
> building an index of title
> strings and using 27 bits (enough for ~130 million titles) of the index
> number instead of the
> title string. We use 10 bits (enough for 1024 languages) of the language
> index instead of the
> language code. Taken together, these 37 bits fit into a Long. The lowest
> 27 bits are the title code,
> the next 10 bits are the language code. -1 is used as the null value.
>
> A link from a page to another page is represented by a Long value which
> contains the
> concatenation of the values for the page URIs: the upper 27 bits contain
> the title code
> for the 'from' URI, the next lower 10 bits contain the language code for
> the 'to' URI, and the
> lowest 27 bits contain the title code for the 'to' URI. All links for
> the 'from' language
> are stored in one array. To find an inverse link, we swap the highest
> and lowest 27 bits,
> replace the middle 10 bits by the 'from' language, and search the
> 'to' language's array
> for the resulting value. To speed up this search, we sort the array and
> use binary search.
>
> TODO: it's a waste of space to store each character of each title
> separately. Maybe a trie
> could reduce space requirements.
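>
> For illustration, a small Scala sketch of that packing scheme (the names
> are made up for this sketch, not taken from the actual script):
>
> object LinkCodes {
>   val TitleBits = 27                  // enough for ~130 million titles
>   val LangBits  = 10                  // enough for 1024 languages
>   val TitleMask = (1L << TitleBits) - 1
>
>   // URI code: language index in bits 27..36, title index in bits 0..26.
>   def uriCode(lang: Int, title: Int): Long =
>     (lang.toLong << TitleBits) | title.toLong
>
>   // Link code: 'from' title in bits 37..63, 'to' language in bits 27..36,
>   // 'to' title in bits 0..26. The 'from' language is implicit because all
>   // links of one 'from' language live in the same sorted array.
>   def linkCode(fromTitle: Int, toLang: Int, toTitle: Int): Long =
>     (fromTitle.toLong << (LangBits + TitleBits)) |
>       (toLang.toLong << TitleBits) |
>       toTitle.toLong
>
>   // Inverse link: swap the highest and lowest 27 bits and put the 'from'
>   // language into the middle 10 bits. Whether the back-link exists can then
>   // be checked with java.util.Arrays.binarySearch on the sorted Long array
>   // of the 'to' language.
>   def inverseLink(link: Long, fromLang: Int): Long = {
>     val fromTitle = (link >>> (LangBits + TitleBits)) & TitleMask
>     val toTitle   = link & TitleMask
>     (toTitle << (LangBits + TitleBits)) | (fromLang.toLong << TitleBits) | fromTitle
>   }
> }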
>
> Cheers,
> JC
> On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <
> [email protected]> wrote:
>
>> Hi Hady,
>>
>> Could you estimate the total amount of memory that you need for
>> every 1M Wikidata entries? This will give us a better overview.
>> You are free to make assumptions about the average data that you will need
>> (URI size, number of languages, ...).
>>
>> I'd also take a look at "memory-mapped files" as an alternative.
>> I haven't used them with Java/Scala, but from searching around a little
>> there is native support, which makes them a good candidate.
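>>
>> For what it's worth, a minimal sketch of that native support from Scala
>> via java.nio (the file name is just a placeholder):
>>
>> import java.nio.channels.FileChannel
>> import java.nio.file.{Paths, StandardOpenOption}
>>
>> object MmapSketch {
>>   def main(args: Array[String]): Unit = {
>>     val channel = FileChannel.open(Paths.get("wikidata-links.nt"), StandardOpenOption.READ)
>>     try {
>>       // A single MappedByteBuffer covers at most ~2 GB, so bigger files
>>       // need several mapped windows.
>>       val size = math.min(channel.size(), Int.MaxValue.toLong)
>>       val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size)
>>       // The OS pages the file in on demand; we can read it like a byte array.
>>       println("mapped " + size + " bytes, first byte = " + buffer.get(0))
>>     } finally channel.close()
>>   }
>> }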
>>
>> Cheers,
>> Dimitris
>>
>>
>> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>>
>>> After some playing around and a couple of consultations on Stack Overflow,
>>> here
>>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>>> and here
>>> <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
>>> the bottleneck is indexing the triples file for fast access instead of
>>> going through the file line by line.
>>>
>>> Available alternatives:
>>>
>>> 1- indexing only the lines of each subject in memory + using fixed-length
>>> triple lines and a RandomAccessFile to access specific lines fast
>>> 2- using a key-value store like Redis, or something like SQLite
>>> 3- sorting the file with merge sort, so that we don't need an index at all
>>> 4- using MapReduce
>>>
>>> I am implementing the first one and testing its reliability on large
>>> data (roughly as sketched below). It may look like a hack, but I think it's
>>> suitable because it is portable and doesn't need any extra libraries or
>>> infrastructure.
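>>>
>>> A rough sketch of alternative 1, assuming every line is padded to a fixed
>>> record length and that the subject is the first whitespace-delimited token
>>> (file name, record length and lookup key are placeholders):
>>>
>>> import java.io.RandomAccessFile
>>> import scala.collection.mutable
>>>
>>> object FixedWidthIndex {
>>>   val RecordLength = 512  // every line padded to this many bytes, '\n' included
>>>
>>>   def main(args: Array[String]): Unit = {
>>>     val raf = new RandomAccessFile("wikidata-fixed.nt", "r")
>>>     try {
>>>       // One pass to build the in-memory index: subject -> first line of its block.
>>>       val index = mutable.HashMap[String, Long]()
>>>       var lineNo = 0L
>>>       var line = raf.readLine()
>>>       while (line != null) {
>>>         val subject = line.takeWhile(_ != ' ')
>>>         if (!index.contains(subject)) index(subject) = lineNo
>>>         lineNo += 1
>>>         line = raf.readLine()
>>>       }
>>>       // Random access afterwards: seek straight to the block of a subject.
>>>       index.get("wikidata:Q1000").foreach { l =>
>>>         raf.seek(l * RecordLength)
>>>         println(raf.readLine().trim)
>>>       }
>>>     } finally raf.close()
>>>   }
>>> }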
>>>
>>>
>>> - What do you think is the best approach to take? Any other
>>> suggestions?
>>> - I have always faced such problems and solved them with hacks and
>>> workarounds, but I have always wondered what the state of the art is for
>>> dealing with such problems, and whether there's a standard for that. How
>>> do you tackle such things in DBpedia?
>>>
>>>
>>> thanks
>>> Regards
>>>
>>>
>>>
>>>
>>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>>> [email protected]> wrote:
>>>
>>>> Hi Hady,
>>>>
>>>> You could reuse a lot of already-defined utility functions for file handling &
>>>> triple parsing, but you are not so familiar with the framework yet, so that
>>>> will come in time.
>>>> See inline for answers to your questions.
>>>>
>>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> Hoping that everyone is enjoying the summer,
>>>>>
>>>>> I've written a Scala script
>>>>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>>>> to do the task of generating the language-specific LL-links files to be
>>>>> uploaded, as mentioned by JC here
>>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>>>
>>>>> Option 0 in the script is for extracting the master LL file;
>>>>> option 1 is for extracting the language-specific links files.
>>>>>
>>>>> The first iteration of the code has complexity O(n^2), where n is
>>>>> the number of lines in the master LL file. It seems quite dumb and would
>>>>> take a lot of time when run on the big dump. There are a lot of easy ways
>>>>> to optimize this, but I had some questions:
>>>>>
>>>>> 1- Can we depend on the triples in the RDF dump being in order?
>>>>> (i.e. all triples for entity Q1000 come one after another, so we
>>>>> don't need to parse the rest of the file for related triples)
>>>>>
>>>>
>>>> In general, no. If you need them that way, you can add a "sort" step to
>>>> the processing pipeline.
>>>>
>>>>
>>>>> 2- In that task, which is better to optimize for, memory or time?
>>>>> Loading the file into a HashMap would speed things up a lot, but it may
>>>>> take some memory.
>>>>>
>>>>
>>>> We'd prefer time, but it always depends. A few extra GB of memory should
>>>> be acceptable, but if you want to load a map with all Wikidata entries, that
>>>> will not scale well.
>>>>
>>>>
>>>>> 3- Just out of curiosity and to set expectations: how long does the
>>>>> language links extraction process for Wikipedia take, and do we dedicate
>>>>> a special server to it, or does it not need one because it's just a small
>>>>> process?
>>>>>
>>>>
>>>> It's a small task compared to the Wikipedia extraction. At the scale of
>>>> only the language chapters it takes around 15-30 minutes. But the initial ILL
>>>> dump is created during the extraction process, so it's not directly comparable.
>>>>
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>>
>>>>> 4- Any suggestions would be great.
>>>>>
>>>>> thanks
>>>>> Regards
>>>>>
>>>>> -------------------------------------------------
>>>>> Hady El-Sahar
>>>>> Research Assistant
>>>>> Center of Informatics Sciences | Nile
>>>>> University<http://nileuniversity.edu.eg/>
>>>>>
>>>>> email : [email protected]
>>>>> Phone : +2-01220887311
>>>>> http://hadyelsahar.me/
>>>>>
>>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dimitris Kontokostas
>>>> Department of Computer Science, University of Leipzig
>>>> Research Group: http://aksw.org
>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>
>>>
>>>
>>>
>>> --
>>> -------------------------------------------------
>>> Hady El-Sahar
>>> Research Assistant
>>> Center of Informatics Sciences | Nile
>>> University<http://nileuniversity.edu.eg/>
>>>
>>> email : [email protected]
>>> Phone : +2-01220887311
>>> http://hadyelsahar.me/
>>>
>>> <http://www.linkedin.com/in/hadyelsahar>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage:http://aksw.org/DimitrisKontokostas
>>
>>
>>
>>