On Wed, Jul 24, 2013 at 9:42 AM, Hady elsahar <[email protected]> wrote:
> Hello Jona,
>
> 1- I asked Dimitris whether the Wikidata items are sorted or not; what I
> meant was whether we can rely on them coming in blocks. Maybe I didn't make
> that clear enough. If we can rely on that, the problem is solved, but can
> we? They don't need to be sorted so that Q1 comes before Q2, just grouped in blocks.
>
Yup, sorted is different from blocks :) Even a manual sequence of items
(i.e. http://www.wikidata.org/wiki/Special:EntityData/Q1.nt, Q2, ...) will
*not* be sorted, because of the interwiki links, but it will be in blocks.
> 2- Converting the Wikidata dumps into files is very clever; that's a very
> useful thing to know, thanks. And given that we use memory-mapped files and
> random access files to access files by line instead of parsing them from the
> beginning, the conversion should be fast.
>
>    - Most of the title codes are unique, so I prefer using a number of bits
>    to represent the whole URI and accessing the URI in the main file by line
>    number. What would be the problem with that?
>
>
> So, to wrap things up:
>
>
>    - Can we rely on Wikidata items coming in chunks? That makes the
>    problem trivial.
>
> I also think we can depend on that.
For building the interlanguage links this becomes trivial, as Jona said, but
I still think that we will need to build an index later.
It's not 100% clear right now what we will need to store then, so let's leave
it and deal with it when the time comes.
Cheers,
Dimitris
>
>    - If not:
>       - use bits to represent the title codes and languages either way
>       - do the processing
>       - replace the bits with the actual DBpedia URIs by searching the LL
>       files by title code
>
> thanks
> Regards
>
>
>
>
> On Mon, Jul 22, 2013 at 8:16 PM, Jona Christopher Sahnwaldt <
> [email protected]> wrote:
>
>> Hi Hady, all,
>>
>> some ideas...
>>
>> I think we can assume that the entries for the Wikidata items come in
>> blocks: first all triples for Qx, then all triples for Qy, and so on. In
>> general, it may not be trivial to tell where one block ends and the next
>> one begins because the URI for Qx may not appear in all triples - there may
>> be triples about blank nodes or similar stuff that is only indirectly
>> connected to Qx. But I think it won't be very hard either.
>>
>> The current task becomes easy if you rely on the data coming in blocks:
>> keep collecting inter-language links for Qx. Stop when you encounter the
>> first triple for Qy. Then just take the data collected from Qx and generate
>> new triples. Pseudo-Turtle as follows:
>>
>> One file with Wikidata subject URIs:
>>
>> wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
>> wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
>> wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
>> ...
>>
>> And then one file for each language:
>>
>> xxwiki-same-as.ttl:
>>
>> <http://xx.dbpedia.org/resource/Xx> sameAs wikidata:Qx .
>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://yy.dbpedia.org/resource/Yy> .
>> <http://xx.dbpedia.org/resource/Xx> sameAs <http://zz.dbpedia.org/resource/Zz> .
>>
>> yywiki-same-as.ttl:
>>
>> <http://yy.dbpedia.org/resource/Yy> sameAs wikidata:Qx .
>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://xx.dbpedia.org/resource/Xx> .
>> <http://yy.dbpedia.org/resource/Yy> sameAs <http://zz.dbpedia.org/resource/Zz> .
>>
>> and so on.
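>>
>> If you rely on the blocks, a rough sketch of that streaming approach could
>> look like this (hypothetical helper names, not code from the framework):
>> buffer the links of the current Qx and flush them to the per-language
>> outputs whenever the subject changes.
>>
>> import scala.collection.mutable.ArrayBuffer
>>
>> // minimal sketch: `lines` yields (wikidataItem, language, dbpediaUri) tuples and
>> // `writeTriple(language, subject, obj)` appends one sameAs triple to that language's file
>> def processBlocks(lines: Iterator[(String, String, String)],
>>                   writeTriple: (String, String, String) => Unit): Unit = {
>>   var currentItem: String = null
>>   val links = new ArrayBuffer[(String, String)]() // (language, DBpedia URI) of the current block
>>
>>   def flush(): Unit = {
>>     for ((lang, uri) <- links) {
>>       writeTriple(lang, uri, currentItem)          // <uri> sameAs wikidata:Qx
>>       for ((_, other) <- links if other != uri)
>>         writeTriple(lang, uri, other)              // <uri> sameAs <other>
>>     }
>>     links.clear()
>>   }
>>
>>   for ((item, lang, uri) <- lines) {
>>     if (item != currentItem) { if (currentItem != null) flush(); currentItem = item }
>>     links += ((lang, uri))
>>   }
>>   if (currentItem != null) flush()
>> }
>>
>> The quadratic part stays confined to one block, so memory only has to hold
>> the links of the largest block at any time.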
>>
>> The number of triples produced from each block is still quadratic in the
>> number of IL links in each block, but that's not a problem. The total
>> number of generated triples T (and thus a lower bound for time complexity)
>> is O(L*LI), where L is the number of languages and LI is the total number
>> of IL links. LI is O(L*W), where W is the total number of Wikidata items,
>> so T is O(L^2*W). L is constant and small - 100-300.
>>
>> But if you really need to load millions of URIs for hundreds of languages
>> into RAM, that will probably be possible with some bit-twiddling. For
>> DBpedia 3.8, I had to load all IL links into RAM before I could process
>> them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
>> so. Here's the description from
>> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>>
>> Algorithm:
>>
>> Each URI is a combination of language code and title string. There are
>> only ~12 million
>> unique title strings in the top ~100 languages, so we save space by
>> building an index of title
>> strings and using 27 bits (enough for ~130 million titles) of the index
>> number instead of the
>> title string. We use 10 bits (enough for 1024 languages) of the
>> language index instead of the
>> language code. Taken together, these 37 bits fit into a Long. The
>> lowest 27 bits are the title code,
>> the next 10 bits are the language code. -1 is used as the null value.
>>
>> A link from a page to another page is represented by a Long value which
>> contains the
>> concatenation of the values for the page URIs: the upper 27 bits
>> contain the title code
>> for the 'from' URI, the next lower 10 bits contain the language code
>> for the 'to' URI, and the
>> lowest 27 bits contain the title code for the 'to' URI. All links for
>> the 'from' language
>> are stored in one array. To find an inverse link, we swap the highest
>> and lowest 27 bits,
>> replace the middle 10 bits by the 'from' language, and search the array
>> for the 'to' language
>> for the result. To speed up this search, we sort the array and use
>> binary search.
>>
>> TODO: it's a waste of space to store each character of each title
>> separately. Maybe a trie
>> could reduce space requirements.
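>>
>> Roughly, the packing described above looks like this (an illustrative
>> sketch, not the actual code from ProcessInterLanguageLinks.scala; the names
>> are made up):
>>
>> object UriCodes {
>>   // layout from the description above: low 27 bits = title index, next 10 bits = language index
>>   val TitleBits = 27
>>   val LangBits  = 10
>>   val TitleMask = (1L << TitleBits) - 1
>>   val LangMask  = (1L << LangBits) - 1
>>
>>   // pack a (language index, title index) pair into the low 37 bits of a Long
>>   def packUri(lang: Int, title: Int): Long =
>>     (lang.toLong << TitleBits) | (title.toLong & TitleMask)
>>
>>   // pack a link: high 27 bits = 'from' title, middle 10 bits = 'to' language, low 27 bits = 'to' title
>>   def packLink(fromTitle: Int, toLang: Int, toTitle: Int): Long =
>>     (fromTitle.toLong << (LangBits + TitleBits)) |
>>       ((toLang.toLong & LangMask) << TitleBits) |
>>       (toTitle.toLong & TitleMask)
>>
>>   // key of the inverse link: swap the two title codes and substitute the 'from' language;
>>   // the result is then binary-searched in the sorted link array of the 'to' language
>>   def inverseLink(link: Long, fromLang: Int): Long =
>>     packLink((link & TitleMask).toInt, fromLang, (link >>> (LangBits + TitleBits)).toInt)
>> }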
>>
>> Cheers,
>> JC
>> On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <
>> [email protected]> wrote:
>>
>>> Hi Hady,
>>>
>>> Could you make an estimate of the total amount of memory that you need for
>>> every 1M Wikidata entries? This will give us a better overview.
>>> You are free to make assumptions about the average data that you will need
>>> (URI size, number of languages, ...).
>>>
>>> I'd also take a look at memory-mapped files as an alternative.
>>> I haven't used them with Java/Scala, but from searching around a little
>>> there is native support, which makes them a good candidate.
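>>>
>>> Something like this, I think (just a quick java.nio sketch; the file name
>>> and offset are made up):
>>>
>>> import java.io.RandomAccessFile
>>> import java.nio.channels.FileChannel
>>>
>>> // map the file into memory; the OS pages it in on demand, so we can jump to
>>> // arbitrary byte offsets without reading from the beginning of the file
>>> // (a single MappedByteBuffer is limited to 2 GB, so big dumps need several maps)
>>> val file    = new RandomAccessFile("wikidata-links.nt", "r") // hypothetical file name
>>> val channel = file.getChannel
>>> val buffer  = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size)
>>>
>>> buffer.position(12345)       // jump to a byte offset taken from some index (made-up value)
>>> val firstByte = buffer.get() // read from there without touching the rest of the file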
>>>
>>> Cheers,
>>> Dimitris
>>>
>>>
>>> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>>>
>>>> After some playing around and a couple of consultations on Stack Overflow
>>>> here
>>>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>>>> and here
>>>> <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
>>>> the bottleneck is indexing the triples file for fast
>>>> access instead of going through the file line by line.
>>>>
>>>> The available alternatives:
>>>>
>>>> 1- indexing only the line number of each subject in memory + using
>>>> fixed-length triple lines and a RandomAccessFile to access specific
>>>> lines quickly
>>>> 2- using a key-value store like Redis or something like SQLite
>>>> 3- sorting the file with merge sort, so we don't need an index at all
>>>> 4- using MapReduce
>>>>
>>>> I am implementing the first one and testing its reliability on large
>>>> data. It seems like a hack, but I guess it's suitable because it's
>>>> portable and doesn't require installing any libraries or infrastructure.
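>>>>
>>>> Concretely, for option 1 I mean something like this (a hypothetical
>>>> sketch, assuming the lines are padded to a fixed length so a line number
>>>> maps directly to a byte offset):
>>>>
>>>> import java.io.RandomAccessFile
>>>>
>>>> val lineLength = 256 // assumed fixed (padded) line length
>>>> val raf = new RandomAccessFile("triples-fixed-width.nt", "r") // hypothetical file
>>>>
>>>> // only a (subject -> first line number) map has to stay in memory;
>>>> // the line itself is read directly from disk when needed
>>>> def readLine(lineNumber: Long): String = {
>>>>   raf.seek(lineNumber * lineLength)
>>>>   val bytes = new Array[Byte](lineLength)
>>>>   raf.readFully(bytes)
>>>>   new String(bytes, "UTF-8").trim
>>>> }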
>>>>
>>>>
>>>>    - What do you think is the best approach to go with? Any other
>>>>    suggestions?
>>>>    - I have always faced such problems and solved them with hacks and
>>>>    workarounds, but I have always wondered what the state of the art is
>>>>    for dealing with such problems, and whether there is a standard way.
>>>>    How do you tackle such things in DBpedia?
>>>>
>>>>
>>>> thanks
>>>> Regards
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Hady,
>>>>>
>>>>> You could re-use a lot of already defined utility functions for file
>>>>> handling & triple parsing, but you are not so familiar with the framework
>>>>> yet, so that will come in time.
>>>>> See inline for answers to your questions.
>>>>>
>>>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> Hello all ,
>>>>>>
>>>>>> Hoping that everyone is enjoying the summer ,
>>>>>>
>>>>>> I've written a Scala script
>>>>>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>>>>> to generate the language-specific LL links files to be uploaded, as
>>>>>> mentioned by JC here
>>>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>>>>
>>>>>> Option 0 in the script extracts the master LL file;
>>>>>> option 1 extracts the language-specific links files.
>>>>>>
>>>>>> The first iteration of the code has complexity O(n^2), where n is the
>>>>>> number of lines in the master LL file. It seems rather dumb and would
>>>>>> take a lot of time when run on the big dump. There are a lot of easy
>>>>>> ways to optimize this, but I had some questions:
>>>>>>
>>>>>> 1- Can we rely on the triples in the RDF dump being in order?
>>>>>> (i.e., for example, all Q1000 entity triples come one after another and we
>>>>>> don't need to parse the rest of the file for related triples)
>>>>>>
>>>>>
>>>>> In general no. If you need them that way you can add a "sort" step in
>>>>> the process pipeline
>>>>>
>>>>>
>>>>>> 2- For that task, which is better to optimize, memory or time?
>>>>>> Loading the file into a HashMap would speed things up a lot, but it may
>>>>>> take some memory.
>>>>>>
>>>>>
>>>>> We'd prefer time, but it always depends. A few extra GB of memory
>>>>> should be acceptable, but if you want to load a map with all Wikidata
>>>>> entries, that will not scale well.
>>>>>
>>>>>
>>>>>> 3- Just out of curiosity and to set a baseline: the language links
>>>>>> extraction process for Wikipedia, how long does it take, and do we
>>>>>> dedicate a special server to it? Or does it not need one because it's
>>>>>> just a small process?
>>>>>>
>>>>>
>>>>> It's a small task compared to the Wikipedia extraction. On the scale
>>>>> of only the language chapters it takes around 15-30 minutes. But the initial
>>>>> ILL dump is created with the extraction process, so it's not directly
>>>>> comparable.
>>>>>
>>>>>
>>>>> Best,
>>>>> Dimitris
>>>>>
>>>>>
>>>>>> 4- Any suggestions would be great.
>>>>>>
>>>>>> thanks
>>>>>> Regards
>>>>>>
>>>>>> -------------------------------------------------
>>>>>> Hady El-Sahar
>>>>>> Research Assistant
>>>>>> Center of Informatics Sciences | Nile
>>>>>> University<http://nileuniversity.edu.eg/>
>>>>>>
>>>>>> email : [email protected]
>>>>>> Phone : +2-01220887311
>>>>>> http://hadyelsahar.me/
>>>>>>
>>>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dimitris Kontokostas
>>>>> Department of Computer Science, University of Leipzig
>>>>> Research Group: http://aksw.org
>>>>> Homepage: http://aksw.org/DimitrisKontokostas
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -------------------------------------------------
>>>> Hady El-Sahar
>>>> Research Assistant
>>>> Center of Informatics Sciences | Nile
>>>> University<http://nileuniversity.edu.eg/>
>>>>
>>>> email : [email protected]
>>>> Phone : +2-01220887311
>>>> http://hadyelsahar.me/
>>>>
>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dimitris Kontokostas
>>> Department of Computer Science, University of Leipzig
>>> Research Group: http://aksw.org
>>> Homepage: http://aksw.org/DimitrisKontokostas
>>>
>>>
>>>
>>>
>
>
> --
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile
> University<http://nileuniversity.edu.eg/>
>
> email : [email protected]
> Phone : +2-01220887311
> http://hadyelsahar.me/
>
> <http://www.linkedin.com/in/hadyelsahar>
>
>
>
>
>
--
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage: http://aksw.org/DimitrisKontokostas
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers