Hi Hady, all,

some ideas...

I think we can assume that the entries for the Wikidata items come in
blocks: first all triples for Qx, then all triples for Qy, and so on. In
general, it may not be trivial to tell where one block ends and the next
one begins because the URI for Qx may not appear in all triples - there may
be triples about blank nodes or similar stuff that is only indirectly
connected to Qx. But I think it won't be very hard either.

The current task becomes easy if you rely on the data coming in blocks:
keep collecting inter-language links for Qx, stop when you encounter the
first triple for Qy, then take the links collected for Qx and generate new
triples. Pseudo-Turtle as follows:

One file with Wikidata subject URIs:

wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
...

And then one file for each language:

xxwiki-same-as.ttl:

<http://xx.dbpedia.org/resource/Xx> sameAs wikidata:Qx .
<http://xx.dbpedia.org/resource/Xx> sameAs <http://yy.dbpedia.org/resource/Yy> .
<http://xx.dbpedia.org/resource/Xx> sameAs <http://zz.dbpedia.org/resource/Zz> .

yywiki-same-as.ttl:

<http://yy.dbpedia.org/resource/Yy> sameAs wikidata:Qx .
<http://yy.dbpedia.org/resource/Yy> sameAs <http://xx.dbpedia.org/resource/Xx> .
<http://yy.dbpedia.org/resource/Yy> sameAs <http://zz.dbpedia.org/resource/Zz> .

and so on.
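
To illustrate, here is a minimal Scala sketch of that streaming approach:
it reads the Wikidata-subject file shown above and writes one file per
language. It is only a sketch: the object name, the regex-based line
parsing, the full owl:sameAs URI and the N-Triples-style input are my
assumptions, not the framework's actual API.

import java.io.PrintWriter
import scala.collection.mutable
import scala.io.Source

// Sketch only: expects lines like "<wikidata-uri> <predicate> <dbpedia-uri> ."
// and assumes all lines for one Q item are adjacent (the block assumption).
object GenerateLanguageSameAs {

  private val Triple = """<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.""".r
  private val SameAs = "http://www.w3.org/2002/07/owl#sameAs"

  // language code of a DBpedia resource URI, e.g. http://xx.dbpedia.org/resource/Xx -> xx
  private def language(uri: String): String =
    uri.stripPrefix("http://").takeWhile(_ != '.')

  def main(args: Array[String]): Unit = {
    val writers = mutable.Map[String, PrintWriter]()
    def writer(lang: String): PrintWriter =
      writers.getOrElseUpdate(lang, new PrintWriter(lang + "wiki-same-as.ttl"))

    var currentItem: String = null            // Wikidata URI of the current block
    val block = mutable.ArrayBuffer[String]() // DBpedia URIs linked from it

    // emit the quadratic set of triples for one finished block
    def flush(): Unit = {
      if (currentItem != null) {
        for (uri <- block) {
          val w = writer(language(uri))
          w.println(s"<$uri> <$SameAs> <$currentItem> .")
          for (other <- block if other != uri)
            w.println(s"<$uri> <$SameAs> <$other> .")
        }
      }
      block.clear()
    }

    for (line <- Source.fromFile(args(0)).getLines()) line match {
      case Triple(subject, _, obj) =>
        if (subject != currentItem) { flush(); currentItem = subject } // new block starts
        block += obj
      case _ => // ignore anything that is not a simple URI triple
    }
    flush()                                   // don't forget the last block
    writers.values.foreach(_.close())
  }
}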

The number of triples produced from each block is still quadratic in the
number of IL links in that block, but that's not a problem. The total
number of generated triples T (and thus a lower bound for the time
complexity) is O(L*LI), where L is the number of languages and LI is the
total number of IL links. LI is O(L*W), where W is the total number of
Wikidata items, so T is O(L^2*W). L is constant and small - 100-300 - so T
effectively grows linearly with W.
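
To make that concrete (purely illustrative numbers): a block whose item is
linked in k languages produces k lines in the Wikidata-subject file plus
roughly k*k lines across the per-language files, so an item linked in 100
languages yields about 10,000 triples, while an item linked in only 5
languages yields about 30.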

But if you really need to load millions of URIs for hundreds of languages
into RAM, that will probably be possible with some bit-twiddling. For
DBpedia 3.8, I had to load all IL links into RAM before I could process
them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
so. Here's the description from
https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala

Algorithm:

  Each URI is a combination of language code and title string. There are only ~12 million
  unique title strings in the top ~100 languages, so we save space by building an index of
  title strings and using 27 bits (enough for ~130 million titles) of the index number
  instead of the title string. We use 10 bits (enough for 1024 languages) of the language
  index instead of the language code. Taken together, these 37 bits fit into a Long. The
  lowest 27 bits are the title code, the next 10 bits are the language code. -1 is used as
  the null value.

  A link from a page to another page is represented by a Long value which contains the
  concatenation of the values for the page URIs: the upper 27 bits contain the title code
  for the 'from' URI, the next lower 10 bits contain the language code for the 'to' URI,
  and the lowest 27 bits contain the title code for the 'to' URI. All links for the 'from'
  language are stored in one array. To find an inverse link, we swap the highest and lowest
  27 bits, replace the middle 10 bits by the 'from' language, and search the array for the
  'to' language for the result. To speed up this search, we sort the array and use binary
  search.

  TODO: it's a waste of space to store each character of each title separately. Maybe a
  trie could reduce space requirements.
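
For illustration, here is a rough Scala sketch of that bit layout. The
object and method names are mine, not the ones in
ProcessInterLanguageLinks.scala; only the 27/10/27-bit layout and the
binary-search idea come from the description above.

// Sketch of the 37-bit URI code and the 64-bit link code described above.
// Constants and helper names are illustrative, not the actual ones in the script.
object LinkCodes {
  val TitleBits = 27                       // ~134 million distinct titles
  val LangBits  = 10                       // up to 1024 languages
  val TitleMask = (1L << TitleBits) - 1
  val LangMask  = (1L << LangBits) - 1

  // URI code: language index in bits 27..36, title index in bits 0..26; -1 means "no URI"
  def uriCode(lang: Int, title: Int): Long = (lang.toLong << TitleBits) | title
  def uriLang(code: Long): Int  = ((code >>> TitleBits) & LangMask).toInt
  def uriTitle(code: Long): Int = (code & TitleMask).toInt

  // Link code: 'from' title in bits 37..63, 'to' language in bits 27..36, 'to' title in bits 0..26
  def linkCode(fromTitle: Int, toLang: Int, toTitle: Int): Long =
    (fromTitle.toLong << (LangBits + TitleBits)) | (toLang.toLong << TitleBits) | toTitle

  // The inverse link: swap the two title codes and put the 'from' language in the middle
  def inverseLink(link: Long, fromLang: Int): Long = {
    val fromTitle = (link >>> (LangBits + TitleBits)) & TitleMask
    val toTitle   = link & TitleMask
    (toTitle << (LangBits + TitleBits)) | (fromLang.toLong << TitleBits) | fromTitle
  }

  // To check whether the inverse exists, binary-search the sorted array of the 'to' language
  def hasInverse(sortedLinksOfToLang: Array[Long], link: Long, fromLang: Int): Boolean =
    java.util.Arrays.binarySearch(sortedLinksOfToLang, inverseLink(link, fromLang)) >= 0
}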

Cheers,
JC
On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <
[email protected]> wrote:

> Hi Hady,
>
> Could you make an estimate on the total size of memory that you need for
> every 1M Wikidata entries? This will give us a better overview.
> You are free to make assumptions on the average data that you will need
> (URI size, language #, ...)
>
> I'd also take a look at memory-mapped files as an alternative.
> I haven't used them with Java/Scala, but from a little searching around
> there is native support, which makes them a good candidate.
>
> Cheers,
> Dimitris
>
>
> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>
>> After some playing around and a couple of consultations on Stack Overflow
>> here
>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>> and here
>> <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
>> the bottleneck is indexing the triples file for fast access instead of
>> going through the file line by line.
>>
>> Available alternatives:
>>
>> 1- indexing only the line positions of each subject in memory + using
>> fixed-length triple lines and a RandomAccessFile to reach specific lines fast
>> 2- using a key-value store like Redis, or something like SQLite
>> 3- sorting the file with merge sort, so that no index is needed at all
>> 4- using MapReduce
>>
>> I am implementing the first one and testing its reliability on large
>> data. It seems like a hack, but I guess it's suitable because it is
>> portable and doesn't require installing any libraries or infrastructure.
>>
>>
>>    - What do you think is the best approach to go with? Any other
>>    suggestions?
>>    - I have often faced such problems and solved them with hacks and
>>    workarounds, but I always wondered what the state of the art is for
>>    dealing with such problems, and whether there is a standard for it. How
>>    do you in DBpedia tackle such things?
>>
>>
>> thanks
>> Regards
>>
>>
>>
>>
>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>> [email protected]> wrote:
>>
>>> Hi Hady,
>>>
>>> You could re-use a lot of already defined utility functions for file
>>> handling & triple parsing, but you are not so familiar with the framework
>>> yet, so that will come in time.
>>> See inline for answers to your questions.
>>>
>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> Hoping that everyone is enjoying the summer,
>>>>
>>>> I've written a Scala script
>>>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>>> to generate the language-specific LL-link files to be uploaded, as
>>>> mentioned by JC here
>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>>
>>>> option 0 in the script extracts the master LL file
>>>> option 1 extracts the language-specific link files
>>>>
>>>> The first iteration of the code has complexity O(n^2), where n is the
>>>> number of lines in the master LL file. It seems rather dumb and would
>>>> take a lot of time when run on the big dump. There are plenty of easy
>>>> ways to optimize this, but I had some questions:
>>>>
>>>> 1- can we rely on the triples in the RDF dump being in order? i.e. all
>>>> Q1000 entity triples come one after another, so we don't need to parse
>>>> the rest of the file for related triples
>>>>
>>>
>>> In general no. If you need them that way you can add a "sort" step in
>>> the process pipeline
>>>
>>>
>>>> 2- for that task, which is better to optimize, memory or time? Loading
>>>> the file into a HashMap would speed things up a lot, but it may take a
>>>> fair amount of memory.
>>>>
>>>
>>> We'd prefer time, but it always depends. A few extra GB of memory should
>>> be acceptable, but loading a map with all Wikidata entries will not scale
>>> well.
>>>
>>>
>>>> 3- just out of curiosity and to set a baseline: how long does the
>>>> language-links extraction process for Wikipedia take, and do we dedicate
>>>> a special server for it, or is it just a small process that doesn't need
>>>> one?
>>>>
>>>
>>> It's a small task compared to the Wikipedia extraction. At the scale of
>>> just the language chapters it takes around 15-30 minutes. But the initial
>>> ILL dump is created during the extraction process, so it's not directly
>>> comparable.
>>>
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>>> 4- any suggestions would be great
>>>>
>>>> thanks
>>>> Regards
>>>>
>>>> -------------------------------------------------
>>>> Hady El-Sahar
>>>> Research Assistant
>>>> Center of Informatics Sciences | Nile 
>>>> University<http://nileuniversity.edu.eg/>
>>>>
>>>> email : [email protected]
>>>> Phone : +2-01220887311
>>>> http://hadyelsahar.me/
>>>>
>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dimitris Kontokostas
>>> Department of Computer Science, University of Leipzig
>>> Research Group: http://aksw.org
>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>
>>
>>
>>
>> --
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile 
>> University<http://nileuniversity.edu.eg/>
>>
>> email : [email protected]
>> Phone : +2-01220887311
>> http://hadyelsahar.me/
>>
>> <http://www.linkedin.com/in/hadyelsahar>
>>
>>
>>
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>
>
>
>