Hello Jona,
1- When I asked Dimitris whether the Wikidata items are sorted, I meant
whether we can depend on them coming in blocks; maybe I didn't make that
clear enough. If we can depend on that, the problem is solved, but can we?
They don't need to be sorted so that Q1 comes before Q2, just grouped in
blocks.
2- Converting the Wikidata dumps into files is very clever; that's a very
useful thing to know, thanks. And if we use memory-mapped files or
random-access files to read lines directly instead of parsing the file from
the beginning, the conversion should be fast.
- Most of the title codes are unique, so I would prefer using a number of
bits to represent the whole URI and then looking the URI up in the main file
by line number. What would be the problem with that?
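Something like the following is what I mean by accessing lines directly:
build a line-offset index once, then jump to any line with a RandomAccessFile
(or a memory-mapped buffer). This is only a sketch; the object and helper
names are placeholders, not existing framework code.

  import java.io.{BufferedInputStream, FileInputStream, RandomAccessFile}
  import scala.collection.mutable.ArrayBuffer

  object LineAccess {

    // One pass over the file: remember the byte offset at which every line starts.
    def buildLineIndex(path: String): Array[Long] = {
      val offsets = ArrayBuffer(0L)
      val in = new BufferedInputStream(new FileInputStream(path))
      try {
        var pos = 0L
        var b = in.read()
        while (b != -1) {
          pos += 1
          if (b == '\n') offsets += pos // the next line starts right after the newline
          b = in.read()
        }
      } finally in.close()
      offsets.toArray
    }

    // Jump straight to line `lineNumber` (0-based) instead of scanning from the start.
    def readLine(raf: RandomAccessFile, offsets: Array[Long], lineNumber: Int): String = {
      raf.seek(offsets(lineNumber))
      raf.readLine() // decodes raw bytes as Latin-1, which is fine for plain-ASCII URIs
    }

    def main(args: Array[String]): Unit = {
      val path = args(0) // e.g. the master LL triples file
      val offsets = buildLineIndex(path)
      val raf = new RandomAccessFile(path, "r")
      println(readLine(raf, offsets, 42)) // print line 42 without reading lines 0..41
      raf.close()
    }
  }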
so to wrap things up:
- Can we depend on the Wikidata items coming in chunks? That would make the
problem trivial.
- If not, the plan would be:
   - use bits to represent the title codes and languages either way
   - do the processing
   - replace the bits with the actual DBpedia URIs by looking them up via the
     LL files and title codes (a small decoding sketch follows below)
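The decoding step in the last point is nothing more than splitting the bits
again and looking the parts up in the index tables. A rough sketch, assuming
`languages` and `titles` are the lookup arrays that were filled while
encoding (they are assumptions here, not existing code):

  object UriCodec {
    val TitleBits = 27
    val LangBits  = 10
    val TitleMask: Long = (1L << TitleBits) - 1
    val LangMask: Long  = (1L << LangBits) - 1

    // Decode a packed 37-bit code (10 bits language index, 27 bits title index)
    // back into a DBpedia URI, using the index tables built during encoding.
    def decode(code: Long, languages: Array[String], titles: Array[String]): String = {
      val lang  = languages(((code >>> TitleBits) & LangMask).toInt)
      val title = titles((code & TitleMask).toInt)
      s"http://${lang}.dbpedia.org/resource/${title}"
    }
  }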
thanks
Regards
On Mon, Jul 22, 2013 at 8:16 PM, Jona Christopher Sahnwaldt <[email protected]> wrote:
> Hi Hady, all,
>
> some ideas...
>
> I think we can assume that the entries for the Wikidata items come in
> blocks: first all triples for Qx, then all triples for Qy, and so on. In
> general, it may not be trivial to tell where one block ends and the next
> one begins because the URI for Qx may not appear in all triples - there may
> be triples about blank nodes or similar stuff that is only indirectly
> connected to Qx. But I think it won't be very hard either.
>
> The current task becomes easy if you rely on the data coming in blocks:
> keep collecting inter-language links for Qx. Stop when you encounter the
> first triple for Qy. Then just take the data collected from Qx and generate
> new triples. Pseudo-Turtle as follows:
>
> One file with Wikidata subject URIs:
>
> wikidata:Qx sameAs <http://xx.dbpedia.org/resource/Xx> .
> wikidata:Qx sameAs <http://yy.dbpedia.org/resource/Yy> .
> wikidata:Qx sameAs <http://zz.dbpedia.org/resource/Zz> .
> ...
>
> And then one file for each language:
>
> xxwiki-same-as.ttl:
>
> <http://xx.dbpedia.org/resource/Xx>
> sameAs wikidata:Qx
> <http://xx.dbpedia.org/resource/Xx>
> sameAs <http://yy.dbpedia.org/resource/Yy>
> <http://xx.dbpedia.org/resource/Xx>
> sameAs <http://zz.dbpedia.org/resource/Zz>
>
> yywiki-same-as.ttl:
>
> <http://yy.dbpedia.org/resource/Yy>
> sameAs wikidata:Qx
> <http://yy.dbpedia.org/resource/Yy>
> sameAs <http://xx.dbpedia.org/resource/Xx>
> <http://yy.dbpedia.org/resource/Yy>
> sameAs <http://zz.dbpedia.org/resource/Zz>
>
> and so on.
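Just to check that I follow the block-wise pass, here is a rough Scala sketch
of it. The output file names follow your example; the line parsing and
language extraction are simplified placeholders, and it ignores the
blank-node complication you mention:

  import java.io.PrintWriter
  import scala.io.Source

  object GenerateLanguageLinks {

    // crude split of one pseudo-Turtle line into (subject, object);
    // real code would use the framework's triple parser instead
    def parseLine(line: String): (String, String) = {
      val parts = line.split("\\s+")
      (parts(0), parts(2))
    }

    // "<http://xx.dbpedia.org/resource/Xx>" -> "xx"
    def language(uri: String): String = uri.stripPrefix("<http://").takeWhile(_ != '.')

    def main(args: Array[String]): Unit = {
      val writers = scala.collection.mutable.Map[String, PrintWriter]()
      def writer(lang: String) =
        writers.getOrElseUpdate(lang, new PrintWriter(s"${lang}wiki-same-as.ttl"))

      var currentItem: String = null
      var block: List[String] = Nil // DBpedia URIs collected for the current Qx

      def emitBlock(): Unit = {
        // each URI in the block links back to Qx and to every other URI in the block
        for (uri <- block) {
          val out = writer(language(uri))
          out.println(s"$uri sameAs $currentItem .")
          for (other <- block if other != uri) out.println(s"$uri sameAs $other .")
        }
        block = Nil
      }

      for (line <- Source.fromFile(args(0)).getLines() if line.contains("sameAs")) {
        val (subject, obj) = parseLine(line)
        if (subject != currentItem) { // first triple of the next item: flush the finished block
          if (currentItem != null) emitBlock()
          currentItem = subject
        }
        block = obj :: block
      }
      if (currentItem != null) emitBlock()
      writers.values.foreach(_.close())
    }
  }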
>
> The number of triples produced from each block is still quadratic in the
> number of IL links in each block, but that's not a problem. The total
> number of generated triples T (and thus a lower bound for time complexity)
> is O(L*LI), where L is the number of languages and LI is the total number
> of IL links. LI is O(L*W), where W is the total number of Wikidata items,
> so T is O(L^2*W). L is constant and small (100-300).
>
> But if you really need to load millions of URIs for hundreds of languages
> into RAM, that will probably be possible with some bit-twiddling. For
> DBpedia 3.8, I had to load all IL links into RAM before I could process
> them, so I wrote ProcessInterLanguageLinks.scala, which ran in an hour or
> so. Here's the description from
> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>
> Algorithm:
>
> Each URI is a combination of language code and title string. There are
> only ~12 million
> unique title strings in the top ~100 languages, so we save space by
> building an index of title
> strings and using 27 bits (enough for ~130 million titles) of the index
> number instead of the
> title string. We use 10 bits (enough for 1024 languages) of the language
> index instead of the
> language code. Taken together, these 37 bits fit into a Long. The lowest
> 27 bits are the title code,
> the next 10 bits are the language code. -1 is used as the null value.
>
> A link from a page to another page is represented by a Long value which
> contains the
> concatenation of the values for the page URIs: the upper 27 bits contain
> the title code
> for the 'from' URI, the next lower 10 bits contain the language code for
> the 'to' URI, and the
> lowest 27 bits contain the title code for the 'to' URI. All links for
> the 'from' language
> are stored in one array. To find an inverse link, we swap the highest
> and lowest 27 bits,
> replace the middle 10 bits by the 'from' language, and search the array
> for the 'to' language
> for the result. To speed up this search, we sort the array and use
> binary search.
>
> TODO: it's a waste of space to store each character of each title
> separately. Maybe a trie
> could reduce space requirements.
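That bit layout is pretty much what I had in mind. Just to make sure I read
it correctly, here is a minimal sketch of the packing and the inverse-link
lookup; the constants and names are mine, not the actual
ProcessInterLanguageLinks code:

  object LinkBits {
    // Layout from the description above:
    //   URI  = [10 bits language | 27 bits title]                      (37 bits, fits in a Long)
    //   link = [27 bits 'from' title | 10 bits 'to' language | 27 bits 'to' title]
    val TitleBits = 27
    val LangBits  = 10
    val TitleMask: Long = (1L << TitleBits) - 1
    val LangMask: Long  = (1L << LangBits) - 1

    def packUri(lang: Int, title: Int): Long =
      (lang.toLong << TitleBits) | title.toLong

    def packLink(fromTitle: Int, toLang: Int, toTitle: Int): Long =
      (fromTitle.toLong << (LangBits + TitleBits)) | (toLang.toLong << TitleBits) | toTitle.toLong

    // Key of the inverse link: swap the two title codes and put the 'from'
    // language (the language whose array this link came from) in the middle.
    def inverseKey(link: Long, fromLang: Int): Long = {
      val fromTitle = link >>> (LangBits + TitleBits) // highest 27 bits
      val toTitle   = link & TitleMask                // lowest 27 bits
      (toTitle << (LangBits + TitleBits)) | (fromLang.toLong << TitleBits) | fromTitle
    }

    // linksByLang(lang) holds that language's links, sorted, so the inverse
    // of a link can be found by binary search.
    def hasInverse(linksByLang: Array[Array[Long]], link: Long, fromLang: Int): Boolean = {
      val toLang = ((link >>> TitleBits) & LangMask).toInt
      java.util.Arrays.binarySearch(linksByLang(toLang), inverseKey(link, fromLang)) >= 0
    }
  }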
>
> Cheers,
> JC
> On Jul 22, 2013 6:52 PM, "Dimitris Kontokostas" <
> [email protected]> wrote:
>
>> Hi Hady,
>>
>> Could you make an estimate on the total size of memory that you need for
>> every 1M Wikidata entries? This will give us a better overview.
>> You are free to make assumptions on the average data that you will need
>> (URI size, language #, ...)
>>
>> I'd also take a look at "memory-mapped files" as an alternative.
>> I haven't used them with Java/Scala, but from searching around a little
>> there is native support, which makes them a good candidate.
>>
>> Cheers,
>> Dimitris
>>
>>
>> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:
>>
>>> After some playing around and a couple of consultations on Stack Overflow,
>>> here
>>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>>> and
>>> here <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
>>> the bottleneck is indexing the triples file for fast access instead of
>>> going through the file line by line.
>>>
>>> Available alternatives:
>>>
>>> 1- indexing only the lines of each subject in memory + using fixed-length
>>> triple lines and a RandomAccessFile to access specific lines fast
>>> 2- using a key-value store like Redis, or something like SQLite
>>> 3- sorting the file with merge sort, so we don't need an index at all
>>> 4- using MapReduce
>>>
>>> I am implementing the first one and testing its reliability on large
>>> data. It seems like a hack, but I guess it's suitable because it would be
>>> portable and wouldn't need any libraries or infrastructure to be installed.
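For reference, a minimal sketch of what I mean by alternative 1: with
fixed-length lines, line i starts at byte i * recordLength, so both reading
and updating a line are a single seek. The record length, the padding scheme
and the names are just illustrative:

  import java.io.RandomAccessFile
  import java.nio.charset.StandardCharsets

  object FixedLengthLines {

    // Every line is padded to exactly `recordLength` bytes (including the trailing '\n'),
    // so no index is needed: line i starts at byte i * recordLength.
    def readLine(raf: RandomAccessFile, recordLength: Int, lineNumber: Long): String = {
      val buf = new Array[Byte](recordLength)
      raf.seek(lineNumber * recordLength)
      raf.readFully(buf)
      new String(buf, StandardCharsets.UTF_8).trim // strip the space padding and newline
    }

    // Overwrite line `lineNumber` in place, padding it back to the fixed record size.
    def writeLine(raf: RandomAccessFile, recordLength: Int, lineNumber: Long, line: String): Unit = {
      val bytes = line.getBytes(StandardCharsets.UTF_8)
      require(bytes.length < recordLength, "line longer than the fixed record size")
      val buf = Array.fill[Byte](recordLength)(' '.toByte)
      Array.copy(bytes, 0, buf, 0, bytes.length)
      buf(recordLength - 1) = '\n'.toByte
      raf.seek(lineNumber * recordLength)
      raf.write(buf)
    }
  }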
>>>
>>>
>>> - What do you think is the best approach to go with? Any other
>>> suggestions?
>>> - I have always faced such problems and solved them with hacks and
>>> workarounds, but I have always wondered what the state of the art is for
>>> dealing with them, and whether there is a standard approach. How do you
>>> tackle such things in DBpedia?
>>>
>>>
>>> thanks
>>> Regards
>>>
>>>
>>>
>>>
>>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>>> [email protected]> wrote:
>>>
>>>> Hi Hady,
>>>>
>>>> You could re-use a lot of already defined utility functions for file
>>>> handling & triple parsing, but you are not so familiar with the framework
>>>> yet, so that will come in time.
>>>> See inline for your questions
>>>>
>>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar
>>>> <[email protected]>wrote:
>>>>
>>>>> Hello all ,
>>>>>
>>>>> Hoping that everyone is enjoying the summer ,
>>>>>
>>>>> I've written a Scala
>>>>> script <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>>>> to do the task of generating the language-specific LL link files to be
>>>>> uploaded, as mentioned by JC here
>>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>>>
>>>>> Option 0 in the script extracts the master LL file,
>>>>> option 1 extracts the language-specific link files.
>>>>>
>>>>> The first iteration of the code has complexity O(n^2), where n is the
>>>>> number of lines in the master LL file. It seems quite dumb and would take
>>>>> a lot of time when run on the big dump. There are a lot of easy ways to
>>>>> optimize this, but I had some questions:
>>>>>
>>>>> 1- Could we depend on the RDF triples dump being in order? (i.e., all
>>>>> Q1000 entity triples come one after another, so we don't need to parse the
>>>>> rest of the file for related triples)
>>>>>
>>>>
>>>> In general no. If you need them that way you can add a "sort" step in
>>>> the process pipeline
>>>>
>>>>
>>>>> 2- In this task, which is better to optimize, memory or time? Loading
>>>>> the file into a HashMap would speed things up a lot, but it may take some
>>>>> memory.
>>>>>
>>>>
>>>> We'd prefer time, but it always depends. A few extra GB of memory should
>>>> be acceptable, but if you want to load a map with all Wikidata entries,
>>>> that will not scale well.
>>>>
>>>>
>>>>> 3- Just out of curiosity and to set a baseline: how long does the
>>>>> language links extraction process for Wikipedia take, and do we dedicate a
>>>>> special server to it? Or does it not need one because it's just a small
>>>>> process?
>>>>>
>>>>
>>>> It's a small task compared to the Wikipedia extraction. At the scale of
>>>> only the language chapters it's around 15-30 minutes. But the initial ILL
>>>> dump is created with the extraction process, so it's not directly comparable.
>>>>
>>>>
>>>> Best,
>>>> Dimitris
>>>>
>>>>
>>>>> 4- Any suggestions would be great.
>>>>>
>>>>> thanks
>>>>> Regards
>>>>>
>>>>
>>>> --
>>>> Dimitris Kontokostas
>>>> Department of Computer Science, University of Leipzig
>>>> Research Group: http://aksw.org
>>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage:http://aksw.org/DimitrisKontokostas
>>
>>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>
email : [email protected]
Phone : +2-01220887311
http://hadyelsahar.me/
<http://www.linkedin.com/in/hadyelsahar>