Hello Dimitris,
For memory-mapped files I used to use the path /dev/shm, since it's a
temp path for shared memory between processes and is kept in RAM. It's
volatile, so we should copy files out of this folder at the end of the
process.
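For what it's worth, Java's NIO makes the mapping itself fairly painless. Here is a minimal, hypothetical sketch (class name, file names, and the demo string are made up for illustration) of writing and reading through a `MappedByteBuffer`; on Linux you could point the path at /dev/shm to keep it RAM-backed:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    // Write text through a memory mapping and read it back.
    // On Linux, a path under /dev/shm keeps the mapped pages in RAM (tmpfs),
    // but the contents are volatile and must be copied out afterwards.
    public static String roundTrip(Path path, String text) {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
            buf.put(bytes);                 // write through the mapping
            buf.flip();
            byte[] back = new byte[bytes.length];
            buf.get(back);                  // read from the same mapping
            return new String(back, StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Self-contained demo using a portable temp file; on Linux you could
    // use a path like /dev/shm/ll-demo.bin instead (hypothetical name).
    public static String demo() {
        try {
            Path p = Files.createTempFile("mmap-demo", ".bin");
            String out = roundTrip(p, "hello");
            Files.deleteIfExists(p);
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```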
*For the memory estimation:*
It depends on how much we can optimize. The master LL file is around 25 KB
per entity; with around 12M entities in Wikidata (I checked the largest
Q-id, though I'm not sure of the exact number), this adds up to a very
large total, around 300 GB.
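Just to make that arithmetic explicit, here is a quick sanity check (both inputs are the rough assumptions above, not measured values):

```java
public class LlSizeEstimate {
    // Back-of-envelope numbers from this thread (both are rough assumptions):
    static final long ENTITIES = 12_000_000L;        // ~12M Wikidata entities
    static final long BYTES_PER_ENTITY = 25 * 1024L; // ~25 KB of master LL data each

    public static long totalGigabytes() {
        // Integer division is fine for an order-of-magnitude estimate.
        return ENTITIES * BYTES_PER_ENTITY / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(totalGigabytes() + " GB"); // ~286 GB, i.e. roughly 300 GB
    }
}
```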
However, if we index less information (like the line number in my case, or
some bit code as JC suggested), this can be reduced a lot, to only a few
hundred megabytes, but then it won't be as straightforward (i.e., using
random-access files or any such thing).
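To illustrate what I mean by the random-access-file approach (the fixed-length idea from earlier in the thread): if every triple line is padded to a fixed width, line i starts at a known byte offset and can be seeked to directly. This is only a sketch; the record width, helper names, and sample triples are invented for the example:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FixedLengthLines {
    // Hypothetical record width: every triple line is padded to WIDTH bytes,
    // so line i starts at byte offset i * WIDTH. (Lines longer than WIDTH-1
    // bytes would need a larger width; no truncation is done here.)
    static final int WIDTH = 64;

    public static void writeLine(RandomAccessFile f, long lineNo, String text) {
        try {
            // Pad with spaces to WIDTH-1 chars, then a newline -> WIDTH bytes.
            byte[] rec = String.format("%-" + (WIDTH - 1) + "s%n", text)
                               .getBytes(StandardCharsets.US_ASCII);
            f.seek(lineNo * WIDTH);
            f.write(rec);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static String readLine(RandomAccessFile f, long lineNo) {
        try {
            byte[] rec = new byte[WIDTH];
            f.seek(lineNo * WIDTH);        // jump straight to the record
            f.readFully(rec);
            return new String(rec, StandardCharsets.US_ASCII).trim();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Demo: write records out of order, then read one back by line number.
    public static String demo() {
        try {
            File tmp = File.createTempFile("fixed-lines", ".nt");
            tmp.deleteOnExit();
            try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
                writeLine(f, 2, "<Q1000> <sameAs> <dbpedia:Thing> .");
                writeLine(f, 0, "<Q42> <sameAs> <dbpedia:Douglas_Adams> .");
                return readLine(f, 2);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```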
On Mon, Jul 22, 2013 at 6:51 PM, Dimitris Kontokostas <
[email protected]> wrote:
> Hi Hady,
>
> Could you make an estimate on the total size of memory that you need for
> every 1M Wikidata entries? This will give us a better overview.
> You are free to make assumptions on the average data that you will need
> (URI size, language #, ...)
>
> I'd also take a look at "memory-mapped files" as an alternative.
> I haven't used them with Java/Scala, but from searching around a little
> there is native support, which makes them a good candidate.
>
> Cheers,
> Dimitris
>
>
> On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]>wrote:
>
>> After some playing around and a couple of consultations on Stack Overflow
>> here
>> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
>> and
>> here<http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>
>> the bottleneck is indexing the triples file for fast access, instead of
>> going through the file line by line.
>>
>> Available alternatives:
>>
>> 1- indexing only the lines of each subject in memory + using fixed-length
>> triple lines and a RandomAccessFile to access specific lines fast
>> 2- using a key-value store like Redis, or something like SQLite
>> 3- sorting the file with merge sort, so that we don't need an index at all
>> 4- using MapReduce
>>
>> I am implementing the first one and testing its reliability on large
>> data. It seems like a hack, but I guess it's suitable because it is
>> portable and doesn't need any libraries or infrastructure to be installed.
>>
>>
>> - What do you think is the best approach to take? Any other suggestions?
>> - I have always faced such problems and solved them with hacks and
>> workarounds, but I've always wondered what the state of the art is for
>> dealing with such problems, and whether there's a standard for that.
>> How do you guys at DBpedia tackle such things?
>>
>>
>> thanks
>> Regards
>>
>>
>>
>>
>> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
>> [email protected]> wrote:
>>
>>> Hi Hady,
>>>
>>> You could re-use a lot of already defined utility functions for file &
>>> triple parsing, but you are not so familiar with the framework yet, so
>>> that will come in time.
>>> See inline for your questions.
>>>
>>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]>wrote:
>>>
>>>> Hello all,
>>>>
>>>> Hoping that everyone is enjoying the summer.
>>>>
>>>> I've written a Scala
>>>> script<https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>to
>>>> generate the language-specific LL-links files to be uploaded, as
>>>> mentioned by JC here
>>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>>
>>>> Option 0 in the script extracts the master LL file;
>>>> option 1 extracts the language-specific links files.
>>>>
>>>> The first iteration of the code has complexity O(n^2), where n is the
>>>> number of lines in the master LL file. It seems quite dumb and would
>>>> take a lot of time when run on the big dump. There are many easy ways
>>>> to optimize this, but I had some questions:
>>>>
>>>> 1- Can we rely on the triples in the RDF dump being in order? i.e.,
>>>> will all triples of an entity (for example, Q1000) come one after
>>>> another, so that we don't need to parse the rest of the file for
>>>> related triples?
>>>>
>>>
>>> In general, no. If you need them that way, you can add a "sort" step to
>>> the processing pipeline.
>>>
>>>
>>>> 2- For that task, which is better to optimize: memory or time? Loading
>>>> the file into a HashMap will speed things up a lot, but it may take
>>>> some memory.
>>>>
>>>
>>> We'd prefer time, but it always depends. A few extra GB of memory
>>> should be acceptable, but if you want to load a map with all Wikidata
>>> entries, that will not scale well.
>>>
>>>
>>>> 3- Just out of curiosity, and for setting standards: how much time
>>>> does the language-links extraction process in Wikipedia take, and do we
>>>> dedicate a special server to it? Or is it just a small process that
>>>> doesn't need one?
>>>>
>>>
>>> It's a small task compared to the Wikipedia extraction. At the scale of
>>> only the language chapters, it takes around 15-30 minutes. But the
>>> initial ILL dump is created with the extraction process, so it's not
>>> directly comparable.
>>>
>>>
>>> Best,
>>> Dimitris
>>>
>>>
>>>> 4- Any suggestions would be great.
>>>>
>>>> thanks
>>>> Regards
>>>>
>>>> -------------------------------------------------
>>>> Hady El-Sahar
>>>> Research Assistant
>>>> Center of Informatics Sciences | Nile
>>>> University<http://nileuniversity.edu.eg/>
>>>>
>>>> email : [email protected]
>>>> Phone : +2-01220887311
>>>> http://hadyelsahar.me/
>>>>
>>>> <http://www.linkedin.com/in/hadyelsahar>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> See everything from the browser to the database with AppDynamics
>>>> Get end-to-end visibility with application monitoring from AppDynamics
>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>> Start your free trial of AppDynamics Pro today!
>>>>
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Dbpedia-developers mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
>>>>
>>>>
>>>
>>>
>>> --
>>> Dimitris Kontokostas
>>> Department of Computer Science, University of Leipzig
>>> Research Group: http://aksw.org
>>> Homepage:http://aksw.org/DimitrisKontokostas
>>>
>>
>>
>>
>> --
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile
>> University<http://nileuniversity.edu.eg/>
>>
>> email : [email protected]
>> Phone : +2-01220887311
>> http://hadyelsahar.me/
>>
>> <http://www.linkedin.com/in/hadyelsahar>
>>
>>
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>
--
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University<http://nileuniversity.edu.eg/>
email : [email protected]
Phone : +2-01220887311
http://hadyelsahar.me/
<http://www.linkedin.com/in/hadyelsahar>