Hi Hady,

Could you make an estimate of the total amount of memory you need for
every 1M Wikidata entries? This will give us a better overview.
You are free to make assumptions about the average data you will need
(URI size, number of languages, ...).
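
Just to make the exercise concrete, the estimate could look something like
the sketch below. Every number in it is a placeholder assumption, not a
measurement:

    // purely illustrative back-of-envelope calculation, all numbers are assumptions
    val uriBytes      = 80L        // assumed average URI length in bytes
    val linksPerEntry = 50L        // assumed average number of language links per entry
    val overheadBytes = 48L        // assumed per-entry JVM object / map overhead
    val entries       = 1000000L
    val totalBytes    = entries * (overheadBytes + linksPerEntry * uriBytes)
    println(f"~${totalBytes / (1024.0 * 1024.0)}%.0f MB per 1M entries")  // ~3860 MB with these numbers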

I'd also take a look at "memory-mapped files" as an alternative.
I haven't used them with Java/Scala, but from a little searching around
there is native support, which makes them a good candidate.
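
For reference, a minimal sketch of how that could look from Scala via
java.nio (the file name is just an example):

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // Map (part of) the triples file read-only into virtual memory; bytes are
    // then addressed by offset without copying the whole file onto the JVM heap.
    val file    = new RandomAccessFile("wikidata-links.nt", "r")  // example file name
    val channel = file.getChannel
    // A single MappedByteBuffer is limited to ~2 GB, so a big dump needs one mapping per chunk.
    val buffer  = channel.map(FileChannel.MapMode.READ_ONLY, 0L, math.min(channel.size, Int.MaxValue.toLong))
    val firstByte = buffer.get(0)   // random access by byte offset, no seek/read call
    channel.close()
    file.close()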

Cheers,
Dimitris


On Sun, Jul 21, 2013 at 5:03 PM, Hady elsahar <[email protected]> wrote:

> After some playing around and a couple of consultations on Stack Overflow,
> here
> <http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
> and here
> <http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
> the bottleneck is indexing the triples file for fast access, instead of
> going through the file line by line.
>
> Available alternatives:
>
> 1- indexing only the line offsets of each subject in memory + using
> fixed-length triple lines and a RandomAccessFile to access specific lines fast
> 2- using a key-value store like Redis, or something like SQLite
> 3- sorting the file with merge sort, so that we don't need an index at all
> 4- using MapReduce
>
> I am implementing the first one and testing its reliability on large data.
> It seems like a hack, but I guess it's suitable because it is portable and
> doesn't require installing any libraries or infrastructure.
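>
> A rough sketch of what I mean by the first option (file name, line length
> and entity URI are just placeholders); it assumes every line in the file is
> padded to exactly the same number of bytes, newline included:
>
>     import java.io.RandomAccessFile
>     import scala.collection.mutable
>
>     val lineLength = 256                         // assumed fixed (padded) line length in bytes
>     val index = mutable.HashMap[String, Long]()  // subject -> byte offset of its first line
>
>     // one pass over the file to build the in-memory index
>     var offset = 0L
>     scala.io.Source.fromFile("fixed-width-triples.nt").getLines().foreach { line =>
>       val subject = line.takeWhile(_ != ' ')
>       if (!index.contains(subject)) index(subject) = offset
>       offset += lineLength
>     }
>
>     // later: jump straight to the first line of a subject
>     val raf = new RandomAccessFile("fixed-width-triples.nt", "r")
>     raf.seek(index("<http://www.wikidata.org/entity/Q1000>"))
>     val lineBytes = new Array[Byte](lineLength)
>     raf.readFully(lineBytes)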
>
>
>    - What do you think is the best approach here? Any other suggestions?
>    - I have always faced such problems and solved them with hacks and
>    workarounds, but I have always wondered what the state of the art is for
>    dealing with them, if there is a standard. How do you tackle such things
>    at DBpedia?
>
>
> thanks
> Regards
>
>
>
>
> On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
> [email protected]> wrote:
>
>> Hi Hady,
>>
>> You could re-use a lot of already defined utility functions for file
>> handling & triple parsing, but you are not so familiar with the framework
>> yet, so that will come in time.
>> See inline for your questions.
>>
>> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I hope everyone is enjoying the summer.
>>>
>>> I've written a Scala script
>>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>>> to generate the language-specific LL-link files to be uploaded, as
>>> mentioned by JC here
>>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>>
>>> Option 0 in the script extracts the master LL file;
>>> option 1 extracts the language-specific link files.
>>>
>>> The first iteration of the code has complexity O(n^2), where n is the
>>> number of lines in the master LL file. It seems quite dumb and would take
>>> a lot of time when running on the big dump. There are a lot of easy ways
>>> to optimize this, but I had some questions:
>>>
>>> 1- Can we rely on the triples in the RDF dump being in order? i.e. all
>>> triples of an entity such as Q1000 come one after another, so we don't
>>> need to parse the rest of the file for related triples.
>>>
>>
>> In general, no. If you need them that way you can add a "sort" step to the
>> processing pipeline.
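>>
>> For example (just one possible way, file names are placeholders), calling
>> GNU sort from Scala before the script runs:
>>
>>     import scala.sys.process._
>>     import java.io.File
>>
>>     // sort the dump by its first whitespace-separated field (the subject),
>>     // writing the result to a new file
>>     val exitCode = (Seq("sort", "-k1,1", "wikidata-links.nt") #> new File("wikidata-links-sorted.nt")).!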
>>
>>
>>> 2- In this task, which is better to optimize: memory or time? Loading the
>>> file into a HashMap would speed things up a lot, but it may take some
>>> memory.
>>>
>>
>> We'd prefer time, but it always depends. A few extra GB of memory should
>> be acceptable, but if you want to load a map with all Wikidata entries,
>> that will not scale well.
>>
>>
>>> 3- Just out of curiosity and for setting expectations: how long does the
>>> language-links extraction process for Wikipedia take, and do we dedicate
>>> a special server to it, or is it just a small process that doesn't need
>>> one?
>>>
>>
>> It's a small task compared to the Wikipedia extraction. At the scale of
>> only the language chapters it's around 15-30 minutes. But the initial ILL
>> dump is created with the extraction process, so it's not directly
>> comparable.
>>
>>
>> Best,
>> Dimitris
>>
>>
>>> 4- Any suggestions would be great.
>>>
>>> thanks
>>> Regards
>>>
>>> -------------------------------------------------
>>> Hady El-Sahar
>>> Research Assistant
>>> Center of Informatics Sciences | Nile University <http://nileuniversity.edu.eg/>
>>>
>>> email : [email protected]
>>> Phone : +2-01220887311
>>> http://hadyelsahar.me/
>>>
>>> <http://www.linkedin.com/in/hadyelsahar>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig
>> Research Group: http://aksw.org
>> Homepage: http://aksw.org/DimitrisKontokostas
>>
>
>
>
> --
> -------------------------------------------------
> Hady El-Sahar
> Research Assistant
> Center of Informatics Sciences | Nile University <http://nileuniversity.edu.eg/>
>
> email : [email protected]
> Phone : +2-01220887311
> http://hadyelsahar.me/
>
> <http://www.linkedin.com/in/hadyelsahar>
>
>
>
>
>


-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage: http://aksw.org/DimitrisKontokostas
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
