After some experimentation and a couple of consultations on Stack Overflow,
here
<http://stackoverflow.com/questions/17737449/indexing-of-large-text-files-line-by-line-for-fast-access?noredirect=1#comment25863603_17737449>
and
here
<http://stackoverflow.com/questions/17739973/updating-line-in-large-text-file-using-scala/17740460?noredirect=1#17740460>,
the bottleneck turns out to be indexing the triples file for fast access,
instead of scanning through it line by line.

Alternatives available:

1- keeping an in-memory index of the line number of each subject, combined
with fixed-length triple lines and a RandomAccessFile, so that specific
lines can be fetched fast
2- using a key-value store like Redis, or an embedded database like SQLite
3- sorting the file with a merge sort, after which we don't need an index at
all; a single sequential pass suffices (see the sketch after this list)
4- using MapReduce
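
To illustrate option 3, a minimal sketch (my own illustration, not from the
script; the file path and the assumption that the dump is N-Triples sorted
lexicographically by subject, e.g. via GNU `sort -k1,1`, are placeholders):

import scala.io.Source

// Once the dump is sorted by subject, all triples of an entity are
// adjacent, so one sequential pass replaces random access entirely.
object GroupSortedTriples {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile(args(0), "UTF-8").getLines()
    var currentSubject: Option[String] = None
    val group = scala.collection.mutable.ArrayBuffer.empty[String]

    // Emit (or otherwise process) the finished group of one subject.
    def flush(): Unit = currentSubject.foreach { s =>
      println(s"$s -> ${group.size} triples")
      group.clear()
    }

    for (line <- lines) {
      val subj = line.takeWhile(!_.isWhitespace)
      if (!currentSubject.contains(subj)) { flush(); currentSubject = Some(subj) }
      group += line
    }
    flush() // don't forget the last group
  }
}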

I am implementing the first option and testing its reliability on large
data. It may look like a hack, but I think it is suitable because it is
portable and needs no extra libraries or infrastructure to be installed.
Roughly what I have in mind is sketched below.
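
A minimal sketch of option 1 (again my own illustration; the 512-byte
record length, the padding convention, and the example Wikidata subject are
placeholder assumptions, not part of the actual script):

import java.io.RandomAccessFile
import scala.collection.mutable
import scala.io.Source

object FixedWidthTripleIndex {

  // Assumption: every line of the triples file has been padded to exactly
  // RecordLength bytes, including the trailing '\n', and the subject is
  // the first whitespace-separated token of each line.
  val RecordLength = 512

  // One pass over the file: remember the first line number of each subject.
  def buildIndex(path: String): mutable.Map[String, Long] = {
    val index = mutable.Map.empty[String, Long]
    var lineNo = 0L
    for (line <- Source.fromFile(path, "UTF-8").getLines()) {
      val subject = line.takeWhile(!_.isWhitespace)
      if (!index.contains(subject)) index(subject) = lineNo
      lineNo += 1
    }
    index
  }

  // Jump straight to a line: its byte offset is lineNo * RecordLength.
  def readLine(file: RandomAccessFile, lineNo: Long): String = {
    file.seek(lineNo * RecordLength)
    val buf = new Array[Byte](RecordLength)
    file.readFully(buf)
    new String(buf, "UTF-8").trim // strip the padding again
  }

  def main(args: Array[String]): Unit = {
    val index = buildIndex(args(0))
    val raf = new RandomAccessFile(args(0), "r")
    try {
      // e.g. fetch the first triple of Q1000, if present
      index.get("<http://www.wikidata.org/entity/Q1000>").foreach { n =>
        println(readLine(raf, n))
      }
    } finally raf.close()
  }
}

The main caveat is that the padding must be done in bytes, not characters,
or multi-byte UTF-8 IRIs will shift the offsets.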


   - What do you think is the best approach to go with? Any other
   suggestions?
   - I have often faced such problems and solved them with hacks and
   workarounds, but I have always wondered what the state of the art is for
   dealing with them, and whether there is a standard approach. How do you
   tackle such things at DBpedia?


Thanks,
Regards




On Thu, Jul 18, 2013 at 10:43 AM, Dimitris Kontokostas <
[email protected]> wrote:

> Hi Hady,
>
> You could re-use a lot of already-defined utility functions for file &
> triple parsing, but you are not so familiar with the framework yet; that
> will come with time.
> See inline for answers to your questions.
>
> On Thu, Jul 18, 2013 at 12:57 AM, Hady elsahar <[email protected]> wrote:
>
>> Hello all ,
>>
>> Hoping that everyone is enjoying the summer ,
>>
>> I've written a Scala script
>> <https://github.com/hadyelsahar/extraction-framework/blob/lang-link-extract/scripts/src/main/scala/org/dbpedia/extraction/scripts/LanguageSpecificLinksGenerator.scala>
>> to generate the language-specific LL-links files to be uploaded, as
>> mentioned by JC here
>> <http://www.mail-archive.com/[email protected]/msg00148.html>
>>
>> Option 0 in the script extracts the master LL file;
>> option 1 extracts the language-specific links files.
>>
>> The first iteration of the code has complexity O(n^2), where n is the
>> number of lines in the master LL file. That seems very naive and would
>> take a lot of time when run on the big dump. There are plenty of easy
>> ways to optimize this, but I had some questions:
>>
>> 1- Can we rely on the RDF triples dump being in order? I.e., will all
>> triples of an entity such as Q1000 appear one after another, so that we
>> don't need to parse the rest of the file for related triples?
>>
>
> In general, no. If you need them that way, you can add a "sort" step to
> the process pipeline.
>
>
>> 2- In this task, which is better to optimize for: memory or time?
>> Loading the file into a HashMap would speed things up a lot, but it may
>> take considerable memory.
>>
>
> We'd prefer time, but it always depends. A few extra GB of memory should
> be acceptable, but if you want to load a map with all Wikidata entries,
> that will not scale well.
>
>
>> 3- Just out of curiosity, and to set expectations: how long does the
>> language-links extraction process for Wikipedia take, and do we dedicate
>> a special server to it, or is it just a small process that doesn't need
>> one?
>>
>
> It's a small task compared to the Wikipedia extraction. At the scale of
> only the language chapters it takes around 15-30 minutes. But the initial
> ILL dump is created with the extraction process, so it's not directly
> comparable.
>
>
> Best,
> Dimitris
>
>
>> 4- Any other suggestions would be great.
>>
>> Thanks,
>> Regards
>>
>> -------------------------------------------------
>> Hady El-Sahar
>> Research Assistant
>> Center of Informatics Sciences | Nile University <http://nileuniversity.edu.eg/>
>>
>> email : [email protected]
>> Phone : +2-01220887311
>> http://hadyelsahar.me/
>>
>> <http://www.linkedin.com/in/hadyelsahar>
>>
>>
>>
>>
>>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage:http://aksw.org/DimitrisKontokostas
>



-- 
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University <http://nileuniversity.edu.eg/>

email : [email protected]
Phone : +2-01220887311
http://hadyelsahar.me/

<http://www.linkedin.com/in/hadyelsahar>
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
