Hi all,
I have been working on a few changes to support a naive parallelization of
the extraction process on a multi-core machine.
The idea is simply to use a multiplexed XMLSource that leverages the already
split Wikipedia dumps (e.g. pages-articles1.xml-p\d+-p\d+.bz2) together with
an ExecutorService.
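To make it concrete, here is a rough sketch of the kind of multiplexing I
mean. Note that extractFile and the file-matching regex are just placeholders
for the real XMLSource + extractor wiring, not the actual framework code:

import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

object ParallelExtractionSketch {

  def main(args: Array[String]): Unit = {
    // One task per split dump file, e.g. pages-articles1.xml-p000000010-p000010000.bz2
    val dumpDir = new File(args(0))
    val splitDumps = dumpDir.listFiles()
      .filter(_.getName.matches(""".*pages-articles\d+\.xml-p\d+-p\d+\.bz2"""))

    // Fixed pool sized to the number of cores; each worker processes one
    // split file end to end, so no coordination is needed between tasks.
    val pool = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

    for (dump <- splitDumps) {
      pool.submit(new Runnable {
        def run(): Unit = extractFile(dump)
      })
    }

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.DAYS)
  }

  // Placeholder: in the real change this builds an XMLSource over the file
  // and runs the configured extractors (e.g. the DisambiguationExtractor).
  def extractFile(dump: File): Unit =
    println("extracting " + dump.getName)
}

Since the split files are processed independently, the overall wall-clock
time is bounded by the largest single file, which is where the uneven split
sizes mentioned below come in.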
The change is small and already shows a speedup on my machine (Intel® Core™
i7 CPU 870 @ 2.93GHz × 8): I tested it by running a single extractor (namely
the DisambiguationExtractor) on the enwiki-20130604 dumps and got roughly a
2x boost (from about 120 minutes down to an average of 55 minutes).
I think this gain could be even higher if the split XML dumps were less
unevenly sized (they range from about 40 MB to 1.8 GB), although there is a
reason for that [1].
Have any of you ever built a splitting tool for the Wikipedia dumps?
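If not, here is roughly how I would approach it: cut an uncompressed dump
only on </page> boundaries once a target chunk size is reached. This is only
a sketch under that assumption; it ignores the <mediawiki>/<siteinfo> header
and closing tag that each chunk would also need to be parseable on its own:

import java.io.{BufferedReader, FileReader, FileWriter, PrintWriter}

object DumpSplitterSketch {

  val ChunkBytes = 512L * 1024 * 1024  // target chunk size, ~512 MB

  def main(args: Array[String]): Unit = {
    val in = new BufferedReader(new FileReader(args(0)))
    var part = 0
    var written = 0L
    var out = newChunk(args(0), part)

    var line = in.readLine()
    while (line != null) {
      out.println(line)
      written += line.length + 1
      // Start a new chunk once the current one is big enough and we have
      // just closed a page element, so pages are never split across chunks.
      if (written >= ChunkBytes && line.trim == "</page>") {
        out.close(); part += 1; written = 0L
        out = newChunk(args(0), part)
      }
      line = in.readLine()
    }
    out.close(); in.close()
  }

  def newChunk(base: String, part: Int) =
    new PrintWriter(new FileWriter(f"$base.part$part%03d"))
}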
Of course I am planning to contribute my changes back once I have polished
the code a bit.
Cheers
Andrea
[1]
https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_some_of_the_en_wikipedia_files_split_so_unevenly.3F