Hi Andrea,
thanks! That's quite interesting. Just to clarify things, I have a few comments.
The actual extraction has been parallelized for years. One thread
reads the dump file, unpacks it, parses the XML, puts the wikitext and
metadata for each page into an object, and passes that object to one
of n worker threads (where n is the number of available logical
processors). The worker threads parse the wikitext and run it through
the extractors. See ExtractionJob.scala [1] and Workers.scala [2].
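To sketch the pattern (this is not the actual Workers.scala code; the
Page class, queue size and poison-pill object below are made up for
illustration):

import java.util.concurrent.ArrayBlockingQueue

object ReaderWorkerSketch {

  case class Page(title: String, wikitext: String) // stand-in for WikiPage

  def main(args: Array[String]): Unit = {
    val queue  = new ArrayBlockingQueue[Page](1000)
    val poison = Page("<end>", "")
    val cores  = Runtime.getRuntime.availableProcessors

    // n worker threads: take a page, parse the wikitext, run the extractors
    val workers = (1 to cores).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var page = queue.take()
          while (page ne poison) {
            // extractor work would happen here
            page = queue.take()
          }
          queue.put(poison) // hand the stop signal on to the next worker
        }
      })
    }
    workers.foreach(_.start())

    // one reader thread (here: the main thread): read the dump, decompress,
    // parse the XML and enqueue one object per page
    for (i <- 1 to 10000) queue.put(Page(s"Page $i", "wikitext ..."))
    queue.put(poison)

    workers.foreach(_.join())
  }
}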
When several extractors are activated, this works very well (all CPUs
are fully employed). But when only a few simple extractors (e.g. the
DisambiguationExtractor) are active, the single-threaded work
dominates, and Amdahl's law says that increasing the number of worker
threads won't make things much faster (if at all), although it would
be worth a try: just change this line in ExtractionJob.scala (on the
dump branch)
private val workers = SimpleWorkers { page: WikiPage =>
to something like this
private val workers = SimpleWorkers(2.0, 1.0) { page: WikiPage =>
This means that there will be twice as many worker threads as CPU
cores, which may speed things up a bit.
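To put a rough number on the Amdahl's law point (the serial fraction
below is made up; I don't know the real one):

object AmdahlSketch extends App {
  // speedup = 1 / (s + (1 - s) / n), where s is the serial (single-threaded)
  // fraction of the total time and n is the number of worker threads
  def speedup(serialFraction: Double, workers: Int): Double =
    1.0 / (serialFraction + (1.0 - serialFraction) / workers)

  // if e.g. 80% of the time were spent in the single reader/parser thread,
  // adding more workers would barely help:
  println(speedup(0.8, 8))  // ~1.21
  println(speedup(0.8, 16)) // ~1.23
  // splitting the input attacks that serial part itself, which is why the
  // 2x you measured is plausible
}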
I don't know which part of the non-parallelized work is the slowest:
reading data from disk, bzip2 decompression, or XML parsing. Either
way, with the full dump files it would be very hard to parallelize any
one of these steps, so your approach of splitting the input is promising.
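If you want to narrow that down, one rough way would be to time the
stages in isolation, something like this (a sketch only: the file name
is made up, and I'm assuming Apache Commons Compress for the bzip2 part):

import java.io.{BufferedInputStream, FileInputStream}
import javax.xml.stream.XMLInputFactory
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object SerialStageTiming extends App {
  val dump = "some-dump-file.xml.bz2" // made-up name, point it at a real dump

  def timed[A](label: String)(body: => A): A = {
    val start  = System.nanoTime
    val result = body
    println(f"$label: ${(System.nanoTime - start) / 1e9}%.1f s")
    result
  }

  def drain(in: java.io.InputStream): Unit = {
    val buf = new Array[Byte](1 << 20)
    while (in.read(buf) != -1) {}
    in.close()
  }

  // 1. raw disk read only (beware of the OS page cache between runs)
  timed("read") {
    drain(new BufferedInputStream(new FileInputStream(dump)))
  }

  // 2. read + bzip2 decompression ('true' = handle concatenated streams)
  timed("read + decompress") {
    drain(new BZip2CompressorInputStream(
      new BufferedInputStream(new FileInputStream(dump)), true))
  }

  // 3. read + decompress + StAX parsing (pull the events, discard them)
  timed("read + decompress + parse") {
    val in = new BZip2CompressorInputStream(
      new BufferedInputStream(new FileInputStream(dump)), true)
    val reader = XMLInputFactory.newInstance.createXMLStreamReader(in)
    while (reader.hasNext) reader.next()
    reader.close()
    in.close()
  }
}

The differences between consecutive timings give a rough idea of how much
each stage costs on its own.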
Cheers,
JC
[1]
https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob.scala
[2]
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Workers.scala
On 30 October 2013 10:49, Andrea Di Menna <[email protected]> wrote:
> Hi all,
>
> I have been working on a few changes to support a naive parallelization of
> the extraction process on a multi-core machine.
> The idea is to simply use a multiplexed XMLSource, by leveraging the already
> split Wikipedia dumps (e.g. pages-articles1.xml-p\d+-p\d+.bz2), and an
> ExecutorService.
>
> The change is simple and is showing a boost on my machine (Intel® Core™ i7
> CPU 870 @ 2.93GHz × 8):
> I tested it running a single Extractor (namely the DisambiguationExtractor)
> on the enwiki-20130604 dumps and got a 2x boost (from about 120 mins to an
> average of 55 mins).
>
> I think this gain could only be higher if the split XML dumps were less
> unevenly sized (they range from about 40 MB to 1.8 GB) - which happens for a
> reason [1]
>
> Has any of you ever produced a splitting tool for the Wikipedia dumps?
>
> Of course I am planning to contribute my changes back once I polished the
> code a bit.
>
> Cheers
> Andrea
>
> [1]
> https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_some_of_the_en_wikipedia_files_split_so_unevenly.3F
>