Hi JC,
thanks very much for your clarifications!
I have been trying to add simple workers to the extraction process to test
whether I could reduce the extraction time.
Unfortunately, there was no substantial gain (almost none, to be honest).
From what I could see, the main bottleneck is the (reading data from disk
| bzip2 decompression | XML parsing) pipe, so the worker threads which
process WikiPage(s) [the consumers] mostly sit waiting for new pages to
become available, no matter how many of them there are.
The only way I can see to speed this up is to have more "producers" which
can fill the queue.
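To illustrate the shape of the problem, here is a minimal sketch (hypothetical names, not the framework's actual classes — `Page` stands in for WikiPage and the "parsing" is faked) of a bounded queue with a configurable number of producers; with a single producer the four consumers mostly idle on the queue:

```scala
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object MultiProducerSketch {
  final case class Page(id: Int) // stand-in for WikiPage

  /** Feed `total` pages through a bounded queue using `producerCount`
    * producer threads and 4 consumer threads; returns pages consumed. */
  def runPipeline(producerCount: Int, total: Int): Int = {
    val queue       = new ArrayBlockingQueue[Page](64)
    val pool        = Executors.newCachedThreadPool()
    val consumed    = new AtomicInteger(0)
    val perProducer = total / producerCount

    // Producers: each reads/decompresses/parses its own slice of the
    // dump; here each one just emits its range of page ids.
    for (p <- 0 until producerCount) pool.submit(new Runnable {
      def run(): Unit =
        for (i <- 0 until perProducer) queue.put(Page(p * perProducer + i))
    })

    // Consumers: block on the queue; with one slow producer they spend
    // most of their time waiting here, however many there are.
    for (_ <- 0 until 4) pool.submit(new Runnable {
      def run(): Unit = while (consumed.get < total) {
        val page = queue.poll(50, TimeUnit.MILLISECONDS)
        if (page != null) consumed.incrementAndGet()
      }
    })

    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
    consumed.get
  }
}
```

Adding consumer threads changes nothing once the queue is the bottleneck; adding producers (one per split dump file) is what keeps the queue full.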
I will probably work on an XML dump splitter soon (the Mahout one is a bit
outdated - don't know if it makes sense to adapt it).
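As a starting point, a splitter mainly needs to cut on </page> boundaries so every chunk holds only complete pages. A minimal line-based sketch (hypothetical, not the Mahout tool; it assumes an uncompressed dump where <page>/</page> sit on their own lines, as in MediaWiki exports, and it does not replicate the <siteinfo> header into each chunk):

```scala
import scala.collection.mutable.ArrayBuffer

object DumpSplitterSketch {
  /** Cut a dump into chunks of roughly `targetBytes` each, closing a
    * chunk only after a </page> line so pages are never split. Leading
    * header lines end up in the first chunk, trailing ones in the last. */
  def splitPages(lines: Iterator[String], targetBytes: Long): Vector[Vector[String]] = {
    val chunks  = ArrayBuffer[Vector[String]]()
    val current = ArrayBuffer[String]()
    var size    = 0L
    for (line <- lines) {
      current += line
      size += line.length + 1 // rough byte count incl. newline
      if (line.trim == "</page>" && size >= targetBytes) {
        chunks += current.toVector
        current.clear()
        size = 0
      }
    }
    if (current.nonEmpty) chunks += current.toVector // trailer / remainder
    chunks.toVector
  }
}
```

A real tool would also re-emit the <mediawiki>/<siteinfo> wrapper around each chunk and write bzip2 output, but the boundary logic stays the same.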
Cheers
Andrea
>
2013/10/30 Jona Christopher Sahnwaldt <[email protected]>
> Hi Andrea,
>
> thanks! That's quite interesting. Just to clarify things, I have a few
> comments.
>
> The actual extraction has been parallelized for years. One thread
> reads the dump file, unpacks it, parses the XML, puts the wikitext and
> meta data for each page into an object, and passes that object to one
> of n worker threads (where n is the number of available logical
> processors). The worker threads parse the wikitext and run it through
> the extractors. See ExtractionJob.scala [1] and Workers.scala [2].
>
> When several extractors are activated, this works very well (all CPUs
> are fully employed). But when only a few simple extractors (e.g. the
> DisambiguationExtractor) are active, the single-threaded work
> dominates, and Amdahl's law says that increasing the number of worker
> threads won't make things much faster (if at all), although it would
> be worth a try: just change this line in ExtractionJob.scala (on the
> dump branch)
>
> private val workers = SimpleWorkers { page: WikiPage =>
>
> to something like this
>
> private val workers = SimpleWorkers(2.0, 1.0) { page: WikiPage =>
>
> This means that there will be twice as many threads as CPU cores. May
> speed things up a bit.
>
> I don't know which part of the non-parallelized work is the slowest:
> reading data from disk, bzip decompression, or XML parsing. Either
> way, with the full dump files, it would be very hard to parallelize
> one of these, so your approach is promising.
>
> Cheers,
> JC
>
> [1]
> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob.scala
> [2]
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Workers.scala
>
> On 30 October 2013 10:49, Andrea Di Menna <[email protected]> wrote:
> > Hi all,
> >
> > I have been working on a few changes to support a naive parallelization
> > of the extraction process on a multi-core machine.
> > The idea is to simply use a multiplexed XMLSource, by leveraging the
> > already split Wikipedia dumps (e.g. pages-articles1.xml-p\d+-p\d+.bz2),
> > and an ExecutorService.
> >
> > The change is simple and is showing a boost on my machine (Intel® Core™
> > i7 CPU 870 @ 2.93GHz × 8):
> > I tested it running a single Extractor (namely the
> > DisambiguationExtractor) on the enwiki-20130604 dumps and got a 2x
> > boost (from about 120 mins to avg 55 mins).
> >
> > I think this gain could be even higher if the split XML dumps were less
> > unevenly sized (they range from about 40 MB to 1.8 GB) - there is a
> > reason for that [1]
> >
> > Has any of you ever produced a splitting tool for the Wikipedia dumps?
> >
> > Of course I am planning to contribute my changes back once I have
> > polished the code a bit.
> >
> > Cheers
> > Andrea
> >
> > [1]
> > https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_some_of_the_en_wikipedia_files_split_so_unevenly.3F
> >
> >
_______________________________________________
Dbpedia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers