Hi,
I wrote a simple MediaWiki XML dump splitter which uses the multistream
version of the pages-articles dump [1].
The splitter also needs the index file containing the byte offsets of the
bzip2 streams, which is released together with the multistream dump (see
[2] and [3]).
The splitter can be configured to produce comparably sized XML dump chunks
(based on compressed file size rather than on the number of pages).
On my machine it takes about 5 minutes to split the huge dump into ~64 MB
chunks.
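
The core idea is just a byte-range copy driven by the index offsets. Here
is a minimal sketch of the approach (not the code from the pull request;
it assumes a decompressed index whose lines look like "offset:pageId:title",
and it ignores the header/footer streams, which the real splitter has to
handle):

    import java.io.{FileOutputStream, RandomAccessFile}
    import scala.io.Source

    object MultistreamSplitter {
      def split(dump: String, index: String, chunkBytes: Long): Unit = {
        // distinct bzip2 stream start offsets from the index, ascending
        val offsets = Source.fromFile(index).getLines()
          .map(_.takeWhile(_ != ':').toLong).toSeq.distinct.sorted
        val in = new RandomAccessFile(dump, "r")
        // each stream runs from one offset to the next (or to end of file)
        val bounds = (offsets :+ in.length).sliding(2)
        var id = 0
        var written = 0L
        var out = new FileOutputStream(f"chunk-$id%03d.xml.bz2")
        for (Seq(start, stop) <- bounds) {
          if (written >= chunkBytes) { // rotate chunks at a stream boundary
            out.close(); id += 1; written = 0L
            out = new FileOutputStream(f"chunk-$id%03d.xml.bz2")
          }
          val buf = new Array[Byte]((stop - start).toInt)
          in.seek(start)
          in.readFully(buf)
          out.write(buf)
          written += buf.length
        }
        out.close()
        in.close()
      }
    }

Because bzip2 streams can simply be concatenated, each chunk produced this
way is itself a valid (multistream) .bz2 file.
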
Pros:
- can be useful when the number of available cores is bigger than the
number of chunks produced by Wikimedia (e.g. a 32-core machine)
- could reduce the extraction time, since the chunks are similar in size
Cons:
- the multistream file is bigger than the sequential file, so download
times increase
- the index file [3] must be downloaded as well
- splitting the multistream dump takes time (a few minutes on my machine
with an SSD)
- the multistream file is generally produced and distributed at the end of
the dump process
I have already sent a pull request.
Please let me know what you think about it.
Cheers
Andrea
[1]
http://permalink.gmane.org/gmane.science.linguistics.wikipedia.research/1803
[2]
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
[3]
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream-index.txt.bz2
2013/10/30 Andrea Di Menna <[email protected]>
> Hi JC,
>
> thanks very much for your clarifications!
> I have been trying to add simple workers to the extraction process to test
> whether I could reduce the extraction time.
> Unfortunately there was no substantial gain (almost none, to be honest).
>
> From what I could see, the main bottleneck is the (read from disk | bzip2
> decompression | XML parsing) pipe, so the worker threads which process
> WikiPage objects (the consumers) mostly wait for new pages to become
> available, no matter how many of them there are.
> The only way I see to speed this up is to have more "producers" filling
> the queue, e.g. one per pre-split chunk, as in the sketch below.
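>
> Roughly what I mean by more producers (a hypothetical sketch: the queue,
> chunkFiles and readPages names are made up, not the framework's API):
>
>     import java.util.concurrent.{ArrayBlockingQueue, Executors}
>
>     val queue = new ArrayBlockingQueue[WikiPage](1024)
>     val pool  = Executors.newFixedThreadPool(chunkFiles.size)
>     // one producer per pre-split chunk: read, decompress, parse XML
>     for (file <- chunkFiles) pool.submit(new Runnable {
>       def run(): Unit = for (page <- readPages(file)) queue.put(page)
>     })
>     // the consumers keep taking pages from the queue as before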
>
> I will probably work on an XML dump splitter soon (the Mahout one is a bit
> outdated; I don't know if it makes sense to adapt it).
>
> Cheers
> Andrea
>
>
> 2013/10/30 Jona Christopher Sahnwaldt <[email protected]>
>
>> Hi Andrea,
>>
>> thanks! That's quite interesting. Just to clarify things, I have a few
>> comments.
>>
>> The actual extraction has been parallelized for years. One thread
>> reads the dump file, unpacks it, parses the XML, puts the wikitext and
>> metadata for each page into an object, and passes that object to one
>> of n worker threads (where n is the number of available logical
>> processors). The worker threads parse the wikitext and run it through
>> the extractors. See ExtractionJob.scala [1] and Workers.scala [2].
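>>
>> Schematically, the job looks roughly like this (a simplified paraphrase;
>> the process call and runExtractors are stand-ins, see the linked sources
>> for the real API):
>>
>>     val workers = SimpleWorkers { page: WikiPage =>
>>       // one of n threads: parse the wikitext, run the extractors
>>       runExtractors(page)
>>     }
>>     // single thread: read the file, bunzip2, parse XML, hand pages over
>>     for (page <- source) workers.process(page)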
>>
>> When several extractors are activated, this works very well (all CPUs
>> are fully employed). But when only a few simple extractors (e.g. the
>> DisambiguationExtractor) are active, the single-threaded work
>> dominates, and Amdahl's law says that increasing the number of worker
>> threads won't make things much faster (if at all), although it would
>> be worth a try: just change this line in ExtractionJob.scala (on the
>> dump branch)
>>
>> private val workers = SimpleWorkers { page: WikiPage =>
>>
>> to something like this
>>
>> private val workers = SimpleWorkers(2.0, 1.0) { page: WikiPage =>
>>
>> This means that there will be twice as many threads as CPU cores. That
>> may speed things up a bit.
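>>
>> For intuition, Amdahl's law: with serial fraction s and n worker
>> threads, the speedup is 1 / (s + (1 - s) / n). If, say, half the time
>> goes into the single-threaded read/decompress/parse stage (s = 0.5, a
>> made-up number), the ceiling is 2x no matter how many workers you add.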
>>
>> I don't know which part of the non-parallelized work is the slowest:
>> reading data from disk, bzip decompression, or XML parsing. Either
>> way, with the full dump files, it would be very hard to parallelize
>> one of these, so your approach is promising.
>>
>> Cheers,
>> JC
>>
>> [1]
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob.scala
>> [2]
>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Workers.scala
>>
>> On 30 October 2013 10:49, Andrea Di Menna <[email protected]> wrote:
>> > Hi all,
>> >
>> > I have been working on a few changes to support a naive
>> > parallelization of the extraction process on a multi-core machine.
>> > The idea is to simply use a multiplexed XMLSource, leveraging the
>> > already split Wikipedia dumps (e.g. pages-articles1.xml-p\d+-p\d+.bz2)
>> > and an ExecutorService.
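>> >
>> > In short, something like this (a hypothetical sketch; splitDumps and
>> > extract are stand-ins for the real code):
>> >
>> >     import java.util.concurrent.{Executors, TimeUnit}
>> >
>> >     val pool = Executors.newFixedThreadPool(
>> >       Runtime.getRuntime.availableProcessors)
>> >     // one extraction task per pre-split dump file
>> >     for (dump <- splitDumps) pool.submit(new Runnable {
>> >       def run(): Unit = extract(dump)  // parse XML + run extractors
>> >     })
>> >     pool.shutdown()
>> >     pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)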
>> >
>> > The change is simple and is showing a boost on my machine (Intel® Core™
>> > i7 CPU 870 @ 2.93GHz × 8):
>> > I tested it running a single extractor (namely the
>> > DisambiguationExtractor) on the enwiki-20130604 dumps and got a 2x
>> > speedup (from about 120 minutes to an average of 55 minutes).
>> >
>> > I think this gain could be higher only if the split XML dumps were less
>> > unevenly sized (they range from about 40 MB to 1.8 GB); they are split
>> > that way for a reason [1].
>> >
>> > Has any of you ever produced a splitting tool for the Wikipedia dumps?
>> >
>> > Of course I am planning to contribute my changes back once I have
>> > polished the code a bit.
>> >
>> > Cheers
>> > Andrea
>> >
>> > [1]
>> > https://meta.wikimedia.org/wiki/Data_dumps/FAQ#Why_are_some_of_the_en_wikipedia_files_split_so_unevenly.3F