OCG (the Offline Content Generator) contains a "plaintext" backend which generates quite nice plain-text versions of Wikipedia articles. To try it: click "Create a book" in the enwiki sidebar, then "Start book creator"; go to some article and click "Add this page to your book" in the header, then "Show book"; finally, change the format in the drop-down to "Word processor (plain text)" and click "Download".
You can also take the "download as PDF" link, something like
https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2latex
and replace the 'writer=rdf2latex' part at the end with 'writer=rdf2text', like:
https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden&oldid=741271566&writer=rdf2text

These tools can also be used from the command line, as described at
https://github.com/wikimedia/mediawiki-extensions-Collection-OfflineContentGenerator-text_renderer

I hope that helps!
 --scott

On Fri, Nov 18, 2016 at 3:15 AM, Reem Al-Kashif <[email protected]> wrote:

> Hi Scott,
>
> Thank you so much for your reply and offer to help with Parsoid. I used
> DizzyLogic as an easy parser to get Wikipedia articles' content stripped
> of the wiki markup. The results were plain text files. I used it to
> parse the whole English and Arabic Wikipedia dumps back in January. It
> was easy to use because my coding knowledge is limited.
> I read the link you kindly provided about Parsoid and I think it can
> help me with parsing. However, I'm not sure how to start testing it.
>
> Thank you :)
>
> Best,
> Reem
>
> On 11 November 2016 at 19:55, C. Scott Ananian <[email protected]>
> wrote:
>
>> It was removed from that article recently (19 Oct 2016:
>> https://www.mediawiki.org/w/index.php?title=Alternative_parsers&type=revision&diff=2265815&oldid=2247632)
>> with the following comment:
>>
>> "That link has been dead for over a year now as per this stackoverflow
>> comment:
>> http://stackoverflow.com/questions/13546254/whats-a-fast-way-to-parse-a-wikipedia-xml-dump-for-article-content-and-populate"
>>
>> If you'd like to explain what you would have used DizzyLogic for, I'd
>> love to help you figure out how to use Parsoid to accomplish your goals.
>> It's an officially-supported WMF parser which has much better
>> correctness than any 'alternative' parser out there, implements a
>> friendly API similar to mwparserfromhell (see
>> https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi), and has a
>> well-documented AST (https://www.mediawiki.org/wiki/Specs/HTML/1.2.1)
>> which can be directly fetched via the REST API (cf.
>> https://en.wikipedia.org/api/ ). I believe dumps have also been
>> planned, but I'm not sure what the current status is.
>>  --scott
>>
>> On Fri, Nov 11, 2016 at 7:57 AM, Reem Al-Kashif <[email protected]>
>> wrote:
>>
>>> Hi Pine,
>>>
>>> Thank you for your reply. It is an alternative parser. I believe I
>>> first saw it on MediaWiki (here
>>> <https://www.mediawiki.org/wiki/Alternative_parsers>).
>>>
>>> Best,
>>> Reem
>>>
>>> On 11 November 2016 at 09:47, Pine W <[email protected]> wrote:
>>>
>>>> Was this something on Labs? If so, it might have been purged during
>>>> one of the Labs cleanups.
>>>>
>>>> Pine
>>>>
>>>> On Tue, Nov 8, 2016 at 2:33 PM, Reem Al-Kashif <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm just wondering if anybody knows what happened to the DizzyLogic
>>>>> wiki parser? The website and program vanished. I used it in January
>>>>> 2016, so I know it was there at that time.
>>>>>
>>>>> Best,
>>>>> Reem
>>>>>
>>>>> --
>>>>>
>>>>> *Kind regards, Reem Al-Kashif*
>>
>> --
>> (http://cscott.net)

--
(http://cscott.net)
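
For anyone scripting the link rewrite described at the top of this message, here is a minimal Python sketch, not an official tool. It only swaps the `writer` query parameter of a Special:Book render link; it does not download anything. The function name `with_writer` is made up for illustration, and keeping ":" unescaped is just a choice to make the result look like the hand-edited link.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_writer(url: str, writer: str = "rdf2text") -> str:
    """Return `url` with its `writer` query parameter set to `writer`.

    `with_writer` is a hypothetical helper name, not part of OCG or MediaWiki.
    """
    parts = urlsplit(url)
    # Keep every parameter except any existing `writer`, preserving order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "writer"
    ]
    query.append(("writer", writer))
    # safe=":" leaves "Special:Book" readable instead of "Special%3ABook".
    return urlunsplit(parts._replace(query=urlencode(query, safe=":")))


pdf_url = (
    "https://en.wikipedia.org/w/index.php?title=Special:Book"
    "&bookcmd=render_article&arttitle=Jack+Bosden&returnto=Jack+Bosden"
    "&oldid=741271566&writer=rdf2latex"
)
text_url = with_writer(pdf_url)
print(text_url)
```

The printed link is the same 'writer=rdf2text' URL shown above; opening it in a browser (or fetching it with whatever HTTP client you prefer) is left out of the sketch.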
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
