On Sat, Aug 7, 2010 at 9:21 AM, lmhelp <[email protected]> wrote:
>
> MY FIRST QUESTION IS:
> =====================
> I was wondering if you knew a better tool than this one... one which
> wouldn't "miss" some "Wikitext" chunks of code like in the above
> example (or maybe which at least would handle usual templates like
> "lang" and "formatnum")?
>

mwlib is the best parser available for folks who want to do a quick job
such as yours.

> MY SECOND QUESTION IS:
> ======================
> I was also wondering: the parser which is used in "Wikipedia" works
> pretty well... I mean: such things as above never happen... as far as
> I know...
> So my question is: is this parser available? Where?
> Can I use it with my Java code?
> And please, forgive me if this question is naïve...
>

You can use the dumpHTML maintenance script to convert the wikitext to
HTML, and then use a DOM library such as BeautifulSoup to grab all of
the text nodes. This approach is very similar to using mwlib, except
that it is a little bit easier to set up and costs a lot more CPU time.

_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
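
To make the second half of the dumpHTML suggestion above more concrete, here is a minimal sketch of the "grab all of the text nodes" step. It assumes the wikitext has already been rendered to HTML; running dumpHTML itself is not shown, since that depends on your installation. To keep the sketch free of third-party dependencies it uses Python's standard `html.parser` module in place of BeautifulSoup, and the names `TextNodeCollector`, `extract_text` and `sample_html` are made up for illustration. With BeautifulSoup installed, `BeautifulSoup(html, "html.parser").get_text()` should give you much the same result in one call.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the text nodes of an HTML document.

    Text inside <script> and <style> is skipped, since it is not
    part of the readable page content.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # text nodes found so far, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Drop whitespace-only nodes and anything inside skipped elements.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text nodes of `html` as a list of strings."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical fragment standing in for one rendered page.
sample_html = """
<html><head><title>Paris</title><style>p { color: red; }</style></head>
<body><h1>Paris</h1>
<p>Population: <span class="num">2,193,031</span> inhabitants.</p>
</body></html>
"""

print(extract_text(sample_html))
```

Because templates such as "lang" and "formatnum" are expanded during rendering, they should show up here as ordinary text rather than as leftover wikitext. That is the part you pay for in CPU time.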
