On Sat, Aug 7, 2010 at 9:21 AM, lmhelp <[email protected]> wrote:

>
> MY FIRST QUESTION IS:
> =====================
> I was wondering if you knew a better tool than this one... one which
> wouldn't "miss" some "Wikitext" chunks of code like in the above
> example (or maybe which at least would handle usual templates like
> "lang" and "formatnum")?
>
>
mwlib is the best parser available for folks who want to do a quick job such
as yours.



> MY SECOND QUESTION IS:
> ======================
> I was also wondering: the parser which is used in "Wikipedia" works
> pretty well... I mean: such things as above never happen... as far as
> I know...
> So my question is: is this parser available? Where?
> Can I use it with my Java code?
> And please, forgive me if this question is naïve...
>
>
You can use the dumpHTML maintenance script to convert wikitext to HTML, and
then use a DOM library such as BeautifulSoup to grab all of the text
nodes. This approach is very similar to using mwlib, except that it costs a
lot more CPU time in exchange for being a little bit easier.
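
As a rough sketch of the second step (pulling the text nodes out of HTML
that has already been generated), something like the following could work.
It uses Python's standard-library html.parser as a stand-in for
BeautifulSoup so that it runs without extra packages. The names
TextNodeCollector and extract_text are made up for illustration, and
skipping script/style content is an assumption about what you want to keep.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect the text nodes of an HTML document (hypothetical helper)."""

    # Assumption: the contents of these tags are not wanted as article text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.text_nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped tags.
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self.text_nodes.append(stripped)


def extract_text_nodes(html):
    """Return the list of text nodes found in the given HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.text_nodes


def extract_text(html):
    """Return all text nodes joined by single spaces."""
    return " ".join(extract_text_nodes(html))
```

Calling extract_text() on each HTML page written by dumpHTML would give you
the plain text of that page. With BeautifulSoup the same idea is usually
much shorter, at the cost of an extra dependency.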
_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
