[Nutch-general] Re: nutch and multilingualism

Wray Buntine Wed, 08 Mar 2006 02:55:02 -0800

Ivan Sekulovic wrote:

Jérôme Charron wrote:
Would it be possible to generate ngram profiles for the LanguageIdentifier
plugin from crawled content and not from a file? My idea is this: the best
source for content in one language could be wikipedia.org. We would
just crawl Wikipedia in the desired language and then create an ngram
profile from it. What are your thoughts about this idea?
I think it could be a good idea.
Wikipedia could be a good source (not sure it is the best one).
But instead of crawling Wikipedia, it would probably be easier to download a
Wikipedia dump
(http://download.wikimedia.org/) and then extract its textual content to a
file... no?
I agree about Wikipedia. But because Nutch is a content-fetching tool, it
would be useful to have some sort of tool that uses that content for
creating ngram profiles. It seems natural. Maybe it would be possible
to create some sort of plain-text export of the indexed content.
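
To make the idea above concrete, here is a minimal sketch of building a
character ngram profile from a plain-text export. It is a generic
rank-ordered profile, not the file format or algorithm of the Nutch
LanguageIdentifier plugin; the function name, the "_" word padding, the
1..4 gram lengths and the top_k cutoff are all assumptions made for
illustration.

```python
import re
from collections import Counter


def ngram_profile(text, min_n=1, max_n=4, top_k=300):
    """Return the top_k most frequent character ngrams, most frequent first.

    Hypothetical helper: the parameters are illustrative defaults,
    not values taken from Nutch.
    """
    counts = Counter()
    # Keep runs of letters only; digits and punctuation say little about language.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram != "_":  # a lone boundary marker is not informative
                    counts[gram] += 1
    return [gram for gram, _ in counts.most_common(top_k)]
```

Feeding it the extracted text of one language's pages would give one
profile per language; how that profile is then written out for the
plugin is left open here.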


We use this content on a regular basis. The short story is: grab their
MediaWiki dump and write a simple text extractor, which will probably need
to be modified somewhat regularly. We have a more complex structured-text
extractor in Perl that we use because we want more of the structure retained.

Some issues:

1) It is one of the largest and most varied language collections around.
2) It is not typical text: not conversational, less commercial, and lots of
   bizarre stuff that could confuse ngram analysis (e.g., character tables).
3) The standard dump is in MediaWiki format, which changes a bit almost
   every month. They do supply some tools for conversion, e.g., a PHP
   script for conversion to HTML, but at any one point in time these are
   usually broken on many pages.
4) The alternative is just to run the MediaWiki app and crawl in situ.
   This takes almost a week on a single mediocre CPU box because MediaWiki
   conversion is s-l-o-w. So I recommend against this option.
5) We maintain a Perl MediaWiki to poor-man's-HTML converter that we
   use to retain broad HTML structure. If all you want is to extract text
   and retain proper sentence and word boundaries, I expect your task
   will be easy.
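
As a rough illustration of the "simple text extractor" route, the sketch
below strips the most common MediaWiki markup with regular expressions.
It is not our Perl converter and not a real MediaWiki parser: the
function name and the set of constructs handled are assumptions, and it
will need adjusting as the dump format changes. Removed markup is
replaced by a space so that word boundaries survive, and tables are
dropped entirely because of issue 2.

```python
import re


def strip_wikitext(markup):
    """Reduce MediaWiki markup to plain text (hypothetical, best-effort helper)."""
    text = re.sub(r"<!--.*?-->", " ", markup, flags=re.DOTALL)         # HTML comments
    text = re.sub(r"<ref[^>]*?/>", " ", text)                          # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", " ", text, flags=re.DOTALL)  # footnotes
    # Templates can nest, so remove the innermost ones until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.DOTALL)          # tables
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)         # [url label] -> label
    text = re.sub(r"\[https?://\S+\]", " ", text)                      # bare [url]
    text = re.sub(r"<[^>]+>", " ", text)                               # leftover tags
    text = re.sub(r"'{2,}", "", text)                                  # bold/italic quotes
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    text = re.sub(r"[ \t]+", " ", text)                                # collapse spaces
    return text.strip()
```

Line breaks are left alone, so paragraph and sentence boundaries in the
source carry through to the output. Nested tables and templates that
wrap tables are not handled.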

Wray Buntine

