Re: ITP: cld2 -- Compact Language Detector 2

Andreas Tille Tue, 10 Feb 2015 05:35:55 -0800

Hi Gianfranco,

this looks like a nice target for the Debian Science linguistics task,
so it would be great to CC the list.  I assume you will maintain the
package in the Debian Science team.


Kind regards

       Andreas.

On Tue, Feb 10, 2015 at 09:28:18AM +0000, Gianfranco Costamagna wrote:
> Package: wnpp
> Severity: wishlist
> Owner: Gianfranco Costamagna <[email protected]>
> 
> * Package name    : cld2
>   Version         : 0.0.0~svn193
>   Upstream Author : Dick Sites [email protected]
> * URL             : https://code.google.com/p/cld2/
> * License         : Apache-2.0
>   Programming Lang: C++
>   Description     : Compact Language Detector 2
> 
> CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text,
> either plain text or HTML/XML. Legacy encodings must be converted to
> valid UTF-8 by the caller. For mixed-language input, CLD2 returns the
> top three languages found and their approximate percentages of the
> total text bytes (e.g. 80% English and 20% French out of 1000 bytes of
> text means about 800 bytes of English and 200 bytes of French).
> Optionally, it also returns a vector of text spans with the language of
> each identified. This may be useful for applying different
> spelling-correction dictionaries or different machine translation
> requests to each span. The design target is web pages of at least 200
> characters (about two sentences); CLD2 is not designed to do well on
> very short text, lists of proper names, part numbers, etc.
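
To make the byte arithmetic above concrete, here is a minimal Python
sketch. The function name approx_bytes and the dict layout are made up
for illustration and are not part of the CLD2 API; the code only redoes
the 80%/20% example from the description.

```python
def approx_bytes(percentages, total_bytes):
    """Convert per-language percentages into approximate byte counts."""
    return {
        lang: total_bytes * pct // 100
        for lang, pct in percentages.items()
    }


# The example from the description: 1000 bytes, 80% English, 20% French.
print(approx_bytes({"English": 80, "French": 20}, 1000))
```
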
> 
> CLD2 is a Naïve Bayesian classifier, using one of three different token
> algorithms. For Unicode scripts such as Greek and Thai that map
> one-to-one to detected languages, the script defines the result. For
> the 80,000+ character Han script and its CJK combination with Hiragana,
> Katakana, and Hangul scripts, single letters (unigrams) are scored. For
> all other scripts, sequences of four letters (quadgrams) are scored.
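
The three-way choice of token algorithm could be sketched roughly as
below. This is only an illustration: token_strategy and the two sets are
hypothetical names, and using the first word of the Unicode character
name as a stand-in for the script is an approximation of my own, not
how CLD2 does it.

```python
import unicodedata

# Assumption: only the two scripts named in the description are listed here.
SCRIPT_ONLY = {"GREEK", "THAI"}
# Han shows up as "CJK ..." in Unicode character names.
UNIGRAM_SCRIPTS = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def token_strategy(ch):
    """Pick a token algorithm for one letter, based on a rough script guess."""
    script = unicodedata.name(ch, "UNKNOWN").split()[0]
    if script in SCRIPT_ONLY:
        return "script"      # the script alone decides the language
    if script in UNIGRAM_SCRIPTS:
        return "unigram"     # single letters are scored
    return "quadgram"        # four-letter sequences are scored


for ch in "αก中あ한a":
    print(ch, token_strategy(ch))
```
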
> 
> Scoring is done exclusively on lowercased Unicode letters and marks,
> after expanding HTML entities &xyz; and after deleting digits,
> punctuation, and <tags>. Quadgram word beginnings and endings
> (indicated here by underscore) are explicitly used, so the word _look_
> scores differently from the word-beginning _look or the mid-word look.
> Quadgram single-letter "words" are completely ignored. For each letter
> sequence, the scoring uses the 3-6 most likely languages and their
> quantized log probabilities. The training corpus is manually
> constructed from chosen web pages for each language, then augmented by
> careful automated scraping of over 100M additional web pages.
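
The preprocessing and the word-boundary marking can be illustrated with
a small Python sketch. The names letter_words and quadgrams are
hypothetical, the tag stripping is a crude regex, and treating words of
two or three letters as a single token is an assumption of mine; the
description does not say how CLD2 handles them.

```python
import html
import re
import unicodedata


def letter_words(text):
    """Drop <tags>, expand entities, lowercase, keep only letters and marks."""
    text = re.sub(r"<[^>]*>", " ", text)
    text = html.unescape(text).lower()
    kept = [
        ch if unicodedata.category(ch)[0] in "LM" else " "
        for ch in text
    ]
    return "".join(kept).split()


def quadgrams(text):
    """Return four-letter windows, with '_' marking a word start or end."""
    grams = []
    for word in letter_words(text):
        if len(word) < 2:
            continue                      # single-letter "words" are ignored
        if len(word) < 4:
            grams.append("_" + word + "_")  # assumption: short word = one token
            continue
        last = len(word) - 4
        for i in range(last + 1):
            gram = word[i:i + 4]
            if i == 0:
                gram = "_" + gram         # window starts at the word start
            if i == last:
                gram = gram + "_"         # window ends at the word end
            grams.append(gram)
    return grams


print(quadgrams("<b>Look</b> at 42 looking &amp; a outlooks!"))
```

In the output, "_look_", "_look" and "look" come out as three distinct
tokens, which is the distinction the description makes.
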
> 
> Several embellishments improve the basic algorithm: additional scoring
> of some sequences of two CJK letters or eight other letters; scoring
> some words and word pairs that are distinctive within sets of
> statistically-close languages such as {Malay, Indonesian} or {Spanish,
> Portuguese, Galician}; removing repetitive sequences/words that would
> otherwise skew the scoring, such as “jpg” in “foo.jpg bar.jpg baz.jpg”;
> removing web-specific words that convey almost no language information
> such as page, link, click, td, tr, copyright, wikipedia, http.
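
As a rough illustration of the last two embellishments, the sketch below
drops the web-specific words listed above and caps how often any one
word is kept. The cap-per-word rule is a simple stand-in of my own; the
description does not say how CLD2 actually detects repetition, and
filter_words and max_repeats are hypothetical names.

```python
# Word list taken from the description above.
WEB_WORDS = {"page", "link", "click", "td", "tr",
             "copyright", "wikipedia", "http"}


def filter_words(words, max_repeats=1):
    """Drop web-specific words and keep each word at most max_repeats times."""
    seen = {}
    kept = []
    for word in words:
        if word in WEB_WORDS:
            continue
        seen[word] = seen.get(word, 0) + 1
        if seen[word] <= max_repeats:
            kept.append(word)
    return kept


# "foo.jpg bar.jpg baz.jpg" after punctuation has been removed:
print(filter_words("foo jpg bar jpg baz jpg click http".split()))
```
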
> 
> Several hints can be supplied. Because these can be inaccurate on web
> pages, they are just hints -- they add a bias but do not force a
> specific language to be the detection result. The hints include
> expected language, original document encoding, document URL top-level
> domain name, and embedded <…lang=xx …> language tags.
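
The "bias but not force" behaviour can be shown with a toy example. The
scores, the bias value, and the names apply_hints and best are all
invented for illustration; nothing here reflects CLD2's real score scale
or API.

```python
def apply_hints(scores, hinted_langs, bias=2):
    """Return a copy of scores with a fixed bonus for each hinted language."""
    boosted = dict(scores)
    for lang in hinted_langs:
        if lang in boosted:
            boosted[lang] += bias
    return boosted


def best(scores):
    """Language with the highest score."""
    return max(scores, key=scores.get)


close_call = {"es": 40, "pt": 39}   # statistically close languages
clear_call = {"en": 60, "fr": 20}   # text is plainly English

print(best(apply_hints(close_call, ["pt"])))  # the hint tips the balance
print(best(apply_hints(clear_call, ["fr"])))  # the hint is overruled
```
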
> 
> The table-driven extraction of letter sequences and table-driven
> scoring is highly optimized for both space and speed, running about 10x
> faster than other detectors and covering over 70 languages in 1.8MB of
> x86 code and tables. The main quadgram lookup table consists of 256K
> four-byte entries, covering about 50 languages. Detection over the
> average web page of 30KB (half tags/digits/punctuation, half letters)
> takes roughly 1 msec on a current x86 processor.
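
For a feel of the sizes involved, the figures above work out as follows.
The only assumption is that "K" and "KB" mean 1024; with that reading
the main quadgram table is 1 MiB, i.e. more than half of the quoted
1.8MB, and the quoted speed is roughly 30 MB of page text per second.

```python
KIB = 1024  # assumption: "K" and "KB" in the description mean 1024

table_bytes = 256 * KIB * 4        # 256K entries, four bytes each
page_bytes = 30 * KIB              # average web page of 30KB
pages_per_second = 1000            # roughly 1 msec per page
throughput_bytes_per_s = page_bytes * pages_per_second

print(table_bytes)                 # size of the main quadgram table in bytes
print(throughput_bytes_per_s)      # approximate detection throughput
```
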
> 
> CLD2 is an update of the prior CLD, adding more languages, updating to
> Unicode 6.2 characters, improving scoring, and adding the optional
> output vector of labelled language spans.
> 
> These 83 languages are detected: Afrikaans Albanian Arabic Armenian
> Azerbaijani Basque Belarusian Bengali Bihari Bulgarian Catalan Cebuano
> Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi Dutch English
> Estonian Finnish French Galician Ganda Georgian German Greek Gujarati
> Haitian_Creole Hebrew Hindi Hmong Hungarian Icelandic Indonesian
> Inuktitut Irish Italian Javanese Japanese Kannada Khmer Kinyarwanda
> Korean Laothian Latvian Limbu Lithuanian Macedonian Malay Malayalam
> Maltese Marathi Nepali Norwegian Oriya Persian Polish Portuguese
> Punjabi Romanian Russian Scots_Gaelic Serbian Sinhalese Slovak
> Slovenian Spanish Swahili Swedish Syriac Tagalog Tamil Telugu Thai
> Turkish Ukrainian Urdu Vietnamese Welsh Yiddish.
> 
> 
> Useful for the upcoming poedit 1.8 release.
> 
> 
> --
> To UNSUBSCRIBE, email to [email protected]
> with a subject of "unsubscribe". Trouble? Contact [email protected]
> Archive: 
> https://lists.debian.org/[email protected]
> 
> 

-- 
http://fam-tille.de

