Hi Andreas,

I do not have a target (yet), so Debian Science might be the best target :)
Thanks for pointing that out to me! (I was wondering about debian-l10n.)

cheers,
Gianfranco

On Tuesday, 10 February 2015 14:35, Andreas Tille <[email protected]> wrote:

Hi Gianfranco,

this looks like a nice target for the Debian Science linguistics tasks, so
it would be great to CC the list. I assume you will maintain the package
in the Debian Science team.

Kind regards

Andreas.

On Tue, Feb 10, 2015 at 09:28:18AM +0000, Gianfranco Costamagna wrote:
> Package: wnpp
> Severity: wishlist
> Owner: Gianfranco Costamagna <[email protected]>
>
> * Package name    : cld2
>   Version         : 0.0.0~svn193
>   Upstream Author : Dick Sites [email protected]
> * URL             : https://code.google.com/p/cld2/
> * License         : Apache-2.0
>   Programming Lang: C++
>   Description     : Compact Language Detector 2
>
> CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text,
> either plain text or HTML/XML. Legacy encodings must be converted to
> valid UTF-8 by the caller. For mixed-language input, CLD2 returns the
> top three languages found and their approximate percentages of the
> total text bytes (e.g. 80% English and 20% French out of 1000 bytes of
> text means about 800 bytes of English and 200 bytes of French).
> Optionally, it also returns a vector of text spans with the language of
> each identified. This may be useful for applying different
> spelling-correction dictionaries or different machine translation
> requests to each span. The design target is web pages of at least 200
> characters (about two sentences); CLD2 is not designed to do well on
> very short text, lists of proper names, part numbers, etc.
>
> CLD2 is a Naïve Bayesian classifier, using one of three different token
> algorithms. For Unicode scripts such as Greek and Thai that map
> one-to-one to detected languages, the script defines the result. For
> the 80,000+ character Han script and its CJK combination with Hiragana,
> Katakana, and Hangul scripts, single letters (unigrams) are scored. For
> all other scripts, sequences of four letters (quadgrams) are scored.
>
> Scoring is done exclusively on lowercased Unicode letters and marks,
> after expanding HTML entities &xyz; and after deleting digits,
> punctuation, and <tags>. Quadgram word beginnings and endings
> (indicated here by underscore) are explicitly used, so the word _look_
> scores differently from the word-beginning _look or the mid-word look.
> Quadgram single-letter "words" are completely ignored. For each letter
> sequence, the scoring uses the 3-6 most likely languages and their
> quantized log probabilities. The training corpus is manually
> constructed from chosen web pages for each language, then augmented by
> careful automated scraping of over 100M additional web pages.
>
> Several embellishments improve the basic algorithm: additional scoring
> of some sequences of two CJK letters or eight other letters; scoring
> some words and word pairs that are distinctive within sets of
> statistically-close languages such as {Malay, Indonesian} or {Spanish,
> Portuguese, Galician}; removing repetitive sequences/words that would
> otherwise skew the scoring, such as “jpg” in “foo.jpg bar.jpg baz.jpg”;
> removing web-specific words that convey almost no language information
> such as page, link, click, td, tr, copyright, wikipedia, http.
>
> Several hints can be supplied. Because these can be inaccurate on web
> pages, they are just hints -- they add a bias but do not force a
> specific language to be the detection result. The hints include
> expected language, original document encoding, document URL top-level
> domain name, and embedded <…lang=xx …> language tags.
>
> The table-driven extraction of letter sequences and table-driven
> scoring is highly optimized for both space and speed, running about 10x
> faster than other detectors and covering over 70 languages in 1.8MB of
> x86 code and tables. The main quadgram lookup table consists of 256K
> four-byte entries, covering about 50 languages. Detection over the
> average web page of 30KB (half tags/digits/punctuation, half letters)
> takes roughly 1 msec on a current x86 processor.
>
> CLD2 is an update of the prior CLD, adding more languages, updating to
> Unicode 6.2 characters, improving scoring, and adding the optional
> output vector of labelled language spans.
>
> These 83 languages are detected: Afrikaans Albanian Arabic Armenian
> Azerbaijani Basque Belarusian Bengali Bihari Bulgarian Catalan Cebuano
> Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi Dutch English
> Estonian Finnish French Galician Ganda Georgian German Greek Gujarati
> Haitian_Creole Hebrew Hindi Hmong Hungarian Icelandic Indonesian
> Inuktitut Irish Italian Javanese Japanese Kannada Khmer Kinyarwanda
> Korean Laothian Latvian Limbu Lithuanian Macedonian Malay Malayalam
> Maltese Marathi Nepali Norwegian Oriya Persian Polish Portuguese
> Punjabi Romanian Russian Scots_Gaelic Serbian Sinhalese Slovak
> Slovenian Spanish Swahili Swedish Syriac Tagalog Tamil Telugu Thai
> Turkish Ukrainian Urdu Vietnamese Welsh Yiddish.
>
>
> Useful for the upcoming poedit 1.8 release.
>
>
> --
> To UNSUBSCRIBE, email to [email protected]
> with a subject of "unsubscribe". Trouble? Contact [email protected]
> Archive:
> https://lists.debian.org/[email protected]
>

--
http://fam-tille.de
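For anyone reading this in the archive: the quadgram rule in the quoted description (lowercase letters only, tags/digits/punctuation dropped, word boundaries marked with an underscore, single-letter words ignored) can be illustrated with a rough Python sketch. This is not CLD2 code and does not use any CLD2 API. The function name `quadgrams`, the regular expressions, and the handling of words shorter than four letters are my own assumptions for illustration, and combining marks are not handled.

```python
import html
import re


def quadgrams(text):
    """Illustrative only: boundary-marked four-letter sequences.

    Hypothetical helper, not part of CLD2. A leading "_" marks a
    sequence that starts a word, a trailing "_" one that ends a word.
    """
    text = re.sub(r"<[^>]*>", " ", text)   # delete <tags>
    text = html.unescape(text).lower()     # expand entities, lowercase
    # Runs of letters only; digits, punctuation and "_" act as separators.
    words = re.findall(r"[^\W\d_]+", text)
    grams = []
    for word in words:
        if len(word) < 2:
            continue                       # single-letter "words" ignored
        last_start = max(len(word) - 4, 0)  # assumption: short words give one gram
        for i in range(last_start + 1):
            prefix = "_" if i == 0 else ""
            suffix = "_" if i == last_start else ""
            grams.append(prefix + word[i:i + 4] + suffix)
    return grams


if __name__ == "__main__":
    print(quadgrams("Look"))        # ['_look_']
    print(quadgrams("a looking"))   # ['_look', 'ooki', 'okin', 'king_']
    print(quadgrams("overlooked"))  # contains the mid-word 'look'
```

The three calls show the distinction the description draws between the whole word _look_, the word-beginning _look, and the mid-word look. How CLD2 actually scores these sequences (the lookup tables and quantized log probabilities) is not modelled here.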

