Thanks, Mary, for your quick reply. I understand the explanation, but it doesn't resolve my initial problem. Any idea how to work around this in the short term, and whether improvements are in the pipeline? Or is this not a high priority?
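In the meantime, the only stopgap I can think of on my side is to distrust the detection whenever the top scores are nearly tied, as they are in the results quoted below. This is just a rough sketch in Python, run outside MarkLogic on the returned XML. The function names, the `min_ratio` threshold of 1.2 and the fallback value are my own assumptions, not anything the documentation prescribes, and I don't know how the ICU scores are scaled.

```python
import xml.etree.ElementTree as ET

# Namespace of the result elements, as seen in the output quoted in this thread.
NS = {"d": "xdmp:encoding-language-detect"}

# The three results from my original message.
SAMPLE = """
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>hr</language><score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>hr</language><score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>sr</language><score>6.882</score>
</encoding-language>
"""


def parse_guesses(xml_fragments):
    """Return (language, score) pairs sorted by descending score."""
    # The results are sibling elements, so wrap them in a single root to parse.
    root = ET.fromstring("<results>" + xml_fragments + "</results>")
    guesses = []
    for el in root.findall("d:encoding-language", NS):
        lang = el.findtext("d:language", namespaces=NS)
        score = float(el.findtext("d:score", namespaces=NS))
        guesses.append((lang, score))
    return sorted(guesses, key=lambda g: g[1], reverse=True)


def pick_language(guesses, min_ratio=1.2, fallback=None):
    """Accept the top language only if it clearly beats the best rival language.

    min_ratio is a hypothetical threshold; it would need tuning on real data.
    """
    if not guesses:
        return fallback
    top_lang, top_score = guesses[0]
    # Scores of guesses that name a different language than the top one.
    rivals = [score for lang, score in guesses[1:] if lang != top_lang]
    if rivals and top_score < min_ratio * rivals[0]:
        return fallback  # too close to call, do not trust the detection
    return top_lang
```

With the scores above, `pick_language(parse_guesses(SAMPLE))` returns the fallback, because hr at 7.081 is not clearly ahead of sr at 6.882. That doesn't give me Latvian, of course; it only tells me when to fall back to a language known from some other source.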
cheers,
Jakob.

On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege <[email protected]> wrote:
> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <[email protected]> wrote:
>
>> Hello, I think this message got lost when the mailing list was down in
>> February (or nobody has an answer ...)
>>
>> Thanks,
>> Jakob.
>
> The xdmp:encoding-language-detect uses the ICU libraries to do the
> detection. Serbian and Croatian are very closely related to each other and
> have some similar orthography to Latvian (although not a great deal of
> linguistic similarity, it must be said). I think the ICU libraries
> probably lack some of the linguistic sophistication of Google's backend.
>
> It has nothing to do with the licensing options.
>
> //Mary
>
>>
>> ---------- Forwarded message ----------
>> From: Jakob Fix <[email protected]>
>> Date: Sat, Feb 28, 2015 at 10:59 PM
>> Subject: question about xdmp:encoding-language-detect
>> To: General Mark Logic Developer Discussion
>> <[email protected]>
>>
>>
>> Hello,
>>
>> using ML7.0-3, the above function, given more than 3500 characters of
>> Latvian news story text, returns Croatian twice and Serbian once in
>> the top three results:
>>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>> <encoding>utf-8</encoding>
>> <language>hr</language>
>> <score>7.081</score>
>> </encoding-language>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>> <encoding>utf-8</encoding>
>> <language>hr</language>
>> <score>7.012</score>
>> </encoding-language>
>> <encoding-language xmlns="xdmp:encoding-language-detect">
>> <encoding>utf-8</encoding>
>> <language>sr</language>
>> <score>6.882</score>
>> </encoding-language>
>> ...
>>
>> and no Latvian in sight. Google translate as well as
>> detectlanguage.com correctly and with sufficient self-assurance return
>> the correct result.
>>
>> Can someone explain what the reason behind this lack of confidence and
>> the wrong detection is? Do you need the right language pack (I'm
>> playing around with the developer licence which I thought is
>> full-featured)? Is this something that needs training? The doc doesn't
>> say so.
>>
>> Thanks!
>>
>> cheers,
>> Jakob.
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
