[MarkLogic Dev General] question about xdmp:encoding-language-detect

Jakob Fix Sat, 28 Feb 2015 14:00:33 -0800

Hello,

Using MarkLogic 7.0-3, xdmp:encoding-language-detect, given more than
3,500 characters of Latvian news story text, returns Croatian twice and
Serbian once in the top three results:


<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>hr</language>
  <score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>utf-8</encoding>
  <language>sr</language>
  <score>6.882</score>
</encoding-language>
...

and no Latvian in sight. Google Translate and detectlanguage.com both
identify the language correctly, and with high confidence.
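
For reference, a minimal sketch of the kind of call that produces
output like the above. The document URI is hypothetical, and the
explicit sort by score is an assumption added here so the top three
do not depend on the order in which the function returns its guesses:

```xquery
xquery version "1.0-ml";

(: Namespace taken from the result elements shown above. :)
declare namespace eld = "xdmp:encoding-language-detect";

(: Hypothetical URI: substitute a document holding the Latvian text. :)
let $doc := fn:doc("/news/latvian-story.txt")

(: Sort the guesses by score, highest first. :)
let $guesses :=
  for $g in xdmp:encoding-language-detect($doc)
  order by xs:double($g/eld:score) descending
  return $g

(: Print language, encoding and score for the three best guesses. :)
for $g in fn:subsequence($guesses, 1, 3)
return fn:concat(
  fn:string($g/eld:language), " (",
  fn:string($g/eld:encoding), "): ",
  fn:string($g/eld:score)
)
```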

Can someone explain the reason for this low confidence and the wrong
detection? Do I need the right language pack (I'm playing around with
the developer licence, which I thought was full-featured)? Is this
something that needs training? The documentation doesn't say so.

Thanks!

cheers,
Jakob.