Jakob,

Are there any other markers specific to your domain that could help you triangulate? The built-in detection doesn't (and can't) know the context of your business, so some pre- or post-detection analysis might help you narrow down the candidates. For example, is a specific source known to have no Croatian or Serbian content, but possibly Latvian? Are there entities (e.g. names, addresses) that are decent indicators of Latvian? I don't know the specifics of your app or content, but there might be other context that you could pull in to enhance the out-of-the-box identification.
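As a rough illustration of that kind of post-detection step, here is a minimal Python sketch (not MarkLogic code). It assumes the detector's results have already been extracted as (language, score) pairs, best first, and that you keep a per-source set of plausible languages. The names `refine_language`, `source_languages`, and the two letter sets are hypothetical, introduced only for this example; the orthographic check is a simple heuristic, not something the built-in detection does.

```python
# Letters of the Latvian alphabet that Croatian/Serbian Latin script does not use.
LATVIAN_ONLY = set("āēģīķļņū")
# Letters of Croatian/Serbian Latin script that Latvian does not use.
SOUTH_SLAVIC_ONLY = set("ćđ")


def refine_language(text, detected, source_languages):
    """Pick a language using business context on top of raw detection.

    text: the content that was analysed.
    detected: list of (language_code, score) pairs, best first.
    source_languages: set of codes this source is known to publish (assumed input).
    """
    lowered = text.lower()
    lv_hits = sum(ch in LATVIAN_ONLY for ch in lowered)
    ss_hits = sum(ch in SOUTH_SLAVIC_ONLY for ch in lowered)

    # Heuristic override: Latvian-only letters present, no South Slavic-only
    # letters, and the source is known to publish Latvian.
    if "lv" in source_languages and lv_hits > 0 and ss_hits == 0:
        return "lv"

    # Otherwise take the best-scoring candidate the source can plausibly contain.
    for lang, _score in detected:
        if lang in source_languages:
            return lang

    # No context applies: fall back to the detector's top result, if any.
    return detected[0][0] if detected else None


# Candidate list shaped like the results quoted further down in this thread.
detected = [("hr", 7.081), ("hr", 7.012), ("sr", 6.882)]
latvian_sample = "Rīgā šodien ir saulains laiks, ziņo aģentūra."
print(refine_language(latvian_sample, detected, {"lv", "en"}))
```

The point of the design is that the source-level knowledge decides between close candidates; the detector's scores are only used for ordering.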
Justin

--
Justin Makeig
Director, Product Management
MarkLogic
[email protected]
+1 (650) 655-2387

> On Mar 27, 2015, at 8:44 AM, Jakob Fix <[email protected]> wrote:
>
> Thanks Mary for your quick reply. It's an explanation that I
> understand, but this doesn't resolve my initial problem.
> Any idea how to solve this in the short term and whether there are
> improvements in the pipeline? Or that it's not a high priority?
>
> cheers,
> Jakob.
>
>
> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
> <[email protected]> wrote:
>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <[email protected]> wrote:
>>
>>> Hello, I think this message got lost when the mailing list was down in
>>> February (or nobody has an answer ...)
>>>
>>> Thanks,
>>> Jakob.
>>
>> The xdmp:encoding-language-detect uses the ICU libraries to do the
>> detection. Serbian and Croatian are very closely related to each other and
>> have some similar orthography to Latvian (although not a great deal of
>> linguistic similarity, it must be said). I think the ICU libraries
>> probably lack some of the linguistic sophistication of Google's backend.
>>
>> It has nothing to do with the licensing options.
>>
>> //Mary
>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Jakob Fix <[email protected]>
>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>> Subject: question about xdmp:encoding-language-detect
>>> To: General Mark Logic Developer Discussion
>>> <[email protected]>
>>>
>>>
>>> Hello,
>>>
>>> using ML7.0-3, the above function, given more than 3500 characters of
>>> Latvian news story text, returns Croatian twice and Serbian once in
>>> the top three results:
>>>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>   <encoding>utf-8</encoding>
>>>   <language>hr</language>
>>>   <score>7.081</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>   <encoding>utf-8</encoding>
>>>   <language>hr</language>
>>>   <score>7.012</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>   <encoding>utf-8</encoding>
>>>   <language>sr</language>
>>>   <score>6.882</score>
>>> </encoding-language>
>>> ...
>>>
>>> and no Latvian in sight. Google translate as well as
>>> detectlanguage.com correctly and with sufficient self-assurance return
>>> the correct result.
>>>
>>> Can someone explain what the reason behind this lack of confidence and
>>> the wrong detection is? Do you need the right language pack (I'm
>>> playing around with the developer licence which I thought is
>>> full-featured)? Is this something that needs training? The doc doesn't
>>> say so.
>>>
>>> Thanks!
>>>
>>> cheers,
>>> Jakob.
_______________________________________________ General mailing list [email protected] http://developer.marklogic.com/mailman/listinfo/general
