Re: [MarkLogic Dev General] question about xdmp:encoding-language-detect

Justin Makeig Fri, 27 Mar 2015 09:02:08 -0700

Jakob,
Are there any other markers specific to your domain that could help 
you triangulate? The built-in detection doesn't (and can't) know the context of 
your business. Some pre- or post-detection analysis might help you narrow the 
candidates. For example, is a specific source known to have no Croatian or Serbian 
content, but might have Latvian? Are there entities (e.g. names, addresses, 
etc.) that are decent indicators of Latvian? I don't know the specifics of your 
app or content, but there might be other context that you could pull in to 
enhance the out-of-the-box identification.
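
As a rough illustration of the kind of post-detection check described above, here is a minimal Python sketch. It is not part of MarkLogic or ICU. It assumes the detected language code and the raw text are already available to your application; the function names and the `min_markers` threshold are hypothetical and would need tuning on real content. It relies only on the fact that Latvian uses letters with macrons and cedillas (ā, ē, ī, ū, ģ, ķ, ļ, ņ) that Croatian and Serbian Latin do not, while Croatian and Serbian Latin use ć and đ, which Latvian does not.

```python
import unicodedata

# Letters used in Latvian but not in Croatian or Serbian (Latin script).
LATVIAN_ONLY = set("āēīūģķļņ")
# Letters used in Croatian/Serbian Latin but not in Latvian.
SOUTH_SLAVIC_ONLY = set("ćđ")


def marker_counts(text):
    """Return (latvian_only_count, croatian_serbian_only_count) for text."""
    # NFC folds base letter + combining mark into a single code point,
    # so decomposed input is counted the same way as precomposed input.
    normalized = unicodedata.normalize("NFC", text).lower()
    lv = sum(1 for ch in normalized if ch in LATVIAN_ONLY)
    hr_sr = sum(1 for ch in normalized if ch in SOUTH_SLAVIC_ONLY)
    return lv, hr_sr


def adjust_language(detected, text, min_markers=5):
    """Override an hr/sr guess with lv when the orthography points to Latvian.

    detected: language code reported by the detector, e.g. "hr".
    min_markers: hypothetical threshold; tune it on your own content.
    """
    lv, hr_sr = marker_counts(text)
    if detected in ("hr", "sr") and lv >= min_markers and hr_sr == 0:
        return "lv"
    return detected
```

For example, `adjust_language("hr", text)` would return "lv" for a Latvian news story that contains at least five of the Latvian-only letters and no ć or đ, and would leave any other detection result unchanged. A check like this only makes sense if you already know which languages your sources can contain.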


Justin


--
Justin Makeig
Director, Product Management
MarkLogic
[email protected]
+1 (650) 655-2387

> On Mar 27, 2015, at 8:44 AM, Jakob Fix <[email protected]> wrote:
> 
> Thanks Mary for your quick reply. It's an explanation that I
> understand, but this doesn't resolve my initial problem.
> Any idea how to solve this in the short term and whether there are
> improvements in the pipeline? Or is it not a high priority?
> 
> cheers,
> Jakob.
> 
> 
> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
> <[email protected]> wrote:
>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <[email protected]> wrote:
>> 
>>> Hello, I think this message got lost when the mailing list was down in
>>> February (or nobody has an answer ...)
>>> 
>>> Thanks,
>>> Jakob.
>> 
>> The xdmp:encoding-language-detect function uses the ICU libraries to do the
>> detection. Serbian and Croatian are very closely related to each other and
>> have some similar orthography to Latvian (although not a great deal of
>> linguistic similarity, it must be said). I think the ICU libraries
>> probably lack some of the linguistic sophistication of Google's backend.
>> 
>> It has nothing to do with the licensing options.
>> 
>> //Mary
>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: Jakob Fix <[email protected]>
>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>> Subject: question about xdmp:encoding-language-detect
>>> To: General Mark Logic Developer Discussion
>>> <[email protected]>
>>> 
>>> 
>>> Hello,
>>> 
>>> using ML7.0-3, the above function, given more than 3500 characters of
>>> Latvian news story text, returns Croatian twice and Serbian once in
>>> the top three results:
>>> 
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>hr</language>
>>>  <score>7.081</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>hr</language>
>>>  <score>7.012</score>
>>> </encoding-language>
>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>  <encoding>utf-8</encoding>
>>>  <language>sr</language>
>>>  <score>6.882</score>
>>> </encoding-language>
>>> ...
>>> 
>>> and no Latvian in sight. Google translate as well as
>>> detectlanguage.com correctly and with sufficient self-assurance return
>>> the correct result.
>>> 
>>> Can someone explain the reason behind this lack of confidence and
>>> the wrong detection? Do you need the right language pack (I'm
>>> playing around with the developer licence which I thought is
>>> full-featured)? Is this something that needs training? The doc doesn't
>>> say so.
>>> 
>>> Thanks!
>>> 
>>> cheers,
>>> Jakob.
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://developer.marklogic.com/mailman/listinfo/general
>> 
>> 
>> --
>> Using Opera's revolutionary email client: http://www.opera.com/mail/
