Hello Jakob Fix,
Can you please remove me from the list? I ask you personally, because I have asked generally for over two years, but I am still on it. Your last name is "Fix," so maybe you can actually "Fix" it :-)
Thanks,
--Alex
-------- Original Message --------
Subject: Re: [MarkLogic Dev General] question about xdmp:encoding-language-detect
From: Jakob Fix <[email protected]>
Date: Fri, March 27, 2015 12:09 pm
To: MarkLogic Developer Discussion <[email protected]>
Thanks for your respective answers. My concern is that I've tried two
other detection services: the obvious one, Google's translation
service, which detects the language automatically, and another one
called detectlanguage.com, which provides an API. Both correctly
detected the language in the exact same text sample that I used with
MarkLogic's language detection feature.
cheers,
Jakob.
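
Justin's suggestion in the quoted message below, using what is known about a source to filter the detector's candidates after the fact, could look roughly like the following. This is only a sketch under stated assumptions: it works on the XML results quoted further down in this thread, wrapped in a hypothetical <results> element so they parse as one document, and the function name pick_language, the set of expected languages, and the fallback value are all made up for illustration. It does not call MarkLogic itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by the result elements quoted in this thread.
NS = {"d": "xdmp:encoding-language-detect"}

# The three results from the original report, wrapped in a hypothetical
# <results> root so that they form one well-formed document.
SAMPLE = """<results>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>hr</language><score>7.081</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>hr</language><score>7.012</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
<encoding>utf-8</encoding><language>sr</language><score>6.882</score>
</encoding-language>
</results>"""


def pick_language(xml_text, expected, fallback=None):
    """Return the best-scoring detected language that is in `expected`.

    `expected` is the set of language codes this source is known to
    publish in (domain knowledge, an assumption of this sketch).
    If no candidate matches, return `fallback`.
    """
    root = ET.fromstring(xml_text)
    candidates = [
        (
            el.findtext("d:language", namespaces=NS),
            float(el.findtext("d:score", namespaces=NS)),
        )
        for el in root.findall("d:encoding-language", NS)
    ]
    allowed = [c for c in candidates if c[0] in expected]
    if not allowed:
        return fallback
    # Highest score wins among the candidates that survive the filter.
    return max(allowed, key=lambda c: c[1])[0]


if __name__ == "__main__":
    # A source known to publish only Latvian or Lithuanian: every
    # detected candidate is rejected, so the fallback is used.
    print(pick_language(SAMPLE, {"lv", "lt"}, fallback="lv"))
```

With the sample above, none of the detected candidates is in the expected set, so the result comes entirely from the fallback. In other words, the filter can reject an implausible answer, but it cannot recover a language the detector never proposed.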
On Fri, Mar 27, 2015 at 5:01 PM, Justin Makeig
<[email protected]> wrote:
> Jakob,
> Are there any other markers specific to your domain that could help you triangulate? The built-in detection doesn't (and can't) know the context of your business. Some pre- or post-detection analysis might help you narrow down the candidates. For example, is a specific source known not to have Croatian or Serbian content, but might have Latvian? Are there entities (e.g. names, addresses, etc.) that are decent indicators of Latvian? I don't know the specifics of your app or content, but there might be other context that you could pull in to enhance the out-of-the-box identification.
>
> Justin
>
>
> --
> Justin Makeig
> Director, Product Management
> MarkLogic
> [email protected]
> +1 (650) 655-2387
>
>> On Mar 27, 2015, at 8:44 AM, Jakob Fix <[email protected]> wrote:
>>
>> Thanks, Mary, for your quick reply. I understand the explanation,
>> but it doesn't resolve my initial problem.
>> Any idea how to solve this in the short term, and whether there are
>> improvements in the pipeline? Or is this not a high priority?
>>
>> cheers,
>> Jakob.
>>
>>
>> On Fri, Mar 27, 2015 at 4:34 PM, Mary Holstege
>> <[email protected]> wrote:
>>> On Fri, 27 Mar 2015 08:23:19 -0700, Jakob Fix <[email protected]> wrote:
>>>
>>>> Hello, I think this message got lost when the mailing list was down in
>>>> February (or nobody has an answer ...)
>>>>
>>>> Thanks,
>>>> Jakob.
>>>
>>> The xdmp:encoding-language-detect function uses the ICU libraries to do
>>> the detection. Serbian and Croatian are very closely related to each other
>>> and share some orthography with Latvian (although not a great deal of
>>> linguistic similarity, it must be said). I think the ICU libraries
>>> probably lack some of the linguistic sophistication of Google's backend.
>>>
>>> It has nothing to do with the licensing options.
>>>
>>> //Mary
>>>
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Jakob Fix <[email protected]>
>>>> Date: Sat, Feb 28, 2015 at 10:59 PM
>>>> Subject: question about xdmp:encoding-language-detect
>>>> To: General Mark Logic Developer Discussion
>>>> <[email protected]>
>>>>
>>>>
>>>> Hello,
>>>>
>>>> using ML7.0-3, the above function, given more than 3500 characters of
>>>> Latvian news story text, returns Croatian twice and Serbian once in
>>>> the top three results:
>>>>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>> <encoding>utf-8</encoding>
>>>> <language>hr</language>
>>>> <score>7.081</score>
>>>> </encoding-language>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>> <encoding>utf-8</encoding>
>>>> <language>hr</language>
>>>> <score>7.012</score>
>>>> </encoding-language>
>>>> <encoding-language xmlns="xdmp:encoding-language-detect">
>>>> <encoding>utf-8</encoding>
>>>> <language>sr</language>
>>>> <score>6.882</score>
>>>> </encoding-language>
>>>> ...
>>>>
>>>> and no Latvian in sight. Google translate as well as
>>>> detectlanguage.com correctly and with sufficient self-assurance return
>>>> the correct result.
>>>>
>>>> Can someone explain what the reason behind this lack of confidence and
>>>> the wrong detection is? Do you need the right language pack (I'm
>>>> playing around with the developer licence which I thought is
>>>> full-featured)? Is this something that needs training? The doc doesn't
>>>> say so.
>>>>
>>>> Thanks!
>>>>
>>>> cheers,
>>>> Jakob.
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
