Re: [Mt-list] CFP: WAT2021 (The 8th Workshop on Asian Translation)

Adam Bittlingmayer Fri, 22 Jan 2021 01:02:39 -0800

Convenience.

Bicleaner and Zipporah are great tools that take a bit more technical work
to customize and use.


LASER is amazing, but it's really more for cross-language tasks like
toxicity classification.  It was never intended specifically for this task.
So if the Chinese translation is actually just English (or has untranslated
words), or is in Japanese, pre-trained LASER won't catch it, because the
"distance" between the two sentences is indeed low.  The same goes for
issues like negation or mismatched numbers.
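
To make those cases concrete, here is a minimal sketch of cheap surface
checks that can sit next to an embedding score.  It is only an illustration,
not how ModelFront, Bicleaner or LASER work.  The function names, the
4-letter word cutoff and the 0.5 threshold are all assumptions, and it does
nothing about negation.

```python
import re
import unicodedata

# Digit runs, optionally with . or , separators (e.g. 1,500 or 3.5).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Runs of 4+ letters in any script (hypothetical cutoff to skip short words).
WORD_RE = re.compile(r"[^\W\d_]{4,}")


def numbers(text):
    """Sorted digit strings found in text, with separators stripped.

    Crude on purpose: 1,500 and 1.500 compare equal, but so do 1.5 and 15.
    """
    return sorted(re.sub(r"[.,]", "", m) for m in NUMBER_RE.findall(text))


def surface_flags(source, target, copied_ratio=0.5):
    """Return a list of flags for one sentence pair.

    "copy"         - target is the source, ignoring case and outer whitespace
    "untranslated" - at least copied_ratio of the longer source words
                     appear verbatim in the target
    "numbers"      - the numbers on the two sides do not match
    """
    # NFKC folds full-width digits and letters to their ASCII forms.
    src = unicodedata.normalize("NFKC", source).strip().lower()
    tgt = unicodedata.normalize("NFKC", target).strip().lower()

    flags = []
    if src == tgt:
        flags.append("copy")
    else:
        words = WORD_RE.findall(src)
        kept = [w for w in words if w in tgt]
        if words and len(kept) / len(words) >= copied_ratio:
            flags.append("untranslated")
    if numbers(src) != numbers(tgt):
        flags.append("numbers")
    return flags
```

For example, surface_flags("It costs 20 dollars", "它要30美元") flags the
pair for its numbers, even though an embedding distance would likely be low.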

Paracrawl, for example, has been cleaned with Bicleaner, and WikiMatrix
with LASER.  But when you run them through ModelFront, you still find
plenty of dirty, dirty sentence pairs.



On Wed, 20 Jan 2021 at 15:55, Nerses Nersesyan <nersesyanner...@gmail.com>
wrote:

> How's it different than Bicleaner or LASER?
>
> On Tue, Jan 19, 2021 at 4:09 PM <mt-list-requ...@eamt.org> wrote:
>
>> Message: 1
>> Date: Tue, 19 Jan 2021 12:00:26 +0400
>> From: Adam Bittlingmayer <a...@modelfront.com>
>> To: Toshiaki Nakazawa <nakaz...@logos.t.u-tokyo.ac.jp>
>> Cc: mt-list@eamt.org
>> Subject: Re: [Mt-list] CFP: WAT2021 (The 8th Workshop on Asian
>>         Translation)
>>
>> > How well is it working for low-resource langs?
>>
>> We try to support all language pairs.  I've tried Inuktitut-English and
>> Hindi-Marathi, for example.
>>
>> The main factors are:
>>
>> 1. How dirty your parallel corpus is
>> In that sense, low-resource languages are often easier.  The relative
>> ranking just needs to be working.
>>
>> 2. How much data we have for the language
>> My own language (Alemannic) is *not* working well.  It's not in Mozilla
>> TMs, BERT or LASER, and has no standard orthography.  But a language like
>> Armenian, with a smaller number of speakers and lower GDP, is working
>> better, because its Wikipedia is top-notch and its unique script makes it
>> easy to identify.  In this conference, I expect Oriya/Odia and Khmer will
>> be the toughest.
>>
>> 3. How much data we have for the *pair*
>> We have seen Hindi-Marathi and Russian-Armenian working decently, but they
>> are well-established pairs with a lot of cultural overlap (Sprachbund).
>>
>> 4. Your use case
>> Training from scratch for a generic system on very large datasets is
>> different than fine-tuning for a domain on small data.  (For the former,
>> you usually want strict 1:1ness, e.g. miles should not convert to
>> kilometres.)  It won't work well out of the box if you're doing
>> adversarial attacks or need it calibrated across language pairs.
>>
>> 5. Whether the low-resource language is the source or the target language
>> Just imagine a human doing this, who only knows one of the languages.
>>
>> There is an unknown language option (*other UND*) so you can even try it
>> on languages not in the dropdown.  That works better if it's the source
>> language, not the target language.
>>
>> If you see issues or have data that can improve a language pair, let me
>> know.
>
>
> --
> Best regards,
> Nerses Nersesyan

_______________________________________________
Mt-list mailing list
Mt-list@eamt.org
http://lists.eamt.org/mailman/listinfo/mt-list
