Re: [lingu-dev] Perl support for the hunspell library

Dmitri Gabinski Mon, 10 Jul 2006 05:05:11 -0700

Dmitri, I have absolutely no experience with Java interfacing, therefore I cannot answer your question about the translation memory.


Pity-pity-pity :-(

I think your idea is good, to make spell checking before translation.

Not exactly. You see, what we (users of OmegaT) want is spellcheck DURING translation or right after it. The workflow with OmegaT is as follows (very briefly):

1) prepare files to translate in supported formats;

2) create a project and translate. When you load a project, OmegaT (like actually any CAT tool) splits the text(s) into so-called segments: minimal units to translate. A segment may be a line, a sentence, or a paragraph, depending on file types and settings;

3) create target documents.

So, until you reach step 3, you can't catch any typing mistakes in the translation. The idea is to somehow engage a spellcheck engine to have this ability in OmegaT (possibly with some kind of highlighting of spelling errors). Obviously, Hunspell would be a perfect option: it's free (LGPL, if I'm not mistaken) and it can use MySpell dictionaries, which are already numerous.

If embedding into OmegaT (Java) directly is not possible, is it possible to make a kind of bypass by checking the project's translation memory? (I bet this should be possible with a Perl script!) Some background: OmegaT stores translation memories as TMX files. TMX is an XML application, so it's a well-structured format. All translated segments as described above are stored as pairs of the source text and its translation. The source and the target are clearly labeled with language/locale tags. Such a pair is called a translation unit (TU). Here's an example of such a file:


========================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header
    creationtool="OmegaT"
    creationtoolversion="1"
    segtype="paragraph"
    o-tmf="OmegaT TMX"
    adminlang="EN-US"
    srclang="EN-US"
    datatype="plaintext"
  >
  </header>
  <body>
    <tu>
      <tuv lang="EN-US">
        <seg>Cancel</seg>
      </tuv>
      <tuv lang="PL-PL">
        <seg>Anuluj</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="EN-US">
        <seg>Close</seg>
      </tuv>
      <tuv lang="PL-PL">
        <seg>Zamknij</seg>
      </tuv>
    </tu>
  </body>
</tmx>
==========================================================
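To make the structure concrete, here is a minimal sketch of reading such a file and listing the translation units as (source, target) pairs. It is written in Python rather than Perl purely for illustration; the function name read_units and the inline sample are assumptions made for this example, and the XML declaration and DOCTYPE line are left out of the sample to keep it short.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above (XML declaration and DOCTYPE omitted).
SAMPLE_TMX = """<tmx version="1.1">
  <header creationtool="OmegaT" srclang="EN-US" datatype="plaintext"></header>
  <body>
    <tu>
      <tuv lang="EN-US"><seg>Cancel</seg></tuv>
      <tuv lang="PL-PL"><seg>Anuluj</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-US"><seg>Close</seg></tuv>
      <tuv lang="PL-PL"><seg>Zamknij</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return (srclang, [(source, target), ...]) for a TMX document.

    Hypothetical helper: a <tuv> whose "lang" equals the header's
    "srclang" is treated as the source, any other <tuv> as the target.
    """
    root = ET.fromstring(tmx_text)
    srclang = root.find("header").get("srclang")
    units = []
    for tu in root.iter("tu"):
        source, target = None, None
        for tuv in tu.findall("tuv"):
            text = tuv.findtext("seg", default="")
            if tuv.get("lang") == srclang:
                source = text
            else:
                target = text
        units.append((source, target))
    return srclang, units


if __name__ == "__main__":
    lang, pairs = read_units(SAMPLE_TMX)
    for source, target in pairs:
        print(f"{lang}: {source!r} -> {target!r}")
```

This loads the whole file into memory, which is fine for small memories; a streaming parser would suit large ones better.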

So, I envisage a scenario approximately like this:

1) run a script that reads and parses a TM file (AFAIK, Perl has libraries for handling XML);

2) the script reads each segment (I guess SAX would be OK), checks only the translations (i.e., the contents of those <tuv></tuv> whose “lang” attribute is DIFFERENT from the “srclang” in the header), and somehow displays mistakes;

3) it would be cool to have also the ability to correct mistakes.
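Steps 1 and 2 could look roughly like the sketch below. It is only an illustration under several assumptions: it is in Python rather than Perl, it uses the standard xml.sax module, and the spell checker is replaced by a plain set of known words (known_words) standing in for a real engine such as Hunspell. The names TargetSegmentChecker and check_tmx are made up for this example. Correcting mistakes (step 3) is not covered.

```python
import re
import xml.sax


class TargetSegmentChecker(xml.sax.ContentHandler):
    """SAX handler that collects unknown words from target segments only."""

    def __init__(self, known_words):
        super().__init__()
        self.known_words = known_words  # stand-in for a real dictionary
        self.srclang = None             # taken from <header srclang="...">
        self.current_lang = None        # "lang" of the enclosing <tuv>
        self.in_seg = False
        self.buffer = []
        self.mistakes = []              # (lang, segment, [unknown words])

    def startElement(self, name, attrs):
        if name == "header":
            self.srclang = attrs.get("srclang")
        elif name == "tuv":
            self.current_lang = attrs.get("lang")
        elif name == "seg":
            self.in_seg = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self.in_seg:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "seg":
            self.in_seg = False
            # Skip source segments: check only where lang differs from srclang.
            if self.current_lang != self.srclang:
                segment = "".join(self.buffer)
                unknown = [w for w in re.findall(r"\w+", segment)
                           if w.lower() not in self.known_words]
                if unknown:
                    self.mistakes.append((self.current_lang, segment, unknown))
        elif name == "tuv":
            self.current_lang = None


def check_tmx(tmx_text, known_words):
    """Parse TMX text and return the list of suspicious target segments."""
    handler = TargetSegmentChecker(known_words)
    xml.sax.parseString(tmx_text.encode("utf-8"), handler)
    return handler.mistakes


if __name__ == "__main__":
    # Sample with a deliberate typo ("Zamknji") in the second translation.
    sample = """<tmx version="1.1">
      <header srclang="EN-US"></header>
      <body>
        <tu><tuv lang="EN-US"><seg>Cancel</seg></tuv>
            <tuv lang="PL-PL"><seg>Anuluj</seg></tuv></tu>
        <tu><tuv lang="EN-US"><seg>Close</seg></tuv>
            <tuv lang="PL-PL"><seg>Zamknji</seg></tuv></tu>
      </body>
    </tmx>"""
    for lang, segment, words in check_tmx(sample, {"anuluj", "zamknij"}):
        print(f"[{lang}] {segment!r}: unknown words {words}")
```

The word-set lookup is deliberately naive; with a real spell checker, only the membership test in endElement would need to change.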

Something like this. Well, I understand, it can be a real job. But maybe?

I'll also send a copy of this letter to the OmegaT group. Maybe someone there can suggest something.


I'm afraid, I did not say this, though I should: THANK YOU :-)

Best regards,

Dmitri Gabinski
