Dmitri, I have absolutely no experience with Java interfacing, so I cannot answer your question about the translation memory.

Pity-pity-pity :-(

I think your idea of doing spell checking before translation is good.

Not exactly. You see, what we (users of OmegaT) want is spellcheck DURING translation or right after it. The workflow with OmegaT is as follows (very briefly):
1) prepare files to translate in supported formats;
2) create a project and translate. When you load a project, OmegaT (like any CAT tool, actually) splits the text(s) into so-called segments, the minimal units to translate (a segment may be a line, a sentence, or a paragraph, depending on file types and settings);
3) create target documents.

So, until you complete step 3, you can't catch any typing mistakes in the translation. The idea is to somehow hook in a spellcheck engine to get this ability in OmegaT (possibly with some kind of highlighting of spelling errors). Obviously, Hunspell would be a perfect option: it's free (LGPL, if I'm not mistaken) and it can use MySpell dictionaries, which are already numerous.

If embedding directly into OmegaT (Java) is not possible, would it be possible to make a kind of bypass by checking the project's translation memory? (I bet this should be possible with a Perl script!) Some background: OmegaT stores translation memories as TMX files. TMX is an XML application, so it's a well-structured format. All translated segments, as described above, are stored as pairs of the source text and its translation. The source and the target are clearly labeled with language/locale tags. Such a pair is called a translation unit (TU). Here's an example of such a file:

========================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header
    creationtool="OmegaT"
    creationtoolversion="1"
    segtype="paragraph"
    o-tmf="OmegaT TMX"
    adminlang="EN-US"
    srclang="EN-US"
    datatype="plaintext"
  >
  </header>
  <body>
    <tu>
      <tuv lang="EN-US">
        <seg>Cancel</seg>
      </tuv>
      <tuv lang="PL-PL">
        <seg>Anuluj</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="EN-US">
        <seg>Close</seg>
      </tuv>
      <tuv lang="PL-PL">
        <seg>Zamknij</seg>
      </tuv>
    </tu>
  </body>
</tmx>
==========================================================

So, I envisage a scenario approximately like this:

1) run a script that reads and parses a TM file (AFAIK, Perl has libraries for handling XML);
2) the script reads each segment (I guess SAX would be OK) and checks only translations (i.e., the contents of each <tuv></tuv> whose “lang” attribute is DIFFERENT from the “srclang” in the header) and somehow displays mistakes;
3) it would be cool to also have the ability to correct mistakes.
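
To make steps 1 and 2 above a bit more concrete, here is a rough sketch. It is in Python rather than Perl, only because its standard xml.sax module makes the example self-contained; the same approach should carry over to a Perl SAX parser. The names TargetSegmentCollector, target_segments and find_misspellings are made up for illustration, and known_words is just a plain set of words standing in for a real Hunspell/MySpell dictionary lookup, which is not shown here.

```python
import io
import re
import xml.sax
import xml.sax.handler

# Runs of Unicode letters; digits, underscores and punctuation are skipped.
WORD_RE = re.compile(r"[^\W\d_]+")


class TargetSegmentCollector(xml.sax.handler.ContentHandler):
    """Collect <seg> text from every <tuv> whose language differs from srclang."""

    def __init__(self):
        super().__init__()
        self.srclang = None
        self.current_lang = None
        self.in_seg = False
        self.buffer = []
        self.targets = []  # list of (lang, text) pairs

    def startElement(self, name, attrs):
        if name == "header":
            self.srclang = attrs.get("srclang")
        elif name == "tuv":
            # TMX 1.1 uses "lang"; later TMX versions use "xml:lang".
            self.current_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif name == "seg":
            self.in_seg = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one segment in several chunks, so accumulate them.
        if self.in_seg:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "seg":
            self.in_seg = False
            is_target = (
                self.srclang is not None
                and self.current_lang is not None
                and self.current_lang.lower() != self.srclang.lower()
            )
            if is_target:
                self.targets.append((self.current_lang, "".join(self.buffer)))
        elif name == "tuv":
            self.current_lang = None


def target_segments(tmx_bytes):
    """Return (lang, text) for every translated segment in a TMX document."""
    handler = TargetSegmentCollector()
    parser = xml.sax.make_parser()
    # Do not try to fetch the external DTD named in the DOCTYPE line.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(tmx_bytes))
    return handler.targets


def find_misspellings(tmx_bytes, known_words):
    """Yield (lang, text, unknown_words) for segments with unknown words.

    known_words is a hypothetical stand-in for a real dictionary: a set of
    lowercase words considered correctly spelled.
    """
    for lang, text in target_segments(tmx_bytes):
        unknown = [w for w in WORD_RE.findall(text) if w.lower() not in known_words]
        if unknown:
            yield lang, text, unknown
```

One would read the project's TMX file as bytes, pass it to find_misspellings together with a word set, and print whatever it yields. Step 3 (correcting mistakes and writing them back into the TMX) is not covered by this sketch.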

Something like this. Well, I understand it could be quite a job. But maybe?

I'll also send a copy of this letter to the OmegaT group. Maybe someone there can suggest something.

I'm afraid I haven't said this yet, though I should have: THANK YOU :-)

Best regards,

Dmitri Gabinski