Re: [lingu-dev] Perl support for the hunspell library

eleonora46 Mon, 10 Jul 2006 05:33:25 -0700

Dmitri,

Thanks for the information about OmegaT's internals.


The perl interfacing to hunspell is really trivial:

1. create a speller object for a language:
my $speller = Text::Hunspell->new("/.../test.aff", "/.../test.dic");

2. do the spell check:
$speller->check( $word );
(result: 1 if found, 0 if not)

3. if not found, get suggestions:
my @suggestions = $speller->suggest( $misspelled );

4. delete the speller object:
$speller->delete($speller);
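
Put together, steps 1 to 3 might look like the sketch below. It assumes
Text::Hunspell is installed and takes the dictionary paths from the command
line. The helper name report_misspellings is hypothetical and not part of
the module; it accepts any object that offers check() and suggest().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref that maps every word the speller
# rejects to an array ref of its suggestions.
sub report_misspellings {
    my ($speller, @words) = @_;
    my %report;
    for my $word (@words) {
        next if $speller->check($word);                  # step 2
        $report{$word} = [ $speller->suggest($word) ];   # step 3
    }
    return \%report;
}

# Command-line entry point, used only when arguments are given:
#   perl check.pl AFF_FILE DIC_FILE word [word ...]
sub main {
    my ($aff, $dic, @words) = @_;
    die "usage: $0 AFF_FILE DIC_FILE word...\n" unless defined $dic && @words;
    require Text::Hunspell;
    my $speller = Text::Hunspell->new($aff, $dic)        # step 1
        or die "cannot create speller\n";
    my $report = report_misspellings($speller, @words);
    for my $word (sort keys %$report) {
        print "$word: @{ $report->{$word} }\n";
    }
    # Step 4 (delete) is left out of this sketch; the script ends here.
}

main(@ARGV) if @ARGV;
```

Keeping the checking loop separate from the speller construction makes it
possible to try the loop with a stand-in object before the real
dictionaries are available.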

I think the above information helps a bit in designing a spell-checking 
interface for OmegaT. Maybe you could also forward this information to the 
OmegaT group.

There is also a very similar Perl interface to aspell, Text::Aspell (it was my 
model for the hunspell one). Aspell is strong at suggestions, but at the moment 
it lacks forbidden words and twofold affixing.

Regards: Eleonora


> > I think your idea is good, to make spell checking before translation.
> 
> Not exactly. You see, what we (users of OmegaT) want is spellcheck DURING 
> translation or just upon it. The workflow with OmegaT is as follows (most 
> briefly):
> 1) prepare files to translate in supported formats;
> 2) create a project and translate (when you load a project, OmegaT, like 
> any CAT tool, splits the text(s) into so-called segments: minimal units 
> to translate, which may be a line, a sentence, or a paragraph, 
> depending on file types and settings);
> 3) create target documents.
> 
> So, until you complete step 3, you can't catch any typing mistakes in 
> the translation. The idea is to somehow engage a spellcheck engine to have
> this ability in OmegaT (possibly with any kind of highlighting spelling 
> errors). Obviously, Hunspell would be a perfect option: it's free (LGPL, 
> if I'm not mistaken) and it can use MySpell dictionaries which are already
> numerous.
> 
> If embedding directly into OmegaT (Java) is not possible, is it 
> possible to make a kind of bypass by checking the project's translation 
> memory? (I bet this should be possible with a Perl script!) Some 
> background: OmegaT stores translation memories as TMX files. TMX is an 
> XML application, so it's a well-structured format. All translated segments
> as described above are stored as pairs of the source text and its 
> translation. The source and the target are clearly labeled with 
> language/locale tags. Such a pair is called a translation unit (TU). 
> Here's an example of such a file:
> 
> ========================================================================
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE tmx SYSTEM "tmx11.dtd">
> <tmx version="1.1">
>    <header
>      creationtool="OmegaT"
>      creationtoolversion="1"
>      segtype="paragraph"
>      o-tmf="OmegaT TMX"
>      adminlang="EN-US"
>      srclang="EN-US"
>      datatype="plaintext"
>    >
>    </header>
>    <body>
>      <tu>
>        <tuv lang="EN-US">
>          <seg>Cancel</seg>
>        </tuv>
>        <tuv lang="PL-PL">
>          <seg>Anuluj</seg>
>        </tuv>
>      </tu>
>      <tu>
>        <tuv lang="EN-US">
>          <seg>Close</seg>
>        </tuv>
>        <tuv lang="PL-PL">
>          <seg>Zamknij</seg>
>        </tuv>
>      </tu>
>    </body>
> </tmx>
> ==========================================================
> 
> So, I envisage a scenario approximately like this:
> 
> 1) run a script that reads and parses a TM file (AFAIK, Perl has libraries
> for handling XML);
> 2) the script reads each segment (I guess, SAX would be OK) and checks 
> only translations (i.e., the contents of such <tuv></tuv>, where the 
> “lang” attribute is DIFFERENT from the “srclang” in the header) and
> somehow displays mistakes.
> 3) it would be cool to have also the ability to correct mistakes.
> 
> Something like this. Well, I understand, it can be a real job. But maybe?
> 
> I'll also send a copy of this letter to the OmegaT group. Maybe, someone 
> there can suggest something.
> 
> I'm afraid, I did not say this, though I should: THANK YOU :-)
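
The scenario quoted above (parse the TMX file, check only the target
segments, show the mistakes) could be sketched in Perl roughly as follows.
This is only a sketch under assumptions: it uses XML::LibXML with XPath
instead of the SAX approach suggested in the quote, it assumes a UTF-8
dictionary, and the helper names target_segments and misspelled_words are
hypothetical. Correcting mistakes (step 3 of the scenario) is not covered.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the text of every <seg> whose enclosing <tuv>
# has a "lang" attribute different from the header's "srclang".
sub target_segments {
    my ($doc) = @_;
    my $srclang = $doc->findvalue('/tmx/header/@srclang');
    my @segments;
    for my $tuv ($doc->findnodes('/tmx/body/tu/tuv')) {
        my $lang = $tuv->getAttribute('lang') // '';
        next if lc $lang eq lc $srclang;       # skip the source side
        push @segments, $tuv->findvalue('seg');
    }
    return @segments;
}

# Hypothetical helper: return each distinct word that the speller rejects.
# $speller is any object with a check() method, e.g. a Text::Hunspell one.
sub misspelled_words {
    my ($speller, @segments) = @_;
    my (%seen, @bad);
    for my $segment (@segments) {
        for my $word ($segment =~ /\p{L}+/g) {  # runs of letters only
            next if $seen{$word}++;
            push @bad, $word unless $speller->check($word);
        }
    }
    return @bad;
}

# Usage: perl tmxcheck.pl AFF_FILE DIC_FILE TMX_FILE
if (@ARGV == 3) {
    require Text::Hunspell;
    my ($aff, $dic, $tmx) = @ARGV;
    my $speller = Text::Hunspell->new($aff, $dic)
        or die "cannot create speller\n";
    # Do not try to fetch tmx11.dtd; the file is only read, not validated.
    my $doc = XML::LibXML->load_xml(location => $tmx, load_ext_dtd => 0);
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for misspelled_words($speller, target_segments($doc));
}
```

Run against the sample file quoted above with a Polish dictionary, this
would check only "Anuluj" and "Zamknij", because the EN-US segments match
the srclang in the header.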
