Hi, this is an update on my adventures with charlifter.
I finally found a few moments to dive into charlifter (on Mac OS X, since I
have not committed to working on my Windows machine just yet). I have a
Spanish corpus and a language package, charlifter-es-0.01, but I can't yet
try them out together. Because I downloaded the language pack, I figure I can
skip training; however, I've hit a roadblock and can't get any further.
What I am stuck on is the installation of the pre-trained language package
I downloaded from Lingala NLP on SourceForge. My best guess is that it is
either waiting for me to enter "$data", or sf.pl cannot find what it needs.
Does es-probs.txt, or the entire charlifter-es-0.01 package, need to be
saved in an sf.pl-accessible location? I did not change any paths in the
makefile.
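For what it's worth, here is a quick sanity check I put together myself (just
a diagnostic sketch, not part of charlifter) to confirm whether the probs file
is actually visible from the directory where "make install" runs, since the
sf.pl excerpt further down opens "$lang_arg-probs.txt" with a relative path:

#!/usr/bin/perl
# Diagnostic sketch of my own: is the probs file sf.pl expects readable
# from the current directory?
use strict;
use warnings;
use Cwd qw(getcwd);

my $lang = 'es';                 # language code I pass with -l
my $file = "$lang-probs.txt";    # the relative path sf.pl opens
if ( -r $file ) {
    print "OK: $file is readable from ", getcwd(), "\n";
} else {
    print "Missing: $file is not readable from ", getcwd(), "\n";
}

Running that from my charlifter-es-0.01 directory at least tells me whether
the file is where sf.pl will look for it.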
Typing in "*make install*" within my language package directory at Terminal
gives this...
*sf.pl <http://sf.pl> -m -l es*
*Reading in plain text hashes...*
which corresponds to this code in sf.pl, around line 185:
print "Reading in plain text hashes...\n"; $|++;
# Slurp the whole probs file as UTF-8; if it cannot be opened, $data stays undef.
my $data = do {
    if ( open my $fh, '<:utf8', "$lang_arg-probs.txt" ) { local $/; <$fh> }
    else { undef }
};
eval $data;
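If I am reading that right, a failed open just leaves $data undefined, so
"eval $data" silently does nothing. To make sure I understood it, I sketched
how the same read would look with loud failures (my own sketch under that
assumption, not the actual sf.pl code; it assumes $lang_arg is set as above):

use Cwd qw(getcwd);

# Sketch only: same slurp as above, but die with a clear message instead of
# silently leaving $data undefined when the probs file cannot be opened.
open my $fh, '<:utf8', "$lang_arg-probs.txt"
    or die "Cannot open $lang_arg-probs.txt from ", getcwd(), ": $!\n";
my $data = do { local $/; <$fh> };
close $fh;
eval $data;
die "eval of $lang_arg-probs.txt failed: $@" if $@;

That at least would tell me whether the problem is the file location or
something further along.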
On a separate note, charlifter itself does seem to be installed correctly:
running "sf.pl" with no arguments at the command line prints the usage help,
so that part looks fine to me:
Usage: cat FULL-DIACRITIC-TRAININGCORPUS | sf.pl -t -l XX
or cat TEXTTOFIX | sf.pl -r [-d DATAPATH] -l XX
or cat FULL-DIACRITIC-TESTCORPUS | sf.pl -e [-d DATAPATH] -l XX
or sf.pl -m -l XX (used by makefile)
Long versions of the command-line switches are available also:
-t/--train, -r/--restore, -e/--eval, -m/--make, -l/--lang, -d/--dir
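If I am reading that usage correctly, once the data question is sorted out I
would restore diacritics with something like "cat mytext.txt | sf.pl -r -d
/path/to/charlifter-es-0.01 -l es" (mytext.txt and the -d path are just
placeholders; I am assuming -d should point at wherever es-probs.txt lives),
but please correct me if that is not how -d is meant to be used.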
Apologies if this was confusing, redundant or not applicable to this
mailing list.
On 28 February 2014 18:36, Kevin Scannell <[email protected]> wrote:
> On Fri, Feb 28, 2014 at 6:28 PM, Jimmy O'Regan <[email protected]> wrote:
> > On 28 February 2014 18:21, Alex Aruj <[email protected]> wrote:
> >> Hi group,
> >>
> ...
> >> Is the priority to make the charlifter case-sensitive and for it to
> >> respect superblanks exactly as in the example in the box laid out here
> >> http://wiki.apertium.org/wiki/Superblanks?
> >>
> >
> > Respecting superblanks is a must: diacritic restoration must not be
> > applied to them.
> >
> > Case should definitely be _respected_: the output needs to match the
> > input in terms of case.
> >
> > As for case sensitivity, Kevin Scannell is the person to ask for a
> > definitive answer. My feeling is that case sensitivity can
> > potentially be more accurate, but in the absence of sufficient data,
> > case insensitive (trained on lowercase) should be the default.
> >
>
> This is spot on. You'll do better in most cases with case sensitive
> models (e.g. for Jimmy: Irish "Éire" vs. "eire") unless there is very
> limited training data.
>
> For individual cases, you can always try both and see which performs
> better.
>
> Kevin
>
--
Alex