Hi Miquel,
Thank you. I found out that the script in fact uses a copy of the Moses
script, bitextor/trunk/utils/tokenizer.perl, and keeps its non-breaking
prefixes in bitextor/trunk/utils/nonbreaking_prefixes

I should have looked more into the script ...

Anyhow, I don't need to create any non-breaking prefixes; I can either
copy the list from Moses to bitextor or use the Moses script instead.
Excellent.

Yours,
Per Tunedal

On Mon, Mar 3, 2014, at 14:35, Miquel Esplà wrote:

Hi Per,

the script that prints that warning message is the tokeniser provided
with Moses (you can find this script here:
[1]https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl).
There is a sub-folder somewhere in the bitextor code called
nonbreaking_prefixes, containing a collection of files with these lists
for several languages. This sub-folder is also provided with Moses:
[2]https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes.
I don't know how to create these files automatically, but if you look at
this sub-folder in the Moses repository, you will see that there is
already one for Swedish
([3]https://github.com/moses-smt/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.sv),
so you only have to download it and put it in the sub-folder on your
system.

Best,

Miquel.
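
A minimal sketch of the step Miquel describes, assuming you already have a
local Moses checkout rather than downloading the single file. The helper name
install_prefix_list and both root directories are hypothetical; only the
relative path inside Moses and the nonbreaking_prefixes folder name come from
this thread. It also assumes the bitextor copy of the tokeniser expects the
same file name as Moses does.

```python
import shutil
from pathlib import Path


def install_prefix_list(moses_root, bitextor_utils, lang="sv"):
    """Copy the Moses non-breaking prefix list for `lang` into bitextor.

    moses_root     -- root of a local mosesdecoder checkout (assumed location)
    bitextor_utils -- bitextor's utils directory (assumed location)
    """
    source = (Path(moses_root) / "scripts" / "share"
              / "nonbreaking_prefixes" / f"nonbreaking_prefix.{lang}")
    if not source.is_file():
        raise FileNotFoundError(source)

    target_dir = Path(bitextor_utils) / "nonbreaking_prefixes"
    target_dir.mkdir(parents=True, exist_ok=True)

    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(source, target_dir))
```

For example, install_prefix_list("mosesdecoder", "bitextor/trunk/utils") would
place nonbreaking_prefix.sv next to the other prefix lists, provided both
directories exist where assumed.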


2014-03-03 12:59 GMT+01:00 Per Tunedal <[4][email protected]>:

Hi again Miquel,
I've manually replaced the variables and the script
bitextor-builddics.sh works like a charm!

I've got a complaint about a missing list of Swedish abbreviations
though:

TOKENISING THE CORPUS...
WARNING: No known abbreviations for language 'sv', attempting fall-back to English version...

Where do I find those lists of abbreviations (which program, which
folder)? It would be quite easy for me to supply such a list, as I've
already done so for Apertium-sv-da and bligner.py.

Yours,
Per Tunedal

On Thu, Feb 20, 2014, at 19:48, Miquel Esplà wrote:

Well, of course you can try to manually replace the variables with paths
(as I told you, you have to replace the variables that start and end
with __). I don't think I can help you much more because I have never
done this, but I'm sure that with a bit of patience you will manage ;)
Good luck!

Cheers,

Miquel.
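
A minimal sketch of that manual substitution, assuming the placeholders in the
template are written in capitals between double underscores. __PREFFIX__ is
the example name given elsewhere in this thread; the helper fill_template and
the path value are hypothetical, and this is not how the bitextor build itself
performs the replacement.

```python
import re

# A placeholder is a capitalised name wrapped in double underscores.
PLACEHOLDER = re.compile(r"__([A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*)__")


def fill_template(text, values):
    """Replace each __NAME__ in `text` with values["NAME"].

    Placeholders without an entry in `values` are left untouched so that
    they are easy to spot afterwards.
    """
    def substitute(match):
        return values.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    line = 'TOKENISER="__PREFFIX__/share/bitextor/utils/tokenizer.perl"'
    # The path inside the string above is illustrative only.
    print(fill_template(line, {"PREFFIX": "/home/per/local"}))
```

Leaving unknown placeholders in place is a deliberate choice here: a search for
"__" in the result shows which variables still need a value.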

---snip---


> >
> > I'm sorry, I didn't explain it well: as I said, [5]bitextor-builddics.in
> > is only the template of the script. What I didn't say is that you need
> > to compile the project to get the true script. If you have a look into
> > the code of the template, you will see that there are many variables
> > starting and ending with "__" (such as __PREFFIX__). These variables are
> > replaced by the corresponding paths at compilation time. So, to use the
> > script, you have to download the whole trunk directory, and then to run:
> > ./autogen.sh
> > ./configure
> > make
> > make install
> >
> > As you know, you can use the option --prefix=LOCALDIR when running
> > ./configure to install bitextor in a specific path (for example
> > LOCALDIR could be /home/per/local/).
> >
> > Best,
> >
> > Miquel.
> >
> >
> >
> >  Yours,
> > Per Tunedal
> >
> > On Tue, Feb 18, 2014, at 12:38, Miquel Esplà wrote:
> >
> >  Hi Per,
> >
> > I think that the explanation in this website:
> > [6]http://rali.iro.umontreal.ca/rali/?q=en/node/1325 is quite useful. It
> > helps a lot to understand the structure and the content of each file
> > generated by OmegaT.
> >
> > About the script, in the last release of bitextor we included a script
> > called "bitextor-builddics" (you can find the template of this script here:
> > [7]https://svn.code.sf.net/p/bitextor/code/trunk/bitextor-builddics.in)
> > which uses GIZA++ to obtain a plain text bilingual dictionary, but only
> > including pairs of words fulfilling: a) both words occur at least 10 times
> > in the corpus, and b) the harmonic mean of their probabilities in both
> > probabilistic dictionaries (S -> T and T -> S) is higher than 0.2. If you
> > want to use this, I recommend you to use the version in the trunk, which
> > fixes some minor bugs still present in the release.
> >
> > Best,
> >
> > Miquel.
> >--snip---
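
The two filtering conditions Miquel lists for bitextor-builddics (both words
seen at least 10 times, harmonic mean of the two translation probabilities
above 0.2) can be sketched as follows. This is an illustration of the
criteria only, not the script's actual code: the function names and the
in-memory data layout are assumptions, and the real script works on GIZA++
output files.

```python
def harmonic_mean(p, q):
    """Harmonic mean of two probabilities; 0.0 if either is not positive."""
    if p <= 0 or q <= 0:
        return 0.0
    return 2 * p * q / (p + q)


def filter_pairs(probs, src_counts, trg_counts, min_count=10, min_hmean=0.2):
    """Return the word pairs that satisfy both conditions.

    probs      -- {(src, trg): (p_src_to_trg, p_trg_to_src)}  (assumed layout)
    src_counts -- {src_word: occurrences in the source side of the corpus}
    trg_counts -- {trg_word: occurrences in the target side of the corpus}
    """
    kept = []
    for (src, trg), (p_st, p_ts) in probs.items():
        # a) both words occur at least `min_count` times
        frequent = (src_counts.get(src, 0) >= min_count
                    and trg_counts.get(trg, 0) >= min_count)
        # b) harmonic mean of both directions is strictly above the threshold
        if frequent and harmonic_mean(p_st, p_ts) > min_hmean:
            kept.append((src, trg))
    return kept
```

The harmonic mean is dominated by the smaller of the two probabilities, so a
pair that is likely in only one direction tends to fall below the threshold.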

  ------------------------------------------------------------------------------
  Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
  With Perforce, you get hassle-free workflows. Merge that actually works.
  Faster operations. Version large binaries.  Built-in WAN optimization and the
  freedom to use Git, Perforce or both. Make the move to Perforce.
  [8]http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
  _______________________________________________
  Apertium-stuff mailing list
  [9][email protected]
  [10]https://lists.sourceforge.net/lists/listinfo/apertium-stuff


References

1. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
2. https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
3. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.sv
4. mailto:[email protected]
5. http://bitextor-builddics.in/
6. http://rali.iro.umontreal.ca/rali/?q=en/node/1325
7. https://svn.code.sf.net/p/bitextor/code/trunk/bitextor-builddics.in
8. http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
9. mailto:[email protected]
10. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
11. http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
12. mailto:[email protected]
13. https://lists.sourceforge.net/lists/listinfo/apertium-stuff