Re: [Apertium-stuff] Frequence list of non-translated words

Francis Tyers Sun, 11 Nov 2012 13:31:04 -0800

On Sun, 11 Nov 2012 at 22:23 +0100, Per Tunedal wrote:
> Hi,
> I have tried translating some texts and got the translation in a large 
> text file with all the error codes. I would like a frequency list for
> the words that get a certain error.
> 
> Example:
> 
> Odenses *infrastrukur är präglad av @beliggenhed vid Odense Kanal, som
> *forbinder Odense Hamn med Odense Fjord. Den blev byggd i @åre omkring
> år 1800 och ger entré från vattnet till stadens centrum. *Herudover har
> den #ha betydelse for @infrastruktur vid placeringen av
> *kraftvarmeværket *Fynsværket och den tidigare *losseplads på Stege Ö.
> 
> I looked at the page:
> 
> http://wiki.apertium.org/wiki/One-liners
> 
> and found the scripts:
> 
>      Get unknown words from chunked text and sort by frequency: 
> 
> sed 's/\$\W*\^/$\n^/g' | grep '@' | sed 's/><.*/>$/g' | sort -f | uniq -ci | sort -gr
> 
> tr " " "\n" | grep "@" | tr -d "[:punct:]" | sort | uniq -c | sort -r
> 
> Unfortunately, I cannot work out how to use them. How do I specify the
> input and output files?


Try this:

$ cat ~/corpora/north_germanic_bibles/bible.da/book001.chapter001.txt \
    | apertium -d . da-sv-biltrans \
    | sed 's/\$\W*\^/$\n^/g' | grep '@' \
    | sort -f | uniq -ci | sort -gr

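The file after 'cat' is the corpus you want to use. The list is printed
to the terminal; to save it, append '> some-file.txt' (any file name you
like) to the end of the command.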
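
If you already have the translated text in a file, as in your example, the
second one-liner from the wiki can read that file directly, with no need to
run apertium again. A minimal sketch: the file names sample.txt and freq.txt
and the variable 'mark' are placeholders I made up, and the sample text is
only there so the commands run as-is. I have also used 'sort -rn' instead of
the wiki's 'sort -r' so the counts are compared as numbers.

```shell
# Which error mark to count: '@', '*' or '#'. (Hypothetical variable name.)
mark='@'

# Made-up sample input so the sketch is runnable; use your own file instead.
printf '%s\n' \
    'den #ha betydelse for @infrastruktur vid placeringen' \
    'praglad av @beliggenhed vid Odense Kanal' \
    'mer @infrastruktur.' > sample.txt

# One word per line, keep the marked words, strip punctuation (this also
# removes the mark itself), then count duplicates and sort by frequency.
tr ' ' '\n' < sample.txt \
    | grep -F -- "$mark" \
    | tr -d '[:punct:]' \
    | sort | uniq -c | sort -rn > freq.txt

cat freq.txt
```

Here '< sample.txt' is the input file and '> freq.txt' is the output file;
change 'mark' to '*' or '#' to count the other kinds of error.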

> BTW What's the scripting language?

That's bash. The commands chained together with '|' (sed, grep, sort,
uniq) are standard Unix tools.

Fran


