[Apertium-stuff] Words frequency and probability of lexical unit

Jonathan Demeyer Tue, 06 Jun 2017 07:25:12 -0700

Hello,

TL;DR:
Is there a way to assign a probability to each lexical unit that a word
can be analysed as?
e.g. "flies"
What I can get with lttoolbox is:
fly<n><pl>/fly<vblex><pres><p3><sg>


I need something more like this:
fly<n><pl> 10%
fly<vblex><pres><p3><sg> 90%
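
A minimal sketch of the kind of output described above, assuming per-analysis
counts are already available from some disambiguated (tagged) corpus. The
function names and the counts in the usage note are hypothetical; the counts
are chosen only to reproduce the 10% / 90% figures in the example, and nothing
here is an lttoolbox API.

```python
def split_analyses(ambiguous: str) -> list[str]:
    """Split 'fly<n><pl>/fly<vblex>...' into its individual lexical units."""
    return [analysis for analysis in ambiguous.split("/") if analysis]


def analysis_probabilities(ambiguous: str, tagged_counts: dict[str, int]) -> dict[str, float]:
    """Turn per-analysis counts into relative probabilities.

    tagged_counts is an assumed input: how often each lexical unit was
    observed in a disambiguated corpus.
    """
    analyses = split_analyses(ambiguous)
    total = sum(tagged_counts.get(a, 0) for a in analyses)
    if total == 0:
        # No evidence for any analysis: fall back to a uniform split.
        return {a: 1.0 / len(analyses) for a in analyses}
    return {a: tagged_counts.get(a, 0) / total for a in analyses}
```

For instance, calling analysis_probabilities with the lttoolbox string above
and the invented counts {"fly<n><pl>": 1, "fly<vblex><pres><p3><sg>": 9}
gives the noun reading one tenth of the mass and the verb reading nine tenths.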

I have found a .prob file for each language pair, but I don't know what
it does.

---
Longer version:

I want to build a word frequency list based on the OpenSubtitles corpus.
So far, the word frequency lists that I have found count lemmas. I want to
rank lexical units instead.
Here is my work in progress for the first 50k lemmas in French (lemmas,
occurrences and possible lexical units):
https://www.dropbox.com/s/0htj0g3s2b07sqq/LexUnitsFrResult50k.txt?dl=0

The next step is to split the count associated with a lemma that can come
from several lexical units. For example, given
"suis 2178106 être,suivre", I want to divide the "2178106" (roughly)
between the "être" part and the "suivre" part.
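
A rough sketch of that splitting step, assuming each line has the three
whitespace-separated fields shown in the example and that some relative
weight per candidate is available from elsewhere. The function names and the
weights in the usage note are hypothetical. Rounding uses the largest
remainder method so that the parts still add up to the original count.

```python
def parse_line(line: str) -> tuple[str, int, list[str]]:
    """Parse 'suis 2178106 être,suivre' into (form, count, candidates)."""
    form, count, candidates = line.split()
    return form, int(count), candidates.split(",")


def split_count(count: int, weights: dict[str, float]) -> dict[str, int]:
    """Divide an integer count proportionally to the given weights.

    weights is an assumed input (e.g. relative frequencies taken from a
    tagged corpus); it does not have to sum to 1.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    raw = {key: count * w / total for key, w in weights.items()}
    shares = {key: int(value) for key, value in raw.items()}
    # Hand out the units lost to truncation, largest fractional part first.
    leftover = count - sum(shares.values())
    by_remainder = sorted(raw, key=lambda key: raw[key] - shares[key], reverse=True)
    for key in by_remainder[:leftover]:
        shares[key] += 1
    return shares
```

With invented weights such as {"être": 9, "suivre": 1}, split_count applied to
the count parsed from "suis 2178106 être,suivre" assigns most occurrences to
"être" and the rest to "suivre", and the two parts sum to 2178106.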

Is it possible to do that with Apertium? I'm working on French because
it's my mother tongue, but I'm trying to be as generic as possible.

I intend to publish all my scripts on GitHub: https://github.com/jonadem
(not much there yet; I am trying to get what I want working first).

Thank you,

Jonathan Demeyer

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
