On 01.06.2018 00:14, Francis Tyers wrote:
On 2018-05-31 23:36, Grzegorz Kulik wrote:
Okay, I've transferred apertium-szl and apertium-pol-szl to Apertium
on Github.

Great, I've made a couple of changes to the apertium-szl

Great, thanks!


I want to sort the Polish dictionary too, but I see dozens of
errors in the one on your side. Has anybody ever tried to compile it?
There are dozens of duplicated key sequences, and the thousands
of machine-generated words that end with -ość are all generated
incorrectly:

    <e lm="realizacyjność"><i>realizacyjność</i><par n="miłoś/ć__n"/></e>

while it should be:

    <e lm="realizacyjność"><i>realizacyjnoś</i><par n="miłoś/ć__n"/></e>

with no ć between the <i> tags.
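
The correction described above could be applied mechanically. This is only a minimal sketch: it assumes each entry sits on a single line, that the paradigm name is exactly the one quoted above, and that the text is in precomposed (NFC) form. The name fix_entry is hypothetical and not part of any Apertium tool.

```python
import re

# Assumption: the paradigm name is taken verbatim from the entries quoted above.
PAR = "miłoś/ć__n"

# Match a stem inside <i>...</i> that ends in ć, directly followed by the paradigm reference.
ENTRY = re.compile(r'(<i>[^<]*?)ć(</i><par n="' + re.escape(PAR) + r'"/>)')

def fix_entry(line: str) -> str:
    """Drop the trailing ć from the <i> stem, since the paradigm already supplies it."""
    return ENTRY.sub(r"\1\2", line)
```

Entries whose stem already ends in -ś do not match the pattern and pass through unchanged, so running this twice over the same file should be harmless.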

But that's not all. The words after line 24435 (the -ość ones)
don't exist, and they make up around 2/3 of the dictionary. I
understand it was easy for someone to machine-generate them by adding
prefixes and suffixes to actual words, but let me translate some of the
first twenty for you:

nieobżartość - un-fed-up-ness
niepozauranowość - un-outside-uranium-ness
niebiałoramienność - un-white-arm-ness
niejeżowość - un-hedgehog-ness
nieprzysłoneczność - un-at-sun-ness
niewołowatość - un-mule-ness

If you delete those, you're left with a mere 15792 entries. Wouldn't it
be better to just use the dictionary I made manually? It's proven to
work, it's twice as large, and it was built on the Polish monodix
found on the SVN in January 2016; I just got rid of errors and added
entries.

What do you think?

I think that sounds fine to me. There is pol-ces in the staging/ part
but as far as I know there has been no released pair with Polish yet.

Jim, what do you think?

In general, if you're willing to maintain it, I'd say that given
there are no other released pairs yet, you should get priority
to decide what content it has.

At some point I might even try to develop the Polish - Czech pair. Since there are people interested in the Silesian - Czech one, developing the Polish - Czech pair further should be easier, too. But obviously the Silesian pairs are my priority.


Have you calculated the coverage for both dictionaries?

I had never thought about it, so I put together an ad hoc Polish corpus from random Wikipedia articles and followed the steps explained on the wiki. This is what I got:

79.265 % known tokens (543957 unknown, 0 bidix-unknown of total 2623413 tokens)

To check Silesian - Polish coverage I used some texts from the Silesian corpus I'm currently working on. I got:

79.132 % known tokens (37604 unknown, 0 bidix-unknown of total 180197 tokens)
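
As a sanity check on the two figures above, the percentage can be recomputed from the reported token counts. This is only a sketch of the arithmetic (known = total minus unknown), not the script from the wiki; the function name coverage is made up for illustration.

```python
def coverage(unknown: int, total: int) -> float:
    """Percentage of tokens that were recognised, given the unknown and total counts."""
    return 100.0 * (total - unknown) / total

# Token counts copied from the two results quoted above.
polish = coverage(unknown=543957, total=2623413)
silesian = coverage(unknown=37604, total=180197)

print(f"Polish:   {polish:.3f} % known tokens")
print(f"Silesian: {silesian:.3f} % known tokens")
```

Both values round to the percentages reported above (79.265 and 79.132).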

We actually have many more texts, but they use either phonetic Polish orthography or a simplified version of the orthography used in the translator. The reason is that we only agreed on the orthography in 2009, and people are allowed to drop three diacritical letters because 90% of the population don't actually need them. I figured the translator should use the full set of letters, since it's easier to get rid of diacritics than to restore them.
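
To illustrate why dropping diacritics is the easy direction, here is a minimal sketch. The message does not name the three optional letters, so the mapping below (ō, ŏ, ô to plain o) is purely an assumption chosen for illustration, and the name simplify is hypothetical.

```python
# Assumption: placeholder letters only; the three letters that may be
# dropped are not named in the message above.
SIMPLIFY = str.maketrans({
    "ō": "o", "ŏ": "o", "ô": "o",
    "Ō": "O", "Ŏ": "O", "Ô": "O",
})

def simplify(text: str) -> str:
    """Map full-orthography text to the simplified spelling (many-to-one)."""
    return text.translate(SIMPLIFY)

# Several distinct letters collapse to the same one, so the mapping cannot
# be inverted without extra knowledge such as a dictionary.
assert simplify("ō") == simplify("ŏ") == simplify("ô") == "o"
```

Going from full to simplified spelling is a plain character substitution, while the reverse would need word-level knowledge to pick the right letter.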


Another option would be to keep the old apertium-pol in a branch,
and copy yours in as master.

Fran

Okay, I'll wait for you to decide.

Greg

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
