Re: [Apertium-stuff] Polish - Silesian pair

Grzegorz Kulik Fri, 01 Jun 2018 05:07:27 -0700


On 01.06.2018 00:14, Francis Tyers wrote:

On 2018-05-31 23:36, Grzegorz Kulik wrote:

Okay, I've transferred apertium-szl and apertium-pol-szl to Apertium
on Github.


Great, I've made a couple of changes to the apertium-szl


Great, thanks!

I want to sort the Polish dictionary too, but I see dozens of
errors in the one on your side. Has anybody ever tried to compile it?
There are dozens of duplicated key sequences, and all those thousands
of machine-generated words that end with -ość are generated
improperly:

<e lm="realizacyjność"><i>realizacyjność</i><par n="miłoś/ć__n"/></e>


while it should be:

    <e lm="realizacyjność"><i>realizacyjnoś</i><par n="miłoś/ć__n"/></e>

with no ć between the <i> tags.
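
For illustration, here is a minimal Python sketch of how that stem fix could be applied mechanically. It is only a sketch under assumptions: that each entry sits on one line, that the stem is in <i>...</i> directly followed by the paradigm reference shown above, and that ć is stored as a single precomposed character. The name fix_entry is made up for this example; it is not part of any Apertium tool.

```python
import re

# Assumption: the stem is in <i>...</i>, immediately followed by the
# paradigm reference <par n="miłoś/ć__n"/>, and "ć" is one precomposed
# code point (NFC). The paradigm already supplies the final "ć", so a
# "ć" at the end of the stem is redundant.
BAD_STEM = re.compile(r'(<i>[^<]*?)ć(</i>\s*<par\s+n="miłoś/ć__n"/>)')


def fix_entry(line: str) -> str:
    """Drop a trailing 'ć' from the <i> stem; leave everything else as is."""
    return BAD_STEM.sub(r"\1\2", line)


if __name__ == "__main__":
    bad = '<e lm="realizacyjność"><i>realizacyjność</i><par n="miłoś/ć__n"/></e>'
    print(fix_entry(bad))
```

The lm attribute is left untouched on purpose, since the lemma itself does end in ć; entries whose stem is already correct pass through unchanged.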

But not only that. All the words after line 24435 (the -ość ones)
don't exist, and they make up around 2/3 of the dictionary. I
understand it was easy for someone to machine-generate them by adding
prefixes and suffixes to actual words, but let me translate some of the
first twenty for you:

nieobżartość - un-fed-up-ness
niepozauranowość - un-outside-uranium-ness
niebiałoramienność - un-white-arm-ness
niejeżowość - un-hedgehog-ness
nieprzysłoneczność - un-at-sun-ness
niewołowatość - un-mule-ness

If you delete those, you're left with a mere 15792 entries. Wouldn't it
be better to just use the dictionary I made manually? It's proven to
work, it's twice as large, and it was built on the Polish monodix
found on the SVN in January 2016; I just got rid of errors and added
entries.

What do you think?


I think that sounds fine to me. There is pol-ces in the staging/ part
but as far as I know there has been no released pair with Polish yet.

Jim, what do you think?

In general, if you're willing to maintain it, I'd say that given
there are no other released pairs yet, you should get priority
to decide what content it has.

At some point I might even try to develop the Polish - Czech pair. Since
there are people interested in the Silesian - Czech one, developing
the Polish - Czech pair further should be easier, too. But obviously
Silesian pairs are the priority for me.


Have you calculated the coverage for both dictionaries?

Never thought about it, so I put together an ad hoc Polish corpus made
from random Wikipedia articles and followed the steps explained in the
Wiki. This is what I got:

79.265 % known tokens (543957 unknown, 0 bidix-unknown of total 2623413
tokens)

To check Silesian - Polish coverage I used some texts from the Silesian
corpus I'm currently working on. I got:

79.132 % known tokens (37604 unknown, 0 bidix-unknown of total 180197
tokens)
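
As a sanity check, both percentages follow from the counts as (total - unknown) / total. A minimal Python sketch of that arithmetic, using only the numbers reported above (the function name coverage is just for this example, not the name of the Wiki script):

```python
def coverage(unknown: int, total: int) -> float:
    """Percentage of tokens the analyser knows: (total - unknown) / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - unknown) / total


if __name__ == "__main__":
    # Counts taken from the two runs reported above.
    print(f"Polish:   {coverage(543957, 2623413):.3f} %")
    print(f"Silesian: {coverage(37604, 180197):.3f} %")
```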

We actually have many more texts, but they use either a phonetic Polish
orthography or a simplified version of the one that is used in the
translator. The reason is that we agreed on the orthography only in 2009
and people are allowed to drop three diacritical letters because 90% of
the population don't actually need them. I figured the translator should
use the full set of letters since it's easier to get rid of diacritics
than the other way around.


Another option would be to keep the old apertium-pol in a branch,
and copy yours in as master.

Fran


Okay, I'll wait for you to decide.

Greg
