Re: readable POS tags
On 2014-05-07 19:56, Daniel Naber wrote:
> On 2014-05-07 19:07, Marcin Miłkowski wrote:
>
>> unification. I still don't get why German doesn't use it for
>> disambiguation, for example.
>
> Maybe because nobody has seen an urgent need for that yet. I don't work
> that much on the German rules, but I'm generally okay with the way they
> work.
Well, after I implement chunking via disambiguation, unification could
easily mark up nominal phrases composed of the article, adjectives, and
the noun. It's also useful to discard ambiguous readings before
synthesizing a sensible suggestion.
Regards,
Marcin
___
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: readable POS tags
On 2014-05-07 19:07, Marcin Miłkowski wrote:
> unification. I still don't get why German doesn't use it for
> disambiguation, for example.
Maybe because nobody has seen an urgent need for that yet. I don't work
that much on the German rules, but I'm generally okay with the way they
work.
Regards
Daniel
Re: readable POS tags
On 2014-05-07 12:10, Daniel Naber wrote:
> On 2014-04-08 14:44, Daniel Naber wrote:
>
>> I have now added a
>> branch ("readable-pos-tags") for this, simply because the changes are
>> getting so complex. It's still incomplete and buggy.
>
> As you may have noticed, I did some work in this branch. You can see it
> at
> https://github.com/languagetool-org/languagetool/tree/readable-pos-tags
>
> Although it basically works for English and German, the changes have not
> been merged back to the master branch as I'm not happy with them.
> Writing a class that turns the internal POS tags (like "NN") into
> structured POS tags (like "pos=noun, number=singular") isn't very
> complicated, but still quite some work and it's obviously
> language-specific. These classes should be developed by people who
> actually speak the language. I'm not sure if that would actually happen
> so we might have several languages that only support the old POS tags
> for years and I'd like to avoid that.
We might, but that's the general principle for other features such as
unification. I still don't get why German doesn't use it for
disambiguation, for example. I could write up some simple rules to leave
only token readings that agree with each other.
>
> Then there's the general problem that we cannot move all old POS tags to
> the new ones. It's not possible to do automatically, and it's also not
> desirable, as sometimes the old POS tags are much more compact. So we'd
> have two ways to do the same thing, basically forever.
I don't see it as particularly wrong. For languages that use the
Unifier, we have to run regexes multiple times on the same token, and
that slows processing down. With attributes, we could make it much
faster. So this dual route would actually speed up Catalan, French, and
Polish (and maybe other languages as well).
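To make the idea of leaving only readings that agree with each other concrete, here is a minimal Java sketch. It is not the LanguageTool Unifier; the class `AgreementFilterSketch`, its methods, and the representation of a reading as a feature map are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of LanguageTool: each reading of a token
// is modelled as a map from feature name to value.
final class AgreementFilterSketch {

    // True if both readings carry the same value for every listed feature.
    static boolean agree(Map<String, String> a, Map<String, String> b,
                         List<String> featureNames) {
        for (String name : featureNames) {
            String va = a.get(name);
            if (va == null || !va.equals(b.get(name))) {
                return false;
            }
        }
        return true;
    }

    // Keeps only those readings of 'left' that agree with at least one
    // reading of 'right' on the given features.
    static List<Map<String, String>> keepAgreeing(List<Map<String, String>> left,
                                                  List<Map<String, String>> right,
                                                  List<String> featureNames) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : right) {
                if (agree(l, r, featureNames)) {
                    kept.add(l);
                    break; // one agreeing partner is enough
                }
            }
        }
        return kept;
    }
}
```

With separate attributes the comparison is a handful of map lookups per pair of readings; no regular expression has to be evaluated against the tag string.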
Regards,
Marcin
>
> So for now, I will keep the code in that branch and not merge it...
>
> Regards
>Daniel
Re: readable POS tags
On 2014-04-08 14:44, Daniel Naber wrote:
> I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.
As you may have noticed, I did some work in this branch. You can see it
at
https://github.com/languagetool-org/languagetool/tree/readable-pos-tags
Although it basically works for English and German, the changes have not
been merged back to the master branch as I'm not happy with them.
Writing a class that turns the internal POS tags (like "NN") into
structured POS tags (like "pos=noun, number=singular") isn't very
complicated, but still quite some work and it's obviously
language-specific. These classes should be developed by people who
actually speak the language. I'm not sure if that would actually happen
so we might have several languages that only support the old POS tags
for years and I'd like to avoid that.
Then there's the general problem that we cannot move all old POS tags to
the new ones. It's not possible to do automatically, and it's also not
desirable, as sometimes the old POS tags are much more compact. So we'd
have two ways to do the same thing, basically forever.
So for now, I will keep the code in that branch and not merge it...
Regards
Daniel
XSD and namespaces (was: readable POS tags)
On 2014-04-09 18:41, Marcin Miłkowski wrote:
> On 2014-04-09 16:31, Daniel Naber wrote:
>> On 2014-04-09 09:44, Marcin Miłkowski wrote:
>>
>>> So I'm not sure what is the problem. Basically, the number of
>>> elements have to correspond to the number of tokens inside the
>>> element. And that's it.
>>
>> It's working now in the sense that the tests work and one English
>> example rule (CONFUSION_OF_OUR_OUT) uses the new tags.
>>
>> Now how exactly do we map the Penn Tagset to the new structure? Here's
>> the current mapping with a lot of TODOs, let me know if you have an idea
>> how to get it right:
>>
>> https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java
>>
>> Also, have you used namespaces on attributes in XSD? If so, could you
>> provide a small example on how to write the XSD so that
>> en:tense="simple_past"/> can be validated?
>
> No, but Dave Pawson is an XML expert. I guess he will be happy to help.
And here's a tutorial:
http://www.liquid-technologies.com/Tutorials/XmlSchemas/XsdTutorial_04.aspx
Regards,
Marcin
Re: readable POS tags
On 2014-04-09 16:31, Daniel Naber wrote:
> On 2014-04-09 09:44, Marcin Miłkowski wrote:
>
>> So I'm not sure what is the problem. Basically, the number of
>> elements have to correspond to the number of tokens inside the
>> element. And that's it.
>
> It's working now in the sense that the tests work and one English
> example rule (CONFUSION_OF_OUR_OUT) uses the new tags.
>
> Now how exactly do we map the Penn Tagset to the new structure? Here's
> the current mapping with a lot of TODOs, let me know if you have an idea
> how to get it right:
>
> https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java
>
> Also, have you used namespaces on attributes in XSD? If so, could you
> provide a small example on how to write the XSD so that
> en:tense="simple_past"/> can be validated?
No, but Dave Pawson is an XML expert. I guess he will be happy to help.
Regards,
Marcin
Re: readable POS tags
On 2014-04-09 09:44, Marcin Miłkowski wrote:
> So I'm not sure what is the problem. Basically, the number of
> elements have to correspond to the number of tokens inside the
> element. And that's it.
It's working now in the sense that the tests work and one English
example rule (CONFUSION_OF_OUR_OUT) uses the new tags.
Now how exactly do we map the Penn Tagset to the new structure? Here's
the current mapping with a lot of TODOs, let me know if you have an idea
how to get it right:
https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java
Also, have you used namespaces on attributes in XSD? If so, could you
provide a small example on how to write the XSD so that can be validated?
Regards
Daniel
Re: readable POS tags
On 2014-04-08 14:44, Daniel Naber wrote:
> On 2014-04-08 08:43, Marcin Miłkowski wrote:
>
>>> Internally, we now have information like this: postag=VBD, pos=verb,
>>> tense=past (etc.). But the disambiguation only works on the old tag? I
>>> guess I will need to resolve VBD here so the action works on both the
>>> old and the new representation?
>>
>> I think the only thing needed is to parse the tags again, if they are
>> different.
>
> Although I'm not sure if I understood what you meant, I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.
>
> Here's the basic idea of my changes in that branch: class TokenPoS is
> the new structured representation of POS tags. EnglishTagger returns one
> or more TokenPoS for a given traditional POS tag (like NNS). More than
> one will be returned in cases that are ambiguous in the new
> representation, e.g. "walk/VBP" can be person=1|2 number=singular and
> person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
>
> Currently the problem is this (when running the tests):
> Caused by: org.xml.sax.SAXException: English rule error. The number of
> interpretations specified with wd: 5 must be equal to the number of
> matched tokens (1)
>Line: 1525, column: 12.
>
> I roughly understand what the problem is but not yet the solution... any
> help is welcome, also any hints that what I'm doing in that branch might
> be wrong.
Hm, the line 1525 is:
So I'm not sure what is the problem. Basically, the number of
elements have to correspond to the number of tokens inside the
element. And that's it.
I thought that TokenPoS should be initialized after tagging and then any
time the disambiguator makes the change to the AnalyzedToken.posTag. So
whenever the disambiguator changes the values of the token, you need to
make sure that the TokenPoS is up to date by re-running the POS tag
parser. What's the problem with this approach? I assume that we are not
talking about disambiguation rules (yet) that change TokenPoS only. This
is a bit more tricky, as some tags might be more ambiguous than TokenPoS
values, so we'd have to leave those ambiguous tags and prune or change
only TokenPoS values. Luckily, I think only the Penn tagset is so
ambiguous. Structured tagsets (such as German Morphy or Polish one)
should be much easier to interpret.
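A minimal Java sketch of the approach described above, in which the structured view is rebuilt whenever the old tag changes. `SyncedReading` and `SyncDemo` are hypothetical names, not classes from the branch, and the tiny tag parser only covers two tags for the sake of the example.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical token reading, not LanguageTool's AnalyzedToken.
final class SyncedReading {
    private final Function<String, Map<String, String>> tagParser;
    private String posTag;
    private Map<String, String> features;

    SyncedReading(String posTag, Function<String, Map<String, String>> tagParser) {
        this.tagParser = tagParser;
        setPosTag(posTag); // initial parse right after tagging
    }

    // Every change made by a disambiguator goes through this setter, so the
    // structured view is re-parsed and the old information is discarded.
    void setPosTag(String newTag) {
        this.posTag = newTag;
        this.features = tagParser.apply(newTag);
    }

    String getPosTag() {
        return posTag;
    }

    Map<String, String> getFeatures() {
        return features;
    }
}

final class SyncDemo {
    // Illustrative parser for two Penn tags only; a real one is language-specific.
    static Map<String, String> parse(String tag) {
        switch (tag) {
            case "VBD":
                return Map.of("pos", "verb", "tense", "past");
            case "JJ":
                return Map.of("pos", "adjective");
            default:
                return Map.of("pos", "unknown");
        }
    }
}
```

This only covers the direction from old tag to structured features. Rules that change the structured values directly, which the paragraph above calls more tricky, are outside the sketch.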
Regards,
Marcin
Re: readable POS tags
On 2014-04-08 08:43, Marcin Miłkowski wrote:
>> Internally, we now have information like this: postag=VBD, pos=verb,
>> tense=past (etc.). But the disambiguation only works on the old tag? I
>> guess I will need to resolve VBD here so the action works on both the
>> old and the new representation?
>
> I think the only thing needed is to parse the tags again, if they are
> different.
Although I'm not sure if I understood what you meant, I have now added a
branch ("readable-pos-tags") for this, simply because the changes are
getting so complex. It's still incomplete and buggy.
Here's the basic idea of my changes in that branch: class TokenPoS is
the new structured representation of POS tags. EnglishTagger returns one
or more TokenPoS for a given traditional POS tag (like NNS). More than
one will be returned in cases that are ambiguous in the new
representation, e.g. "walk/VBP" can be person=1|2 number=singular and
person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
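The mapping described above could look roughly like the following Java sketch. It is not the code in the readable-pos-tags branch: `TokenPosSketch` and `PennMappingSketch` are hypothetical stand-ins for TokenPoS and the mapping inside EnglishTagger, and only three Penn tags are covered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TokenPoS: an ordered bag of feature=value pairs.
final class TokenPosSketch {
    final Map<String, String> features = new LinkedHashMap<>();

    TokenPosSketch add(String feature, String value) {
        features.put(feature, value);
        return this;
    }

    @Override
    public String toString() {
        return features.toString();
    }
}

final class PennMappingSketch {
    // Returns one structured reading per interpretation of the old tag.
    static List<TokenPosSketch> map(String pennTag) {
        switch (pennTag) {
            case "NN":
                return List.of(new TokenPosSketch()
                        .add("pos", "noun").add("number", "singular"));
            case "NNS":
                return List.of(new TokenPosSketch()
                        .add("pos", "noun").add("number", "plural"));
            case "VBP":
                // Ambiguous in the new representation, so two readings.
                return List.of(
                        new TokenPosSketch().add("pos", "verb")
                                .add("person", "1|2").add("number", "singular"),
                        new TokenPosSketch().add("pos", "verb")
                                .add("person", "1|2|3").add("number", "plural"));
            default:
                // Tags not covered by this sketch.
                return List.of(new TokenPosSketch().add("pos", "unknown"));
        }
    }
}
```

Returning a list keeps the ambiguity of VBP explicit instead of hiding it behind a single negative tag.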
Currently the problem is this (when running the tests):
Caused by: org.xml.sax.SAXException: English rule error. The number of
interpretations specified with wd: 5 must be equal to the number of
matched tokens (1)
Line: 1525, column: 12.
I roughly understand what the problem is but not yet the solution... any
help is welcome, also any hints that what I'm doing in that branch might
be wrong.
Regards
Daniel
Re: readable POS tags
On 2014-04-07 23:01, Daniel Naber wrote:
> On 2014-03-25 09:35, Daniel Naber wrote:
>
>> I've written an overview of how we could use readable POS tags in LT:
>>
>> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> I'm writing a prototype implementation of this for English now. But
> there's one point where I'm stuck. Maybe I'm missing something obvious.
> Everything is fine for grammar.xml: we have new tags but keep the old
> ones, both work. But what about disambiguation.xml? What does it now
> mean to have something like this:
>
>
>
> Internally, we now have information like this: postag=VBD, pos=verb,
> tense=past (etc.). But the disambiguation only works on the old tag? I
> guess I will need to resolve VBD here so the action works on both the
> old and the new representation?
I think the only thing needed is to parse the tags again, if they are
different.
> What if there's an action like this?
> Will I need to expand the 'JJ.?' against all known tags and then apply
> the change to the resolved (new) representation?
>
>
Again, parse the tags again, discard the old info.
Regards,
Marcin
>
> Regards
>Daniel
Re: readable POS tags
On 2014-03-25 09:35, Daniel Naber wrote:
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
I'm writing a prototype implementation of this for English now. But
there's one point where I'm stuck. Maybe I'm missing something obvious.
Everything is fine for grammar.xml: we have new tags but keep the old
ones, both work. But what about disambiguation.xml? What does it now
mean to have something like this:
Internally, we now have information like this: postag=VBD, pos=verb,
tense=past (etc.). But the disambiguation only works on the old tag? I
guess I will need to resolve VBD here so the action works on both the
old and the new representation?
What if there's an action like this?
Will I need to expand the 'JJ.?' against all known tags and then apply
the change to the resolved (new) representation?
Regards
Daniel
Re: readable POS tags
On 2014-03-26 17:49, Marcin Miłkowski wrote:
> No, that would be horrible, as this is not an improvement. The problem
> is not that tags are cryptic and short;
That's also a problem, but not so much for power users and for everybody
else we will be able to solve that in the user interface (i.e. rule
editor).
> So for the word tagged as VBP we could have
>
>
>
> or
>
>
That makes sense.
Regards
Daniel
Re: readable POS tags
On 2014-03-26 17:06, Dave Pawson wrote:
> On 26 March 2014 12:51, Daniel Naber wrote:
>
>> Any ideas on how the VBP tag (in English) might fit into this approach,
>> i.e. "not 3rd person singular"? Will we need to introduce a tag like
>> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
>>
>> Internally, it could be expanded to mean:
>> [{pos=verb, person=1|2, number=singular, tense=preset},
>>{pos=verb, person=1|2|3, number=plural, tense=preset}]
>
>
> I have a natural dislike for negatives, so person=1|2 IMHO is the better
> option?
> pos="1or2singular" perhaps?
>
> for {pos=verb, person=1|2|3, number=plural, tense=preset}] is there
> redundancy?
> If 1,2 or 3 are included it is not necessary?
> pos="plural,present" (is preset a typo?)
> {pos=verb, number=plural, tense=preset}]
>
> Is there no word for singular OR plural?
Fortunately, no :)
Anyway, see my reply to Daniel. I don't think "Not3rdblablah" is useful
at all.
Regards,
Marcin
Re: readable POS tags
On 2014-03-26 15:20, Mike Unwalla wrote:
> I agree that backward compatibility is important. Without backward
> compatibility, the proposed change means that the content of disambiguation
> files and grammar files must be changed. That is a huge task.
>
> Even if you develop a utility that lets people convert files to the new
> format, there remains a problem of the conversion of non-standard postags.
> For example, 'Adding only POS tags or tokens' shows how to add a
> non-standard postag 'UP'
> (http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
> be the effect of the proposed change on non-standard postags?
I am afraid that parsing these new tags would be difficult, so we would
need to define new attributes and values in the disambiguator. It would
also be difficult to specify those in the rules: rules are written in
static XML, so they cannot use a construct that is defined outside the
schema, and we'd need to modify the schema. But as we would retain the
old postag interface (I think this is a must!), this would not be a
problem.
Regards,
Marcin
>
> Regards,
>
> Mike Unwalla
> Contact: www.techscribe.co.uk/techw/contact.htm
>
> -Original Message-
> From: Dominique Pellé [mailto:[email protected]]
>
> In any case, we need to preserve rule backward compatibility.
> I cannot imagine having to change manually all rules in
> all languages, at least not manually. It would be a lot of
> error-prone work.
Re: readable POS tags
On 2014-03-26 13:51, Daniel Naber wrote:
> On 2014-03-25 14:24, Marcin Miłkowski wrote:
>
>>> So instead of just adding the POS tag we get from Morfologik to our
>>> AnalyzedToken object as a string, we interpret it and store something
>>> like pos = preposition, case = accusative. Is it that what you mean?
>>
>> Exactly.
>
> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
No, that would be horrible, as this is not an improvement. The problem
is not that tags are cryptic and short; it is that they do not make
features easily available separately.
My use case for readable pos tags is also speed and simplicity for
unification (rules that use agreement between words). It is simply
faster to specify features by citing appropriate attributes that can be
processed once instead of running a regexp every time the sentence is
processed in a unification rule. For Catalan, Polish, and French this
will be a huge time improvement.
Now, for this to work the attributes should be specified just like they
are in Corpus Query Language (CQL).
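The speed argument above can be illustrated with a small Java sketch. The colon-separated tag layout ("pos:number:case") is invented for the example and is not the real Polish, Catalan, or French tagset; `FeatureLookupSketch` is a hypothetical class, not LanguageTool code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class FeatureLookupSketch {
    // Old style: a regex is evaluated against the tag string on every check.
    private static final Pattern PLURAL = Pattern.compile(".*:pl(:.*)?");

    static boolean isPluralByRegex(String tag) {
        return PLURAL.matcher(tag).matches();
    }

    // New style: parse the (invented) "pos:number:case" layout once ...
    static Map<String, String> parseOnce(String tag) {
        String[] names = {"pos", "number", "case"};
        String[] parts = tag.split(":");
        Map<String, String> features = new HashMap<>();
        for (int i = 0; i < parts.length && i < names.length; i++) {
            features.put(names[i], parts[i]);
        }
        return features;
    }

    // ... then every later check is a plain map lookup.
    static boolean isPluralByFeature(Map<String, String> features) {
        return "pl".equals(features.get("number"));
    }
}
```

Both checks give the same answer for a tag; the difference is that the parsed map can be reused by every rule that looks at the token, while the regex has to be run again each time. No timing figures are claimed here.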
>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=preset},
>{pos=verb, person=1|2|3, number=plural, tense=preset}]
So for the word tagged as VBP we could have
or
Both would match a word with VBP. (Note that the disambiguator could
even remove one of the interpretations to make it clear that this is a
plural use of the token!)
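How a pattern made of separate attributes might match a word tagged VBP can be sketched in Java as follows. The class `AttributeMatchSketch` and the convention that a value such as "1|2" lists alternatives are assumptions taken from the notation used in this thread, not from LanguageTool's matching code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class AttributeMatchSketch {
    // A reading value such as "1|2" is treated as a set of alternatives.
    static boolean valueMatches(String readingValue, String wanted) {
        return Arrays.asList(readingValue.split("\\|")).contains(wanted);
    }

    // A reading matches if every attribute requested by the pattern is
    // present in the reading and compatible with it.
    static boolean readingMatches(Map<String, String> reading,
                                  Map<String, String> pattern) {
        for (Map.Entry<String, String> e : pattern.entrySet()) {
            String value = reading.get(e.getKey());
            if (value == null || !valueMatches(value, e.getValue())) {
                return false;
            }
        }
        return true;
    }

    // A token matches if at least one of its readings matches.
    static boolean tokenMatches(List<Map<String, String>> readings,
                                Map<String, String> pattern) {
        return readings.stream().anyMatch(r -> readingMatches(r, pattern));
    }
}
```

If a disambiguator removes the singular reading, the same pattern keeps working and simply has fewer readings to test.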
Above, I used a mixture of attributes without namespaces (these would be
universal for all languages) and ones with namespaces, like tense, which
is not present in all languages. We can look at proposed Universal
Tagset to find universal categories:
https://code.google.com/p/universal-pos-tags/
Note also that one could write:
And this would be equivalent to:
But possibly a lot faster. The new syntax comes out also much easier to
read, and would be equivalent to CQL query:
[pos="verb"]
Similarly for words in comparative degree, where you have to use now
(for English):
You could simply say:
Basically, by making attributes separate we could have a much easier way
to write complex rules without problems as to how to specify POS tags. I
consider myself to be a power user but with a complex Polish tagset it
is sometimes really difficult to specify the features I want using
regexes: the tagset itself creates pretty complex and lengthy strings
and a lot of time is needed to make sure that the regex matches.
Regards,
Marcin
>
> Regards
>Daniel
Re: readable POS tags
On 26 March 2014 12:51, Daniel Naber wrote:
> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=preset},
> {pos=verb, person=1|2|3, number=plural, tense=preset}]
I have a natural dislike for negatives, so person=1|2 IMHO is the better option?
pos="1or2singular" perhaps?
for {pos=verb, person=1|2|3, number=plural, tense=preset}] is there redundancy?
If 1,2 or 3 are included it is not necessary?
pos="plural,present" (is preset a typo?)
{pos=verb, number=plural, tense=preset}]
Is there no word for singular OR plural?
regards
--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
RE: readable POS tags
I agree that backward compatibility is important. Without backward
compatibility, the proposed change means that the content of
disambiguation files and grammar files must be changed. That is a huge
task.
Even if you develop a utility that lets people convert files to the new
format, there remains a problem of the conversion of non-standard
postags. For example, 'Adding only POS tags or tokens' shows how to add a
non-standard postag 'UP'
(http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
be the effect of the proposed change on non-standard postags?
Regards,
Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm
-Original Message-
From: Dominique Pellé [mailto:[email protected]]
In any case, we need to preserve rule backward compatibility.
I cannot imagine having to change manually all rules in
all languages, at least not manually. It would be a lot of
error-prone work.
Re: readable POS tags
On 2014-03-25 14:24, Marcin Miłkowski wrote:
>> So instead of just adding the POS tag we get from Morfologik to our
>> AnalyzedToken object as a string, we interpret it and store something
>> like pos = preposition, case = accusative. Is it that what you mean?
>
> Exactly.
Any ideas on how the VBP tag (in English) might fit into this approach,
i.e. "not 3rd person singular"? Will we need to introduce a tag like
"pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
Internally, it could be expanded to mean:
[{pos=verb, person=1|2, number=singular, tense=preset},
{pos=verb, person=1|2|3, number=plural, tense=preset}]
Regards
Daniel
Re: readable POS tags
About the difficulties in learning LT:
I started out with LT long ago as an IT pro and language hobbyist. The
help of Daniel and Marcin at that time made it possible to get started.
Starting with XML and plain word rules was not that difficult. Trial and
error, and some help from this list, was enough.
Complexity rose when postags had to be made. That takes a lot of language
knowledge, but the most complex part is getting the software in place to
generate the lists.
In general, there is a lot of focus on the programming tool level (EDI,
GIT etc.), which is too complex for non-programmers.
Making things 'wysiwyg' and web-based is a good direction. (Example:
translate the site content to any language from the site itself? Edit
postags and words for any language online, with auto generation of the
dictionaries.)
Maybe there should be a 'programming' focus, a 'language' focus, and a
'rule' focus area?
Just my 2 cents.
Ruud
Re: readable POS tags
On 26 March 2014 10:49, Daniel Naber wrote:
> On 2014-03-25 21:59, Dominique Pellé wrote:
>
>> power compared to using regexp. Power users know regexp
>> well as they are used in many programs so they don't have to
>> learn something new. Power users also like the conciseness
>> of regexp.
>
> As you said, the old way of matching will still be there for
> compatibility reasons. I see your point about power users. The thing is,
> how does one become a power user? As it is now, someone might want to
> contribute rules without having technical knowledge. They would need to
> learn XML, regular expressions and the LT matching logic all at once.
> This is quite a barrier and I guess it's one of the reasons we don't get
> contributions for the several languages that are not maintained. That's
> why I think readable tags and the online editor are important - if
> somebody doesn't contribute at all, they won't ever become a power user.
Very good logic!
How to encourage new users without insulting them?
Newbies start here, power users use this? Totally wrong.
How about plain/normal rule generation and 'advanced' rule generation?
I agree both are needed.
regards
--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
Re: readable POS tags
On 2014-03-25 21:59, Dominique Pellé wrote:
> power compared to using regexp. Power users know regexp
> well as they are used in many programs so they don't have to
> learn something new. Power users also like the conciseness
> of regexp.

As you said, the old way of matching will still be there for compatibility reasons. I see your point about power users. The thing is, how does one become a power user? As it is now, someone might want to contribute rules without having technical knowledge. They would need to learn XML, regular expressions and the LT matching logic all at once. This is quite a barrier and I guess it's one of the reasons we don't get contributions for the several languages that are not maintained. That's why I think readable tags and the online editor are important - if somebody doesn't contribute at all, they won't ever become a power user.

Regards
Daniel
Re: readable POS tags
Daniel Naber wrote:
> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.
>
> Regards
> Daniel

I'm not convinced it will simplify things. In fact, I suspect it would make rules more obscure and more verbose, and possibly lose power compared to using regexp. Power users know regexp well as they are used in many programs so they don't have to learn something new. Power users also like the conciseness of regexp.

Anybody who wants to seriously contribute to LT needs to understand regexp anyway (not just for POS tags). And knowing regexp is useful for plenty of other programs. Good documentation of the POS tags for each language should suffice in my opinion. POS tags are also very different in each language, and for good reasons, as the grammar of the languages can be quite different.

In any case, we need to preserve rule backward compatibility. I cannot imagine having to change all rules in all languages, at least not manually. It would be a lot of error-prone work.

On the other hand, rule matching may be faster without regexp.

Regards
Dominique
Re: readable POS tags
On 2014-03-25 13:29, Daniel Naber wrote:
> On 2014-03-25 11:07, Marcin Miłkowski wrote:
>
>> For all I can see, no HashMaps are required at all, just a consistent
>> way of understanding the values in class members.
> So instead of just adding the POS tag we get from Morfologik to our
> AnalyzedToken object as a string, we interpret it and store something
> like pos = preposition, case = accusative. Is that what you mean?

Exactly. We pay the computational price just once, while parsing the tag, and given that most tags are pretty nicely structured (except Penn!), we could parse them very quickly.

Yet I would still store the string as well: we use it for the synthesizer, and changing the synthesizer is not as trivial as parsing tags (we might, of course, recreate the tag from the keys and values, but this is just additional computational overhead). Also, some rules may be hard to express without regexes on POS tags (we will still need regexes for a disjunction of several different POS tags, I'm afraid, but these regexes will be quite rare).

> Then indeed 'mapping' isn't a good term for that. We could make all keys
> and values enums, so we have type safety and don't have to deal with
> strings. Do we need type safety so that you cannot use a Polish-only
> value for German even in Java code? I think that may not be needed.

As long as it's enforced in XML namespaces (which should be doable), we should be fine. Developers should know what they're doing. LOL ;)

Regards,
Marcin
Re: readable POS tags
On 2014-03-25 11:07, Marcin Miłkowski wrote:
> For all I can see, no HashMaps are required at all, just a consistent
> way of understanding the values in class members.

So instead of just adding the POS tag we get from Morfologik to our AnalyzedToken object as a string, we interpret it and store something like pos = preposition, case = accusative. Is that what you mean?

Then indeed 'mapping' isn't a good term for that. We could make all keys and values enums, so we have type safety and don't have to deal with strings. Do we need type safety so that you cannot use a Polish-only value for German even in Java code? I think that may not be needed.

Regards
Daniel
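For illustration, the enum idea from the message above could look roughly like the following. This is only a sketch under assumptions: the names Pos, GrammaticalCase and TypedReading are hypothetical and are not part of LanguageTool or Morfologik, and the value lists are deliberately incomplete. It uses java.util.EnumSet, which the JDK represents internally as a bit vector, so membership checks stay cheap.

```java
import java.util.EnumSet;

// Hypothetical value enums; a real tagset would define many more entries.
enum Pos { PREPOSITION, ADJECTIVE, NOUN }

enum GrammaticalCase { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE }

/** Hypothetical typed reading: no strings involved after construction. */
final class TypedReading {
    private final Pos pos;
    private final EnumSet<GrammaticalCase> cases;

    TypedReading(Pos pos, GrammaticalCase... cases) {
        this.pos = pos;
        // Start from an empty set so that a reading without any case is allowed.
        this.cases = EnumSet.noneOf(GrammaticalCase.class);
        for (GrammaticalCase c : cases) {
            this.cases.add(c);
        }
    }

    Pos getPos() {
        return pos;
    }

    /** True if the given case is one of the (possibly ambiguous) readings. */
    boolean hasCase(GrammaticalCase c) {
        return cases.contains(c);
    }
}
```

With this shape, a misspelled value such as GrammaticalCase.ACUSATIVE fails at compile time instead of silently never matching. Restricting values per language would need separate enums or a check elsewhere; the sketch does not attempt that.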
Re: readable POS tags
On 25 March 2014 08:35, Daniel Naber wrote:
> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.

A glossary would help, but perhaps only if this document is meant for non-grammarians? E.g. in "instead of DET, use: determiner", the term is not understood.

Options for tags: http://en.wikipedia.org/wiki/Part-of-speech_tagging or http://www.cis.upenn.edu/~treebank/, which is said to be more complete than most.

HTH

--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
Re: readable POS tags
On 2014-03-25 09:35, Daniel Naber wrote:
> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.

I think that we should not change the existing POS tags, as there are features in some languages that are not found in others. For example, Polish verbs have perfective or imperfective aspect, and there are also reflexive verbs and partially nonreflexive verbs, special agglutinates etc. I don't think it will be easy to find a superset of all possible features needed, even by using ISOcat, also because I introduced some helper POS tags for rules myself.

For this reason, I think that values and keys should be configurable per tagset. Also, if someone uses a feature for Polish in a grammar file for English, it should be automatically disallowed. We can enforce this configurability using XML namespaces.

At the same time, I don't think that mapping is ever required. What we need is one-time parsing of POS tags into key-value pairs, which basically boils down to storing some values in class members. Then a standard getter would be enough, and that is really computationally cheap.

Let me give an example. This is a POS tag for a preposition that requires accusative:

  prep:acc

We would store the following values:

  pos = preposition
  case = accusative

A slightly more complex problem is that for some tags, we get alternative readings (due to syntactic ambiguity), so we might have an adjective that shares its form in accusative and nominative. One very easy way to deal with this is to use simple binary operations on constants (think of cases as bit flags). So a very easy binary operation would be enough to check whether the values are there.
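A minimal sketch of the one-time parsing and the bit-flag check described above might look like this. Everything here is an assumption made for illustration: the class name ParsedTag, the constants, and the "adj:nom.acc" layout for an ambiguous reading (alternative cases separated by dots) are hypothetical; only "prep:acc" comes from the example in the message.

```java
/** Hypothetical value object: a POS tag parsed once into plain fields. */
final class ParsedTag {
    // Grammatical cases as bit flags, so ambiguous readings can be combined with OR.
    static final int NOMINATIVE = 1;
    static final int GENITIVE   = 1 << 1;
    static final int DATIVE     = 1 << 2;
    static final int ACCUSATIVE = 1 << 3;

    private final String pos;
    private final int cases;

    private ParsedTag(String pos, int cases) {
        this.pos = pos;
        this.cases = cases;
    }

    /** Parses "prep:acc" or, for an ambiguous form, "adj:nom.acc" (assumed layout). */
    static ParsedTag parse(String tag) {
        String[] parts = tag.split(":");
        int cases = 0;
        if (parts.length > 1) {
            for (String abbreviation : parts[1].split("\\.")) {
                cases |= caseFlag(abbreviation);
            }
        }
        return new ParsedTag(expandPos(parts[0]), cases);
    }

    private static String expandPos(String abbreviation) {
        switch (abbreviation) {
            case "prep": return "preposition";
            case "adj":  return "adjective";
            default:     return abbreviation; // unknown: keep the abbreviation as-is
        }
    }

    private static int caseFlag(String abbreviation) {
        switch (abbreviation) {
            case "nom": return NOMINATIVE;
            case "gen": return GENITIVE;
            case "dat": return DATIVE;
            case "acc": return ACCUSATIVE;
            default:    return 0; // unknown values set no flag
        }
    }

    String getPos() {
        return pos;
    }

    /** True if every case in the mask is among the stored readings. */
    boolean hasCases(int mask) {
        return (cases & mask) == mask;
    }
}
```

Under these assumptions, ParsedTag.parse("adj:nom.acc").hasCases(ParsedTag.NOMINATIVE | ParsedTag.ACCUSATIVE) is true, while ParsedTag.parse("prep:acc").hasCases(ParsedTag.NOMINATIVE) is false. The string splitting happens once per tag; every later check is a single AND and comparison on an int.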
For all I can see, no HashMaps are required at all, just a consistent way of understanding the values in class members.

Regards,
Marcin
readable POS tags
Hi,

I've written an overview of how we could use readable POS tags in LT:

http://wiki.languagetool.org/readable-part-of-speech-tags

The core part however - what these new POS tags actually look like - is still missing. Any input on the overview and ideas about that core part is welcome.

Regards
Daniel
