Re: readable POS tags

2014-05-07 Thread Daniel Naber
On 2014-04-08 14:44, Daniel Naber wrote:

 I have now added a
 branch (readable-pos-tags) for this, simply because the changes are
 getting so complex. It's still incomplete and buggy.

As you may have noticed, I did some work in this branch. You can see it 
at
https://github.com/languagetool-org/languagetool/tree/readable-pos-tags

Although it basically works for English and German, the changes have not 
been merged back to the master branch as I'm not happy with them. 
Writing a class that turns the internal POS tags (like NN) into 
structured POS tags (like pos=noun, number=singular) isn't very 
complicated, but still quite some work and it's obviously 
language-specific. These classes should be developed by people who 
actually speak the language. I'm not sure if that would actually happen 
so we might have several languages that only support the old POS tags 
for years and I'd like to avoid that.
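
A minimal sketch of such a class, in Java since LT is written in Java. The class name SimpleTagMapper, the tiny tag table, and the empty-map fallback are assumptions for illustration and are not code from the branch. It also shows the concern above: any tag nobody has mapped yet only remains reachable through the old POS tag.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the implementation in the readable-pos-tags branch.
class SimpleTagMapper {

    // Only tags that somebody has mapped so far appear in this table.
    private static final Map<String, Map<String, String>> KNOWN = new LinkedHashMap<>();
    static {
        KNOWN.put("NN", features("pos", "noun", "number", "singular"));
        KNOWN.put("NNS", features("pos", "noun", "number", "plural"));
    }

    // Builds an immutable feature map from alternating key/value arguments.
    private static Map<String, String> features(String... keyValues) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            map.put(keyValues[i], keyValues[i + 1]);
        }
        return Collections.unmodifiableMap(map);
    }

    // Returns the structured reading, or an empty map if the tag has no
    // mapping yet (callers can then only match on the old tag).
    static Map<String, String> toStructured(String internalTag) {
        Map<String, String> result = KNOWN.get(internalTag);
        return result != null ? result : Collections.<String, String>emptyMap();
    }
}
```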

Then there's the general problem that we cannot move all old POS tags to 
the new ones. It's not possible to do automatically, and it's also not 
desirable, as sometimes the old POS tags are much more compact. So we'd 
have two ways to do the same thing, basically forever.

So for now, I will keep the code in that branch and not merge it...

Regards
  Daniel


--
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
* 3 signs your SCM is hindering your productivity
* Requirements for releasing software faster
* Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: readable POS tags

2014-05-07 Thread Daniel Naber
On 2014-05-07 19:07, Marcin Miłkowski wrote:

 unification. I still don't get why German doesn't use it for
 disambiguation, for example.

Maybe because nobody has seen an urgent need for that yet. I don't work 
that much on the German rules, but I'm generally okay with the way they 
work.

Regards
  Daniel




XSD and namespaces (was: readable POS tags)

2014-04-10 Thread Marcin Miłkowski
W dniu 2014-04-09 18:41, Marcin Miłkowski pisze:
 W dniu 2014-04-09 16:31, Daniel Naber pisze:
 On 2014-04-09 09:44, Marcin Miłkowski wrote:

 So I'm not sure what the problem is. Basically, the number of wd
 elements has to correspond to the number of tokens inside the marker
 element. And that's it.

 It's working now in the sense that the tests work and one English
 example rule (CONFUSION_OF_OUR_OUT) uses the new tags.

 Now how exactly do we map the Penn Tagset to the new structure? Here's
 the current mapping with a lot of TODOs, let me know if you have an idea
 how to get it right:

 https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java

 Also, have you used namespaces on attributes in XSD? If so, could you
 provide a small example on how to write the XSD so that
 <token en:tense="simple_past"/> can be validated?

 No, but Dave Pawson is an XML expert. I guess he will be happy to help.

And here's a tutorial:

http://www.liquid-technologies.com/Tutorials/XmlSchemas/XsdTutorial_04.aspx

Regards,
Marcin

 Regards,
 Marcin






Re: readable POS tags

2014-04-08 Thread Marcin Miłkowski
W dniu 2014-04-07 23:01, Daniel Naber pisze:
 On 2014-03-25 09:35, Daniel Naber wrote:

 I've written an overview of how we could use readable POS tags in LT:

 http://wiki.languagetool.org/readable-part-of-speech-tags
 I'm writing a prototypical implementation on this for English now. But
 there's one point where I'm stuck. Maybe I'm missing something obvious.
 Everything is fine for grammar.xml: we have new tags but keep the old
 ones, both work. But what about disambiguation.xml? What does it now
 mean to have something like this:

 <disambig postag="VBD"/>

 Internally, we now have information like this: postag=VBD, pos=verb,
 tense=past (etc.). But the disambiguation only works on the old tag? I
 guess I will need to resolve VBD here so the action works on both the
 old and the new representation?

I think the only thing needed is to parse the tags again, if they are 
different.

 What if there's an action like this?
 Will I need to expand the 'JJ.?' against all known tags and then apply
 the change to the resolved (new) representation?

 <disambig action="filter" postag="JJ.?"></disambig>

Again, parse the tags again and discard the old info.
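
One way to picture "parse again, discard the old info" is a token that derives its structured features from the old tag whenever that tag changes. This is only a sketch: the class ReparsingToken and its two-entry parse table are assumptions, and only the VBD reading (pos=verb, tense=past) comes from this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the structured features are always derived from the old tag.
class ReparsingToken {
    private String posTag;
    private Map<String, String> features = Collections.emptyMap();

    ReparsingToken(String posTag) {
        setPosTag(posTag);
    }

    // Called by a disambiguation action; re-parses only if the tag really changed.
    void setPosTag(String newTag) {
        if (!newTag.equals(posTag)) {
            posTag = newTag;
            features = parse(newTag); // the previous structured info is discarded
        }
    }

    String getPosTag() {
        return posTag;
    }

    Map<String, String> getFeatures() {
        return features;
    }

    // Tiny illustrative table; a real tagger would cover the whole tagset.
    private static Map<String, String> parse(String tag) {
        Map<String, String> map = new HashMap<>();
        if (tag.equals("VBD")) {
            map.put("pos", "verb");
            map.put("tense", "past");
        } else if (tag.equals("JJ")) {
            map.put("pos", "adjective");
        }
        return map;
    }
}
```

With this shape the disambiguator keeps working on the old tag only, and the new representation can never get out of sync with it.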

Regards,
Marcin


 Regards
Daniel








Re: readable POS tags

2014-04-08 Thread Daniel Naber
On 2014-04-08 08:43, Marcin Miłkowski wrote:

 Internally, we now have information like this: postag=VBD, pos=verb,
 tense=past (etc.). But the disambiguation only works on the old tag? I
 guess I will need to resolve VBD here so the action works on both the
 old and the new representation?
 
 I think the only thing needed is to parse the tags again, if they are
 different.

Although I'm not sure if I understood what you meant, I have now added a 
branch (readable-pos-tags) for this, simply because the changes are 
getting so complex. It's still incomplete and buggy.

Here's the basic idea of my changes in that branch: class TokenPoS is 
the new structured representation of POS tags. EnglishTagger returns one 
or more TokenPoS for a given traditional POS tag (like NNS). More than 
one is returned in cases that are ambiguous in the new 
representation, e.g. walk/VBP can be person=1|2 number=singular or 
person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
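
Roughly, the one-tag-to-several-readings idea looks like the following. This is a simplified stand-in and not the code in the branch: SketchReading and SketchEnglishTagger are made-up names, and only the VBP case is taken from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a structured reading; not the TokenPoS class itself.
class SketchReading {
    final String pos;
    final List<Integer> persons;
    final String number;

    SketchReading(String pos, List<Integer> persons, String number) {
        this.pos = pos;
        this.persons = persons;
        this.number = number;
    }
}

class SketchEnglishTagger {

    // Returns one reading per interpretation of the traditional tag.
    static List<SketchReading> readingsFor(String tag) {
        switch (tag) {
            case "VBP":
                // "not 3rd person singular" needs two readings in the new representation
                return Arrays.asList(
                    new SketchReading("verb", Arrays.asList(1, 2), "singular"),
                    new SketchReading("verb", Arrays.asList(1, 2, 3), "plural"));
            case "VBZ":
                return Collections.singletonList(
                    new SketchReading("verb", Collections.singletonList(3), "singular"));
            default:
                // tags without a mapping yield no structured reading in this sketch
                return Collections.emptyList();
        }
    }
}
```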

Currently the problem is this (when running the tests):
Caused by: org.xml.sax.SAXException: English rule error. The number of 
interpretations specified with wd: 5 must be equal to the number of 
matched tokens (1)
  Line: 1525, column: 12.

I roughly understand what the problem is but not yet the solution... any 
help is welcome, also any hints that what I'm doing in that branch might 
be wrong.

Regards
  Daniel




Re: readable POS tags

2014-04-07 Thread Daniel Naber
On 2014-03-25 09:35, Daniel Naber wrote:

 I've written an overview of how we could use readable POS tags in LT:
 
 http://wiki.languagetool.org/readable-part-of-speech-tags

I'm writing a prototypical implementation on this for English now. But 
there's one point where I'm stuck. Maybe I'm missing something obvious. 
Everything is fine for grammar.xml: we have new tags but keep the old 
ones, both work. But what about disambiguation.xml? What does it now 
mean to have something like this:

<disambig postag="VBD"/>

Internally, we now have information like this: postag=VBD, pos=verb, 
tense=past (etc.). But the disambiguation only works on the old tag? I 
guess I will need to resolve VBD here so the action works on both the 
old and the new representation? What if there's an action like this? 
Will I need to expand the 'JJ.?' against all known tags and then apply 
the change to the resolved (new) representation?

<disambig action="filter" postag="JJ.?"></disambig>
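
To make the "expand against all known tags" option concrete, here is a small sketch. The class TagExpander is hypothetical, the list of known tags would have to come from the tagger dictionary, and it assumes the postag regexp must match the whole tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: resolves an old-style POS tag regexp to concrete tags.
class TagExpander {

    // Returns every known tag that the regexp matches completely, in input order.
    static List<String> expand(String tagRegex, List<String> knownTags) {
        Pattern pattern = Pattern.compile(tagRegex);
        List<String> matching = new ArrayList<>();
        for (String tag : knownTags) {
            if (pattern.matcher(tag).matches()) {
                matching.add(tag);
            }
        }
        return matching;
    }
}
```

For the known tags JJ, JJR, JJS, NN and VBD, expanding 'JJ.?' would give JJ, JJR and JJS; the action could then be applied to the structured readings of exactly those tags.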

Regards
  Daniel




Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-25 21:59, Dominique Pellé wrote:

 power compared to using regexp.  Power users know regexp
 well as they are used in many programs so they don't have to
 learn something new. Power users also like the conciseness
 of regexp.

As you said, the old way of matching will still be there for 
compatibility reasons. I see your point about power users. The thing is, 
how does one become a power user? As it is now, someone might want to 
contribute rules without having technical knowledge. They would need to 
learn XML, regular expressions and the LT matching logic all at once. 
This is quite a barrier and I guess it's one of the reasons we don't get 
contributions for the several languages that are not maintained. That's 
why I think readable tags and the online editor are important - if 
somebody doesn't contribute at all, they won't ever become a power user.

Regards
  Daniel




Re: readable POS tags

2014-03-26 Thread Dave Pawson
On 26 March 2014 10:49, Daniel Naber daniel.na...@languagetool.org wrote:
 On 2014-03-25 21:59, Dominique Pellé wrote:

 power compared to using regexp.  Power users know regexp
 well as they are used in many programs so they don't have to
 learn something new. Power users also like the conciseness
 of regexp.

 As you said, the old way of matching will still be there for
 compatibility reasons. I see your point about power users. The thing is,
 how does one become a power user? As it is now, someone might want to
 contribute rules without having technical knowledge. They would need to
 learn XML, regular expressions and the LT matching logic all at once.
 This is quite a barrier and I guess it's one of the reasons we don't get
 contributions for the several languages that are not maintained. That's
 why I think readable tags and the online editor are important - if
 somebody doesn't contribute at all, they won't ever become a power user.

Very good logic!
  How to encourage new users without insulting them?
  Newbies start here,
  Power users use this?
Totally wrong.
   How about Plain/normal rule generation and  'advanced' rule generation?
I agree both are needed.
regards




-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-25 14:24, Marcin Miłkowski wrote:

 So instead of just adding the POS tag we get from Morfologik to our
 AnalyzedToken object as a string, we interpret it and store something
 like pos = preposition, case = accusative. Is that what you mean?
 
 Exactly.

Any ideas on how the VBP tag (in English) might fit into this approach, 
i.e. not 3rd person singular? Will we need to introduce a tag like 
pos = Not3rdPsSgVerb? That doesn't seem elegant but keeps it short.

Internally, it could be expanded to mean:
[{pos=verb, person=1|2, number=singular, tense=present},
  {pos=verb, person=1|2|3, number=plural, tense=present}]

Regards
  Daniel




RE: readable POS tags

2014-03-26 Thread Mike Unwalla
I agree that backward compatibility is important. Without backward
compatibility, the proposed change means that the content of disambiguation
files and grammar files must be changed. That is a huge task.

Even if you develop a utility that lets people convert files to the new
format, there remains a problem of the conversion of non-standard postags.
For example, 'Adding only POS tags or tokens' shows how to add a
non-standard postag 'UP'
(http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
be the effect of the proposed change on non-standard postags?

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-----Original Message-----
From: Dominique Pellé [mailto:dominique.pe...@gmail.com] 
<snip>

In any case, we need to preserve rule backward compatibility.
I cannot imagine having to change all rules in
all languages, at least not manually. It would be a lot of
error-prone work.
<snip>




Re: readable POS tags

2014-03-26 Thread Marcin Miłkowski
W dniu 2014-03-26 13:51, Daniel Naber pisze:
 On 2014-03-25 14:24, Marcin Miłkowski wrote:

 So instead of just adding the POS tag we get from Morfologik to our
 AnalyzedToken object as a string, we interpret it and store something
 like pos = preposition, case = accusative. Is that what you mean?

 Exactly.

 Any ideas on how the VBP tag (in English) might fit into this approach,
 i.e. not 3rd person singular? Will we need to introduce a tag like
 pos = Not3rdPsSgVerb? That doesn't seem elegant but keeps it short.

No, that would be horrible, as this is not an improvement. The problem 
is not that tags are cryptic and short; it is that they do not make 
features easily available separately.

My use case for readable pos tags is also speed and simplicity for 
unification (rules that use agreement between words). It is simply 
faster to specify features by citing appropriate attributes that can be 
processed once instead of running a regexp every time the sentence is 
processed in a unification rule. For Catalan, Polish, and French this 
will be a huge time improvement.
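
As a rough illustration of why this can be cheaper: once the features are parsed into sets, an agreement check is a couple of set intersections and no regexp has to run. The enum and class names below are assumptions made for this sketch and are not LT's unifier API.

```java
import java.util.EnumSet;

// Hypothetical feature values; real languages need more categories.
enum SketchNumber { SINGULAR, PLURAL }
enum SketchGender { MASCULINE, FEMININE, NEUTER }

// Features of one token, parsed once when the token is tagged.
class AgreementFeatures {
    final EnumSet<SketchNumber> numbers;
    final EnumSet<SketchGender> genders;

    AgreementFeatures(EnumSet<SketchNumber> numbers, EnumSet<SketchGender> genders) {
        this.numbers = numbers;
        this.genders = genders;
    }

    // Two tokens agree if they share at least one value for every feature.
    boolean agreesWith(AgreementFeatures other) {
        return overlaps(numbers, other.numbers) && overlaps(genders, other.genders);
    }

    private static <E extends Enum<E>> boolean overlaps(EnumSet<E> a, EnumSet<E> b) {
        EnumSet<E> common = EnumSet.copyOf(a); // copy so the token's own set is untouched
        common.retainAll(b);
        return !common.isEmpty();
    }
}
```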

Now, for this to work the attributes should be specified just like they 
are in Corpus Query Language (CQL).


 Internally, it could be expanded to mean:
 [{pos=verb, person=1|2, number=singular, tense=present},
   {pos=verb, person=1|2|3, number=plural, tense=present}]


So for the word tagged as VBP we could have

<token pos="verb" person="1|2" number="sg" en:tense="present"/>

or

<token pos="verb" person="3" number="pl" en:tense="present"/>

Both would match a word with VBP. (Note that the disambiguator could 
even remove one of the interpretations to make it clear that this is a 
plural use of the token!)

Above, I used a mixture of attributes without namespaces (these would be 
universal for all languages) and ones with namespaces, like tense, which 
is not present in all languages. We can look at the proposed Universal 
Tagset to find universal categories:

https://code.google.com/p/universal-pos-tags/

Note also that one could write:

<token pos="verb"/>

And this would be equivalent to:

<token postag="VB.*" postag_regexp="yes"/>

But possibly a lot faster. The new syntax is also much easier to 
read, and would be equivalent to the CQL query:

[pos="verb"]

Similarly for words in comparative degree, where you now have to use 
(for English):

<token postag="..R" postag_regexp="yes"/>

You could simply say:

<token degree="com"/>
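
A small sketch of that equivalence for the Penn tags, in Java. AttributeMatchSketch is a made-up class, the value "sup" for superlatives is an assumption, and the regexps are taken to match the whole tag. The point is that posOf and degreeOf would run once when a token is tagged, after which a rule only compares short strings.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of old regexp matching and new attribute matching.
class AttributeMatchSketch {
    private static final Pattern VERB = Pattern.compile("VB.*");
    private static final Pattern COMPARATIVE = Pattern.compile("..R");

    // Old style: run the regexp against the tag for every check.
    static boolean isVerbByRegex(String tag) {
        return VERB.matcher(tag).matches();
    }

    static boolean isComparativeByRegex(String tag) {
        return COMPARATIVE.matcher(tag).matches();
    }

    // New style: derive the attributes once per token at tagging time.
    static String posOf(String tag) {
        if (tag.startsWith("VB")) return "verb";
        if (tag.startsWith("JJ")) return "adjective";
        if (tag.startsWith("RB")) return "adverb";
        return "other";
    }

    static String degreeOf(String tag) {
        if (tag.equals("JJR") || tag.equals("RBR")) return "com";
        if (tag.equals("JJS") || tag.equals("RBS")) return "sup";
        return "none";
    }
}
```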

Basically, by making attributes separate we could have a much easier way 
to write complex rules without problems as to how to specify POS tags. I 
consider myself to be a power user but with a complex Polish tagset it 
is sometimes really difficult to specify the features I want using 
regexes: the tagset itself creates pretty complex and lengthy strings 
and a lot of time is needed to make sure that the regex matches.

Regards,
Marcin


 Regards
Daniel







Re: readable POS tags

2014-03-26 Thread Marcin Miłkowski
W dniu 2014-03-26 15:20, Mike Unwalla pisze:
 I agree that backward compatibility is important. Without backward
 compatibility, the proposed change means that the content of disambiguation
 files and grammar files must be changed. That is a huge task.

 Even if you develop a utility that lets people convert files to the new
 format, there remains a problem of the conversion of non-standard postags.
 For example, 'Adding only POS tags or tokens' shows how to add a
 non-standard postag 'UP'
 (http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
 be the effect of the proposed change on non-standard postags?

I am afraid that parsing these new tags would be difficult so we would 
need to define new attributes and values in the disambiguator; but it 
would be difficult to specify those in the rules as rules are written in 
static XML so they cannot use a construct that is defined outside the 
schema, and we'd need to modify the schema. But as we would retain the 
old postag interface (I think this is a must!) this would not be a problem.

Regards,
Marcin


 Regards,

 Mike Unwalla
 Contact: www.techscribe.co.uk/techw/contact.htm

 -----Original Message-----
 From: Dominique Pellé [mailto:dominique.pe...@gmail.com]
 <snip>

 In any case, we need to preserve rule backward compatibility.
 I cannot imagine having to change all rules in
 all languages, at least not manually. It would be a lot of
 error-prone work.
 <snip>








Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-26 17:49, Marcin Miłkowski wrote:

 No, that would be horrible, as this is not an improvement. The problem
 is not that tags are cryptic and short;

That's also a problem, but not so much for power users and for everybody 
else we will be able to solve that in the user interface (i.e. rule 
editor).

 So for the word tagged as VBP we could have
 
 <token pos="verb" person="1|2" number="sg" en:tense="present"/>
 
 or
 
 <token pos="verb" person="3" number="pl" en:tense="present"/>

That makes sense.

Regards
  Daniel




readable POS tags

2014-03-25 Thread Daniel Naber
Hi,

I've written an overview of how we could use readable POS tags in LT:

http://wiki.languagetool.org/readable-part-of-speech-tags

The core part however - what these new POS tags actually look like - 
is still missing. Any input on the overview and ideas about that core 
part are welcome.

Regards
  Daniel




Re: readable POS tags

2014-03-25 Thread Dave Pawson
On 25 March 2014 08:35, Daniel Naber daniel.na...@languagetool.org wrote:
 Hi,

 I've written an overview of how we could use readable POS tags in LT:

 http://wiki.languagetool.org/readable-part-of-speech-tags

 The core part however - what these new POS tags actually look like -
 is still missing. Any input on the overview and ideas about that core
 part are welcome.

A glossary, but only if this document is meant for non-grammarians?
E.g. instead of DET, use: determiner  term not understood.


Options for tags
http://en.wikipedia.org/wiki/Part-of-speech_tagging
or
http://www.cis.upenn.edu/~treebank/ is said to be more complete than most.

HTH

-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: readable POS tags

2014-03-25 Thread Daniel Naber
On 2014-03-25 11:07, Marcin Miłkowski wrote:

 For all I can see, no HashMaps are required at all, just a consistent
 way of understanding the values in class members.

So instead of just adding the POS tag we get from Morfologik to our 
AnalyzedToken object as a string, we interpret it and store something 
like pos = preposition, case = accusative. Is that what you mean? 
Then indeed 'mapping' isn't a good term for that. We could make all keys 
and values enums, so we have type safety and don't have to deal with 
strings. Do we need type safety so that you cannot use a Polish-only 
value for German even in Java code? I think that may not be needed.
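
A sketch of what enum keys and values could look like. All names here (FeatureKey, FeatureValue, EnumTokenFeatures) are assumptions for illustration. Note that a single value enum shared by all languages, as below, gives type safety against typos but does not stop anyone from using a Polish-only value for German, which is the level of safety suggested above as probably sufficient.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical keys and values; one shared value enum for all languages.
enum FeatureKey { POS, CASE, NUMBER }
enum FeatureValue { PREPOSITION, NOUN, VERB, ACCUSATIVE, NOMINATIVE, SINGULAR, PLURAL }

class EnumTokenFeatures {
    private final Map<FeatureKey, FeatureValue> values = new EnumMap<>(FeatureKey.class);

    // Fluent setter so that a reading can be built in one expression.
    EnumTokenFeatures with(FeatureKey key, FeatureValue value) {
        values.put(key, value);
        return this;
    }

    // Returns null if the feature is not set for this token.
    FeatureValue get(FeatureKey key) {
        return values.get(key);
    }

    boolean has(FeatureKey key, FeatureValue value) {
        return values.get(key) == value;
    }
}
```

The example from above would then be written as new EnumTokenFeatures().with(FeatureKey.POS, FeatureValue.PREPOSITION).with(FeatureKey.CASE, FeatureValue.ACCUSATIVE).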

Regards
  Daniel




Re: readable POS tags

2014-03-25 Thread Marcin Miłkowski
W dniu 2014-03-25 13:29, Daniel Naber pisze:
 On 2014-03-25 11:07, Marcin Miłkowski wrote:

 For all I can see, no HashMaps are required at all, just a consistent
 way of understanding the values in class members.
 So instead of just adding the POS tag we get from Morfologik to our
 AnalyzedToken object as a string, we interpret it and store something
 like pos = preposition, case = accusative. Is that what you mean?

Exactly. We pay the computational price just once, while parsing the 
tag, and given that most tags are pretty nicely structured (except 
Penn!), we could parse them very quickly.
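
"Paying the price once" could look roughly like this: each distinct tag string is parsed the first time it is seen and the result is reused afterwards. The class CachingTagParser is hypothetical, and the fixed field order (pos, number, case, gender) for a colon-separated tag such as subst:sg:nom:m1 is a simplifying assumption; real tagsets use different fields for different word classes.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: parse each distinct tag string only once.
class CachingTagParser {
    // Assumed positional layout of a colon-separated tag.
    private static final String[] FIELD_NAMES = {"pos", "number", "case", "gender"};

    private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();
    // Counts real parses, only so that the caching effect can be observed.
    final AtomicInteger parseCount = new AtomicInteger();

    Map<String, String> featuresOf(String tag) {
        return cache.computeIfAbsent(tag, this::parse);
    }

    private Map<String, String> parse(String tag) {
        parseCount.incrementAndGet();
        String[] fields = tag.split(":");
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 0; i < fields.length && i < FIELD_NAMES.length; i++) {
            features.put(FIELD_NAMES[i], fields[i]);
        }
        return Collections.unmodifiableMap(features);
    }
}
```

The cache is keyed by the original tag string, so that string stays available unchanged next to the parsed features.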

Yet I would still store the string as well: we use it for the 
synthesizer and changing the synthesizer is not as trivial as parsing 
tags (we might, of course, recreate the tag from the keys and values but 
this is just additional computational overhead). Also, some rules may be 
hard to express without regexes on POS tags (we will still need regexes 
in case of a disjunction of several different POS tags, I'm afraid, but 
these regexes will be quite rare).

 Then indeed 'mapping' isn't a good term for that. We could make all keys
 and values enums, so we have type safety and don't have to deal with
 strings. Do we need type safety so that you cannot use a Polish-only
 value for German even in Java code? I think that may not be needed.

As long as it's enforced in XML namespaces (which should be doable), we 
should be fine. Developers should know what they're doing. LOL ;)

Regards,
  Marcin


 Regards
Daniel






Re: readable POS tags

2014-03-25 Thread Dominique Pellé
Daniel Naber wrote:

 Hi,

 I've written an overview of how we could use readable POS tags in LT:

 http://wiki.languagetool.org/readable-part-of-speech-tags

 The core part however - what these new POS tags actually look like -
 is still missing. Any input on the overview and ideas about that core
 part are welcome.

 Regards
   Daniel


I'm not convinced it will simplify things. In fact I suspect it would
make rules more obscure, more verbose and possibly lose
power compared to using regexp.  Power users know regexp
well as they are used in many programs so they don't have to
learn something new. Power users also like the conciseness
of regexp.

Anybody who wants to seriously contribute to LT needs to
understand regexp anyway (not just for POS tags). And
knowing regexp is useful for plenty of other programs.

Good documentation of POS tags for each language should
suffice in my opinion.

POS tags are also very different in each language and for
good reasons as grammar of the languages can be quite different.

In any case, we need to preserve rule backward compatibility.
I cannot imagine having to change all rules in
all languages, at least not manually. It would be a lot of
error-prone work.

On the other hand, rule matching may be faster without regexp.

Regards
Dominique
