Re: readable POS tags

2014-05-07 Thread Marcin Miłkowski
On 2014-05-07 19:56, Daniel Naber wrote:
> On 2014-05-07 19:07, Marcin Miłkowski wrote:
>
>> unification. I still don't get why German doesn't use it for
>> disambiguation, for example.
> Maybe because nobody has seen an urgent need for that yet. I don't work
> that much on the German rules, but I'm generally okay with the way they
> work.
Well, after I implement chunking via disambiguation, unification could 
easily mark up nominal phrases composed of the article, adjectives, and 
the noun. It's also useful to discard ambiguous readings before 
synthesizing a sensible suggestion.

Regards,
Marcin

--
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
___
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: readable POS tags

2014-05-07 Thread Daniel Naber
On 2014-05-07 19:07, Marcin Miłkowski wrote:

> unification. I still don't get why German doesn't use it for
> disambiguation, for example.

Maybe because nobody has seen an urgent need for that yet. I don't work 
that much on the German rules, but I'm generally okay with the way they 
work.

Regards
  Daniel




Re: readable POS tags

2014-05-07 Thread Marcin Miłkowski
On 2014-05-07 12:10, Daniel Naber wrote:
> On 2014-04-08 14:44, Daniel Naber wrote:
>
>> I have now added a
>> branch ("readable-pos-tags") for this, simply because the changes are
>> getting so complex. It's still incomplete and buggy.
>
> As you may have noticed, I did some work in this branch. You can see it
> at
> https://github.com/languagetool-org/languagetool/tree/readable-pos-tags
>
> Although it basically works for English and German, the changes have not
> been merged back to the master branch as I'm not happy with them.
> Writing a class that turns the internal POS tags (like "NN") into
> structured POS tags (like "pos=noun, number=singular") isn't very
> complicated, but still quite some work and it's obviously
> language-specific. These classes should be developed by people who
> actually speak the language. I'm not sure if that would actually happen
> so we might have several languages that only support the old POS tags
> for years and I'd like to avoid that.

We might, but that's the general principle for other features such as 
unification. I still don't get why German doesn't use it for 
disambiguation, for example. I could write up some simple rules to leave 
only token readings that agree with each other.
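
A minimal sketch of such an agreement rule, written in Python rather than in LT's XML rule syntax. The feature names, the simplified readings for "der Mann", and the `unify` helper are illustrative assumptions, not LanguageTool's Unifier API. The idea is only that a reading survives if it takes part in at least one combination in which all tokens agree.

```python
from itertools import product


def agree(a, b, features):
    """True if two readings share the same value for every listed feature."""
    return all(a.get(f) == b.get(f) for f in features)


def unify(tokens, features=("gender", "number", "case")):
    """Keep only readings that occur in some combination where all tokens agree.

    `tokens` is a list of tokens; each token is a list of reading dicts.
    """
    kept = [set() for _ in tokens]
    for combo in product(*(range(len(t)) for t in tokens)):
        readings = [tokens[i][j] for i, j in enumerate(combo)]
        if all(agree(readings[0], r, features) for r in readings[1:]):
            for i, j in enumerate(combo):
                kept[i].add(j)
    return [[t[j] for j in sorted(k)] for t, k in zip(tokens, kept)]


# Simplified (hypothetical) readings for German "der Mann".
der = [
    {"gender": "masc", "number": "sg", "case": "nom"},
    {"gender": "fem", "number": "sg", "case": "dat"},
    {"gender": "fem", "number": "sg", "case": "gen"},
]
mann = [
    {"gender": "masc", "number": "sg", "case": "nom"},
    {"gender": "masc", "number": "sg", "case": "acc"},
    {"gender": "masc", "number": "sg", "case": "dat"},
]

result = unify([der, mann])
```

With these toy readings, only the masculine nominative singular reading is left on both tokens.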

>
> Then there's the general problem that we cannot move all old POS tags to
> the new ones. It's not possible to do automatically, and it's also not
> desirable, as sometimes the old POS tags are much more compact. So we'd
> have two ways to do the same thing, basically forever.

I don't see it as particularly wrong. For languages that use the 
Unifier, we have to run regexes multiple times on the same token, and 
that slows processing down. With attributes, we could make it much 
faster. So this dual route would actually speed up Catalan, French, and 
Polish (and maybe other languages as well).

Regards,
Marcin


>
> So for now, I will keep the code in that branch and not merge it...
>
> Regards
>Daniel


Re: readable POS tags

2014-05-07 Thread Daniel Naber
On 2014-04-08 14:44, Daniel Naber wrote:

> I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.

As you may have noticed, I did some work in this branch. You can see it 
at
https://github.com/languagetool-org/languagetool/tree/readable-pos-tags

Although it basically works for English and German, the changes have not 
been merged back to the master branch as I'm not happy with them. 
Writing a class that turns the internal POS tags (like "NN") into 
structured POS tags (like "pos=noun, number=singular") isn't very 
complicated, but still quite some work and it's obviously 
language-specific. These classes should be developed by people who 
actually speak the language. I'm not sure if that would actually happen 
so we might have several languages that only support the old POS tags 
for years and I'd like to avoid that.

Then there's the general problem that we cannot move all old POS tags to 
the new ones. It's not possible to do automatically, and it's also not 
desirable, as sometimes the old POS tags are much more compact. So we'd 
have two ways to do the same thing, basically forever.

So for now, I will keep the code in that branch and not merge it...

Regards
  Daniel




XSD and namespaces (was: readable POS tags)

2014-04-10 Thread Marcin Miłkowski
On 2014-04-09 18:41, Marcin Miłkowski wrote:
> On 2014-04-09 16:31, Daniel Naber wrote:
>> On 2014-04-09 09:44, Marcin Miłkowski wrote:
>>
>>> So I'm not sure what the problem is. Basically, the number of <wd>
>>> elements has to correspond to the number of matched tokens. And
>>> that's it.
>>
>> It's working now in the sense that the tests work and one English
>> example rule (CONFUSION_OF_OUR_OUT) uses the new tags.
>>
>> Now how exactly do we map the Penn Tagset to the new structure? Here's
>> the current mapping with a lot of TODOs, let me know if you have an idea
>> how to get it right:
>>
>> https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java
>>
>> Also, have you used namespaces on attributes in XSD? If so, could you
>> provide a small example on how to write the XSD so that an element
>> carrying a namespaced attribute such as en:tense="simple_past" can be
>> validated?
>
> No, but Dave Pawson is an XML expert. I guess he will be happy to help.

And here's a tutorial:

http://www.liquid-technologies.com/Tutorials/XmlSchemas/XsdTutorial_04.aspx

Regards,
Marcin
>
> Regards,
> Marcin


Re: readable POS tags

2014-04-09 Thread Marcin Miłkowski
On 2014-04-09 16:31, Daniel Naber wrote:
> On 2014-04-09 09:44, Marcin Miłkowski wrote:
>
>> So I'm not sure what the problem is. Basically, the number of <wd>
>> elements has to correspond to the number of matched tokens. And
>> that's it.
>
> It's working now in the sense that the tests work and one English
> example rule (CONFUSION_OF_OUR_OUT) uses the new tags.
>
> Now how exactly do we map the Penn Tagset to the new structure? Here's
> the current mapping with a lot of TODOs, let me know if you have an idea
> how to get it right:
>
> https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java
>
> Also, have you used namespaces on attributes in XSD? If so, could you
> provide a small example on how to write the XSD so that an element
> carrying a namespaced attribute such as en:tense="simple_past" can be
> validated?

No, but Dave Pawson is an XML expert. I guess he will be happy to help.

Regards,
Marcin



Re: readable POS tags

2014-04-09 Thread Daniel Naber
On 2014-04-09 09:44, Marcin Miłkowski wrote:

> So I'm not sure what the problem is. Basically, the number of <wd>
> elements has to correspond to the number of matched tokens. And
> that's it.

It's working now in the sense that the tests work and one English 
example rule (CONFUSION_OF_OUR_OUT) uses the new tags.

Now how exactly do we map the Penn Tagset to the new structure? Here's 
the current mapping with a lot of TODOs, let me know if you have an idea 
how to get it right:

https://github.com/languagetool-org/languagetool/blob/readable-pos-tags/languagetool-language-modules/en/src/main/java/org/languagetool/tagging/en/EnglishTagger.java

Also, have you used namespaces on attributes in XSD? If so, could you 
provide a small example on how to write the XSD so that an element 
carrying a namespaced attribute such as en:tense="simple_past" can be 
validated?

Regards
  Daniel




Re: readable POS tags

2014-04-09 Thread Marcin Miłkowski
On 2014-04-08 14:44, Daniel Naber wrote:
> On 2014-04-08 08:43, Marcin Miłkowski wrote:
>
>>> Internally, we now have information like this: postag=VBD, pos=verb,
>>> tense=past (etc.). But the disambiguation only works on the old tag? I
>>> guess I will need to resolve VBD here so the action works on both the
>>> old and the new representation?
>>
>> I think the only thing needed is to parse the tags again, if they are
>> different.
>
> Although I'm not sure if I understood what you meant, I have now added a
> branch ("readable-pos-tags") for this, simply because the changes are
> getting so complex. It's still incomplete and buggy.
>
> Here's the basic idea of my changes in that branch: class TokenPoS is
> the new structured representation of POS tags. EnglishTagger returns one
> or more TokenPoS for a given traditional POS tag (like NNS). More than
> one will be returned in cases that are ambiguous in the new
> representation, e.g. "walk/VBP" can be person=1|2 number=singular and
> person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
>
> Currently the problem is this (when running the tests):
> Caused by: org.xml.sax.SAXException: English rule error. The number of
> interpretations specified with wd: 5 must be equal to the number of
> matched tokens (1)
>Line: 1525, column: 12.
>
> I roughly understand what the problem is but not yet the solution... any
> help is welcome, also any hints that what I'm doing in that branch might
> be wrong.

Hm, the line 1525 is:

 

So I'm not sure what the problem is. Basically, the number of <wd> 
elements has to correspond to the number of matched tokens. And 
that's it.
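
The constraint behind that error message can be sketched as a simple count check. This is a Python illustration only: `check_wd_count` and its arguments are hypothetical names, and the real check lives in LT's Java rule loader, which may work differently.

```python
def check_wd_count(wd_elements, matched_tokens):
    """Raise ValueError unless there is exactly one interpretation per matched token."""
    if len(wd_elements) != len(matched_tokens):
        raise ValueError(
            f"The number of interpretations specified with wd: {len(wd_elements)} "
            f"must be equal to the number of matched tokens ({len(matched_tokens)})"
        )


# Five interpretations for a single matched token reproduce the reported message.
try:
    check_wd_count(["wd"] * 5, ["walk"])
except ValueError as err:
    message = str(err)
```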

I thought that TokenPoS should be initialized after tagging and then 
updated any time the disambiguator makes a change to 
AnalyzedToken.posTag. So whenever the disambiguator changes the values 
of the token, you need to make sure that the TokenPoS is up to date by 
re-running the POS tag parser. What's the problem with this approach? I 
assume that we are not talking (yet) about disambiguation rules that 
change TokenPoS only. That case is a bit more tricky, as some tags might 
be more ambiguous than TokenPoS values, so we'd have to leave those 
ambiguous tags and prune or change only TokenPoS values. Luckily, I 
think only the Penn tagset is so ambiguous. Structured tagsets (such as 
German Morphy or the Polish one) should be much easier to interpret.
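
A rough sketch of the "re-run the parser on every change" idea, in Python for brevity. The class below is a toy stand-in and not LT's real AnalyzedToken; the colon-separated tag layout and the field order in `FIELDS` are assumptions made for the example.

```python
# Assumed field order for a colon-separated, Morphy-like tag (hypothetical).
FIELDS = ("pos", "case", "number", "gender")


def parse_tag(tag):
    """Split a structured tag string into a feature dict."""
    return dict(zip(FIELDS, tag.split(":")))


class AnalyzedToken:
    """Toy token whose structured view is rebuilt whenever the tag changes."""

    def __init__(self, token, pos_tag):
        self.token = token
        self.pos_tag = pos_tag  # goes through the setter below

    @property
    def pos_tag(self):
        return self._pos_tag

    @pos_tag.setter
    def pos_tag(self, value):
        self._pos_tag = value
        # Re-parse so the structured view never lags behind the string tag.
        self.token_pos = parse_tag(value)


tok = AnalyzedToken("Mann", "SUB:NOM:SIN:MAS")
before = dict(tok.token_pos)
tok.pos_tag = "SUB:AKK:SIN:MAS"  # what a disambiguation action might do
after = dict(tok.token_pos)
```

Because the structured view is derived from the string tag, old disambiguation actions keep working without knowing about the new representation.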

Regards,
Marcin



Re: readable POS tags

2014-04-08 Thread Daniel Naber
On 2014-04-08 08:43, Marcin Miłkowski wrote:

>> Internally, we now have information like this: postag=VBD, pos=verb,
>> tense=past (etc.). But the disambiguation only works on the old tag? I
>> guess I will need to resolve VBD here so the action works on both the
>> old and the new representation?
> 
> I think the only thing needed is to parse the tags again, if they are
> different.

Although I'm not sure if I understood what you meant, I have now added a 
branch ("readable-pos-tags") for this, simply because the changes are 
getting so complex. It's still incomplete and buggy.

Here's the basic idea of my changes in that branch: class TokenPoS is 
the new structured representation of POS tags. EnglishTagger returns one 
or more TokenPoS for a given traditional POS tag (like NNS). More than 
one will be returned in cases that are ambiguous in the new 
representation, e.g. "walk/VBP" can be person=1|2 number=singular and 
person=1|2|3 number=plural. Each AnalyzedToken has one TokenPoS.
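
As an illustration of that mapping, here is a small Python sketch. The table covers only a few Penn tags, the dictionary layout is an assumption, and the actual EnglishTagger in the branch is Java code that may be organized quite differently.

```python
# Hypothetical mapping from traditional Penn tags to structured readings.
PENN_TO_STRUCTURED = {
    "NN": [{"pos": "noun", "number": "singular"}],
    "NNS": [{"pos": "noun", "number": "plural"}],
    "VBD": [{"pos": "verb", "tense": "past"}],
    # VBP is ambiguous in the structured representation, so it yields two readings.
    "VBP": [
        {"pos": "verb", "person": {1, 2}, "number": "singular", "tense": "present"},
        {"pos": "verb", "person": {1, 2, 3}, "number": "plural", "tense": "present"},
    ],
}


def structured_readings(penn_tag):
    """Return the structured readings for a Penn tag (empty list if unknown)."""
    return PENN_TO_STRUCTURED.get(penn_tag, [])


walk_readings = structured_readings("VBP")
```

Returning an empty list for unknown tags is one possible way to leave non-standard tags untouched; the thread does not settle how those should be handled.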

Currently the problem is this (when running the tests):
Caused by: org.xml.sax.SAXException: English rule error. The number of 
interpretations specified with wd: 5 must be equal to the number of 
matched tokens (1)
  Line: 1525, column: 12.

I roughly understand what the problem is but not yet the solution... any 
help is welcome, also any hints that what I'm doing in that branch might 
be wrong.

Regards
  Daniel




Re: readable POS tags

2014-04-07 Thread Marcin Miłkowski
On 2014-04-07 23:01, Daniel Naber wrote:
> On 2014-03-25 09:35, Daniel Naber wrote:
>
>> I've written an overview of how we could use readable POS tags in LT:
>>
>> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> I'm writing a prototype implementation of this for English now. But
> there's one point where I'm stuck. Maybe I'm missing something obvious.
> Everything is fine for grammar.xml: we have new tags but keep the old
> ones, both work. But what about disambiguation.xml? What does it now
> mean to have something like this:
>
> 
>
> Internally, we now have information like this: postag=VBD, pos=verb,
> tense=past (etc.). But the disambiguation only works on the old tag? I
> guess I will need to resolve VBD here so the action works on both the
> old and the new representation?

I think the only thing needed is to parse the tags again, if they are 
different.

> What if there's an action like this?
> Will I need to expand the 'JJ.?' against all known tags and then apply
> the change to the resolved (new) representation?
>
> 
Again, parse the tags again, discard the old info.

Regards,
Marcin

>
> Regards
>Daniel


Re: readable POS tags

2014-04-07 Thread Daniel Naber
On 2014-03-25 09:35, Daniel Naber wrote:

> I've written an overview of how we could use readable POS tags in LT:
> 
> http://wiki.languagetool.org/readable-part-of-speech-tags

I'm writing a prototype implementation of this for English now. But 
there's one point where I'm stuck. Maybe I'm missing something obvious. 
Everything is fine for grammar.xml: we have new tags but keep the old 
ones, both work. But what about disambiguation.xml? What does it now 
mean to have something like this:



Internally, we now have information like this: postag=VBD, pos=verb, 
tense=past (etc.). But the disambiguation only works on the old tag? I 
guess I will need to resolve VBD here so the action works on both the 
old and the new representation? What if there's an action like this? 
Will I need to expand the 'JJ.?' against all known tags and then apply 
the change to the resolved (new) representation?
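
One way to read the expansion step asked about here, sketched in Python. The list of known tags is a small hand-picked subset of the Penn tagset, and `expand` is a hypothetical helper, not an existing LT method.

```python
import re

# A hand-picked subset of Penn tags, for illustration only.
KNOWN_TAGS = ["JJ", "JJR", "JJS", "NN", "NNS", "RB", "RBR", "RBS", "VBD", "VBP"]


def expand(tag_regex, known_tags=KNOWN_TAGS):
    """Return every known tag that the regular expression matches completely."""
    pattern = re.compile(tag_regex)
    return [tag for tag in known_tags if pattern.fullmatch(tag)]


expanded = expand("JJ.?")
```

Each tag in the expanded list could then be converted to its structured form, which is essentially the "parse the tags again" route suggested in the reply.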



Regards
  Daniel




Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-26 17:49, Marcin Miłkowski wrote:

> No, that would be horrible, as this is not an improvement. The problem
> is not that tags are cryptic and short;

That's also a problem, but not so much for power users and for everybody 
else we will be able to solve that in the user interface (i.e. rule 
editor).

> So for the word tagged as VBP we could have
> 
> 
> 
> or
> 
> 

That makes sense.

Regards
  Daniel




Re: readable POS tags

2014-03-26 Thread Marcin Miłkowski
On 2014-03-26 17:06, Dave Pawson wrote:
> On 26 March 2014 12:51, Daniel Naber wrote:
>
>> Any ideas on how the VBP tag (in English) might fit into this approach,
>> i.e. "not 3rd person singular"? Will we need to introduce a tag like
>> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
>>
>> Internally, it could be expanded to mean:
>> [{pos=verb, person=1|2, number=singular, tense=preset},
>>{pos=verb, person=1|2|3, number=plural, tense=preset}]
>
>
> I have a natural dislike for negatives, so person=1|2 IMHO is the better 
> option?
> pos="1or2singular" perhaps?
>
> for {pos=verb, person=1|2|3, number=plural, tense=preset}] is there 
> redundancy?
> If 1,2 or 3 are included it is not necessary?
> pos="plural,present" (is preset a typo?)
> {pos=verb,  number=plural, tense=preset}]
>
> Is there no word for singular OR plural?

Fortunately, no :)

Anyway, see my reply to Daniel. I don't think "Not3rdblablah" is useful 
at all.

Regards,
  Marcin



Re: readable POS tags

2014-03-26 Thread Marcin Miłkowski
On 2014-03-26 15:20, Mike Unwalla wrote:
> I agree that backward compatibility is important. Without backward
> compatibility, the proposed change means that the content of disambiguation
> files and grammar files must be changed. That is a huge task.
>
> Even if you develop a utility that lets people convert files to the new
> format, there remains a problem of the conversion of non-standard postags.
> For example, 'Adding only POS tags or tokens' shows how to add a
> non-standard postag 'UP'
> (http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
> be the effect of the proposed change on non-standard postags?

I am afraid that parsing these new tags would be difficult, so we would 
need to define new attributes and values in the disambiguator. It would 
also be difficult to specify those in the rules: rules are written in 
static XML, so they cannot use a construct that is defined outside the 
schema, and we'd need to modify the schema. But as we would retain the 
old postag interface (I think this is a must!), this would not be a problem.

Regards,
Marcin

>
> Regards,
>
> Mike Unwalla
> Contact: www.techscribe.co.uk/techw/contact.htm
>
> -----Original Message-----
> From: Dominique Pellé
>
> In any case, we need to preserve rule backward compatibility.
> I cannot imagine having to change all rules in
> all languages, at least not manually. It would be a lot of
> error-prone work.
> 


Re: readable POS tags

2014-03-26 Thread Marcin Miłkowski
On 2014-03-26 13:51, Daniel Naber wrote:
> On 2014-03-25 14:24, Marcin Miłkowski wrote:
>
>>> So instead of just adding the POS tag we get from Morfologik to our
>>> AnalyzedToken object as a string, we interpret it and store something
>>> like pos = preposition, case = accusative. Is that what you mean?
>>
>> Exactly.
>
> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.

No, that would be horrible, as this is not an improvement. The problem 
is not that tags are cryptic and short; it is that they do not make 
features easily available separately.

My use case for readable pos tags is also speed and simplicity for 
unification (rules that use agreement between words). It is simply 
faster to specify features by citing appropriate attributes that can be 
processed once instead of running a regexp every time the sentence is 
processed in a unification rule. For Catalan, Polish, and French this 
will be a huge time improvement.
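
To make the two routes concrete, here is a small Python sketch. The colon-separated tag layout, the feature values, and both helper functions are assumptions for illustration; LT's Unifier is Java and configured differently. The first function extracts features from the tag strings with regular expressions on every call, while the second compares attributes that were parsed once.

```python
import re

# Assumed tag layout: pos:number:case:gender (hypothetical, Polish-like).
FIELDS = ("pos", "number", "case", "gender")
FEATURE_PATTERNS = (
    r":(sg|pl)(?::|$)",
    r":(nom|gen|dat|acc)(?::|$)",
    r":(m|f|n)(?::|$)",
)


def agree_by_regex(tag_a, tag_b):
    """Regex route: re-extract every feature from both strings on each call."""
    for feature_re in FEATURE_PATTERNS:
        a = re.search(feature_re, tag_a)
        b = re.search(feature_re, tag_b)
        if not a or not b or a.group(1) != b.group(1):
            return False
    return True


def parse(tag):
    """Attribute route, step 1: split the tag once and keep the features."""
    return dict(zip(FIELDS, tag.split(":")))


def agree_by_attributes(a, b):
    """Attribute route, step 2: plain comparisons on the parsed features."""
    return all(a[f] == b[f] for f in ("number", "case", "gender"))


adj, noun = "adj:sg:nom:m", "subst:sg:nom:m"
same_by_regex = agree_by_regex(adj, noun)
same_by_attributes = agree_by_attributes(parse(adj), parse(noun))
```

Both routes give the same answer; the difference is that the parsed dictionaries can be built once per token and reused by every rule.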

Now, for this to work the attributes should be specified just like they 
are in Corpus Query Language (CQL).

>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=preset},
>{pos=verb, person=1|2|3, number=plural, tense=preset}]


So for the word tagged as VBP we could have



or



Both would match a word with VBP. (Note that the disambiguator could 
even remove one of the interpretations to make it clear that this is a 
plural use of the token!)

Above, I used a mixture of attributes without namespaces (these would be 
universal for all languages) and ones with namespaces, like tense, which 
is not present in all languages. We can look at the proposed Universal 
Tagset to find universal categories:

https://code.google.com/p/universal-pos-tags/

Note also that one could write:



And this would be equivalent to:



But possibly a lot faster. The new syntax is also much easier to 
read, and would be equivalent to the CQL query:

[pos="verb"]

Similarly for words in the comparative degree, where you now have to use 
(for English):



You could simply say:



Basically, by making attributes separate we could have a much easier way 
to write complex rules without problems as to how to specify POS tags. I 
consider myself to be a power user, but with the complex Polish tagset it 
is sometimes really difficult to specify the features I want using 
regexes: the tagset itself creates pretty complex and lengthy strings, 
and a lot of time is needed to make sure that the regex matches.
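
A sketch of what attribute-based matching could look like, in Python. The `matches` helper, the attribute names (including `degree`), and the sample readings are hypothetical; only the "1|2" alternative notation and the pos/person/number/tense features come from this thread.

```python
def matches(reading, **constraints):
    """True if the reading has an allowed value for every constraint.

    Constraint values use "|" for alternatives, as in person="1|2".
    """
    for name, allowed in constraints.items():
        if reading.get(name) not in allowed.split("|"):
            return False
    return True


# Hypothetical structured readings.
walk = {"pos": "verb", "person": "1", "number": "singular", "tense": "present"}
bigger = {"pos": "adjective", "degree": "comparative"}

is_verb = matches(walk, pos="verb")                       # like CQL [pos="verb"]
is_1_or_2 = matches(walk, pos="verb", person="1|2")
is_comparative = matches(bigger, degree="comparative")
```

Each feature is checked on its own, so a rule author names only the attributes that matter instead of writing one regex over the whole tag string.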

Regards,
Marcin

>
> Regards
>Daniel


Re: readable POS tags

2014-03-26 Thread Dave Pawson
On 26 March 2014 12:51, Daniel Naber wrote:

> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.
>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=preset},
>   {pos=verb, person=1|2|3, number=plural, tense=preset}]


I have a natural dislike for negatives, so person=1|2 IMHO is the better option?
pos="1or2singular" perhaps?

for {pos=verb, person=1|2|3, number=plural, tense=preset}] is there redundancy?
If 1,2 or 3 are included it is not necessary?
pos="plural,present" (is preset a typo?)
{pos=verb,  number=plural, tense=preset}]

Is there no word for singular OR plural?

regards

-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



RE: readable POS tags

2014-03-26 Thread Mike Unwalla
I agree that backward compatibility is important. Without backward
compatibility, the proposed change means that the content of disambiguation
files and grammar files must be changed. That is a huge task.

Even if you develop a utility that lets people convert files to the new
format, there remains a problem of the conversion of non-standard postags.
For example, 'Adding only POS tags or tokens' shows how to add a
non-standard postag 'UP'
(http://wiki.languagetool.org/developing-a-disambiguator#toc8). What will
be the effect of the proposed change on non-standard postags?

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-----Original Message-----
From: Dominique Pellé

In any case, we need to preserve rule backward compatibility.
I cannot imagine having to change all rules in
all languages, at least not manually. It would be a lot of
error-prone work.





Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-25 14:24, Marcin Miłkowski wrote:

>> So instead of just adding the POS tag we get from Morfologik to our
>> AnalyzedToken object as a string, we interpret it and store something
>> like pos = preposition, case = accusative. Is that what you mean?
> 
> Exactly.

Any ideas on how the VBP tag (in English) might fit into this approach, 
i.e. "not 3rd person singular"? Will we need to introduce a tag like 
"pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.

Internally, it could be expanded to mean:
[{pos=verb, person=1|2, number=singular, tense=preset},
  {pos=verb, person=1|2|3, number=plural, tense=preset}]

Regards
  Daniel




Re: readable POS tags

2014-03-26 Thread R.J. Baars
About the difficulties in learning LT...

I started out long ago with LT as an IT pro and language hobbyist.

The help of Daniel and Marcin at that time made it possible to get started.
Starting with XML and plain-word rules was not that difficult. Trial and
error, and some help from this list, was enough.

Complexity rose when postags had to be made. That takes a lot of language
knowledge, but the most complex part is getting the software in place to
generate the lists.

In general, there is a lot of focus on the programming-tool level
(IDE, Git, etc.), which is too complex for non-programmers. Making things
'wysiwyg' and web-based is a good direction.

(Examples: translating the site content to any language from the site
itself, or editing postags and words for any language online, with
automatic generation of the dictionaries.)

Maybe there should be a 'programming' focus, a 'language' focus, and a
'rule' focus area?

Just my 2 cents.

Ruud




Re: readable POS tags

2014-03-26 Thread Dave Pawson
On 26 March 2014 10:49, Daniel Naber wrote:
> On 2014-03-25 21:59, Dominique Pellé wrote:
>
>> power compared to using regexp.  Power users know regexp
>> well as they are used in many programs so they don't have to
>> learn something new. Power users also like the conciseness
>> of regexp.
>
> As you said, the old way of matching will still be there for
> compatibility reasons. I see your point about power users. The thing is,
> how does one become a power user? As it is now, someone might want to
> contribute rules without having technical knowledge. They would need to
> learn XML, regular expressions and the LT matching logic all at once.
> This is quite a barrier and I guess it's one of the reasons we don't get
> contributions for the several languages that are not maintained. That's
> why I think readable tags and the online editor are important - if
> somebody doesn't contribute at all, they won't ever become a power user.

Very good logic!
  How to encourage new users without insulting them?
  Newbies start here,
  Power users use this?
Totally wrong.
   How about Plain/normal rule generation and  'advanced' rule generation?
I agree both are needed.
regards




-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: readable POS tags

2014-03-26 Thread Daniel Naber
On 2014-03-25 21:59, Dominique Pellé wrote:

> power compared to using regexp.  Power users know regexp
> well as they are used in many programs so they don't have to
> learn something new. Power users also like the conciseness
> of regexp.

As you said, the old way of matching will still be there for 
compatibility reasons. I see your point about power users. The thing is, 
how does one become a power user? As it is now, someone might want to 
contribute rules without having technical knowledge. They would need to 
learn XML, regular expressions and the LT matching logic all at once. 
This is quite a barrier and I guess it's one of the reasons we don't get 
contributions for the several languages that are not maintained. That's 
why I think readable tags and the online editor are important - if 
somebody doesn't contribute at all, they won't ever become a power user.

Regards
  Daniel




Re: readable POS tags

2014-03-25 Thread Dominique Pellé
Daniel Naber wrote:

> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part, however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.
>
> Regards
>   Daniel


I'm not convinced it will simplify things. In fact I suspect it would
make rules more obscure, more verbose and possibly lose
power compared to using regexp.  Power users know regexp
well as they are used in many programs so they don't have to
learn something new. Power users also like the conciseness
of regexp.

Anybody who wants to seriously contribute to LT needs to
understand regexp anyway (not just for POS tags). And
knowing regexp is useful for plenty of other programs.

Good documentation of the POS tags for each language should
suffice in my opinion.

POS tags are also very different in each language, and for good
reasons, as the grammars of the languages can be quite different.

In any case, we need to preserve rule backward compatibility.
I cannot imagine having to change all rules in all languages,
at least not manually. It would be a lot of error-prone work.

On the other hand, rule matching may be faster without regexp.

Regards
Dominique



Re: readable POS tags

2014-03-25 Thread Marcin Miłkowski
On 2014-03-25 13:29, Daniel Naber wrote:
> On 2014-03-25 11:07, Marcin Miłkowski wrote:
>
>> For all I can see, no HashMaps are required at all, just a consistent
>> way of understanding the values in class members.
> So instead of just adding the POS tag we get from Morfologik to our
> AnalyzedToken object as a string, we interpret it and store something
> like pos = preposition, case = accusative. Is that what you mean?

Exactly. We pay the computational price just once, when parsing the 
tag, and given that most tags are pretty nicely structured (except 
Penn!), we could parse them very quickly.

Yet I would still store the string as well: we use it for the 
synthesizer and changing the synthesizer is not as trivial as parsing 
tags (we might, of course, recreate the tag from the keys and values but 
this is just additional computational overhead). Also, some rules may be 
hard to express without regexes on POS tags (we will still need regexes 
in case of a disjunction of several different POS tags, I'm afraid, but 
these regexes will be quite rare).
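
To make the "parse once, but keep the string" idea concrete, here is a minimal Java sketch. It assumes a colon-separated tag such as prep:acc; the class TaggedReading and its methods are hypothetical and do not reflect the actual AnalyzedToken API.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: keep the raw tag next to the parsed part of speech. */
final class TaggedReading {
    final String rawTag; // unchanged string, e.g. for the synthesizer
    final String pos;    // first field of the tag, extracted once

    TaggedReading(String rawTag) {
        this.rawTag = rawTag;
        int colon = rawTag.indexOf(':');
        this.pos = colon < 0 ? rawTag : rawTag.substring(0, colon);
    }

    /** Cheap check on the parsed value; no regex involved. */
    boolean hasPos(String expected) {
        return pos.equals(expected);
    }

    /** Fallback for rules that still need a regex over the whole tag. */
    boolean rawTagMatches(Pattern pattern) {
        return pattern.matcher(rawTag).matches();
    }
}
```

A rule needing a disjunction could then still call rawTagMatches(Pattern.compile("(prep|adj):.*")), while the common case uses hasPos.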

> Then indeed 'mapping' isn't a good term for that. We could make all keys
> and values enums, so we have type safety and don't have to deal with
> strings. Do we need type safety so that you cannot use a Polish-only
> value for German even in Java code? I think that may not be needed.

As long as it's enforced in XML namespaces (which should be doable), we 
should be fine. Developers should know what they're doing. LOL ;)

Regards,
  Marcin





Re: readable POS tags

2014-03-25 Thread Daniel Naber
On 2014-03-25 11:07, Marcin Miłkowski wrote:

> For all I can see, no HashMaps are required at all, just a consistent
> way of understanding the values in class members.

So instead of just adding the POS tag we get from Morfologik to our 
AnalyzedToken object as a string, we interpret it and store something 
like pos = preposition, case = accusative. Is that what you mean? 
Then indeed 'mapping' isn't a good term for that. We could make all keys 
and values enums, so we have type safety and don't have to deal with 
strings. Do we need type safety so that you cannot use a Polish-only 
value for German even in Java code? I think that may not be needed.
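
A minimal Java sketch of the enum idea, under the assumption of a two-field tag like prep:acc. All names here (EnumTagSketch, Pos, GrammaticalCase, the abbreviations) are made up for illustration and are not existing LT classes.

```java
/** Hypothetical sketch: enum-typed values instead of raw strings. */
final class EnumTagSketch {

    enum Pos {
        PREPOSITION("prep"), ADJECTIVE("adj");

        final String abbreviation;

        Pos(String abbreviation) {
            this.abbreviation = abbreviation;
        }

        static Pos fromAbbreviation(String abbreviation) {
            for (Pos value : values()) {
                if (value.abbreviation.equals(abbreviation)) {
                    return value;
                }
            }
            throw new IllegalArgumentException("Unknown POS: " + abbreviation);
        }
    }

    enum GrammaticalCase {
        NOMINATIVE("nom"), ACCUSATIVE("acc");

        final String abbreviation;

        GrammaticalCase(String abbreviation) {
            this.abbreviation = abbreviation;
        }

        static GrammaticalCase fromAbbreviation(String abbreviation) {
            for (GrammaticalCase value : values()) {
                if (value.abbreviation.equals(abbreviation)) {
                    return value;
                }
            }
            throw new IllegalArgumentException("Unknown case: " + abbreviation);
        }
    }

    final Pos pos;
    final GrammaticalCase grammaticalCase;

    private EnumTagSketch(Pos pos, GrammaticalCase grammaticalCase) {
        this.pos = pos;
        this.grammaticalCase = grammaticalCase;
    }

    /** Interprets a two-field tag once; unknown values fail loudly. */
    static EnumTagSketch parse(String tag) {
        String[] fields = tag.split(":");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected pos:case but got: " + tag);
        }
        return new EnumTagSketch(
            Pos.fromAbbreviation(fields[0]),
            GrammaticalCase.fromAbbreviation(fields[1]));
    }
}
```

Note that these enums are shared by all languages in the sketch, so nothing stops Java code from using a Polish-only value for German, which matches the "may not be needed" position above.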

Regards
  Daniel




Re: readable POS tags

2014-03-25 Thread Dave Pawson
On 25 March 2014 08:35, Daniel Naber wrote:
> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part, however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.

A glossary, but only if this document is meant for non-grammarians?
E.g. "instead of DET, use: determiner"  term not understood.


Options for tags
http://en.wikipedia.org/wiki/Part-of-speech_tagging
or
http://www.cis.upenn.edu/~treebank/ is said to be more complete than most.

HTH

-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: readable POS tags

2014-03-25 Thread Marcin Miłkowski
On 2014-03-25 09:35, Daniel Naber wrote:
> Hi,
>
> I've written an overview of how we could use readable POS tags in LT:
>
> http://wiki.languagetool.org/readable-part-of-speech-tags
>
> The core part, however - what these new POS tags actually look like -
> is still missing. Any input on the overview and ideas about that core
> part is welcome.

I think that we should not change the existing POS tags, as there are 
features in some languages that are not found in others. For example, 
Polish verbs have perfective or imperfective aspect, and there are also 
reflexive verbs and partially nonreflexive verbs, special agglutinates 
etc. I don't think it will be easy to find a superset of all possible 
features needed, even by using ISOcat, also because I introduced some 
helper POS tags for rules myself. For this reason, I think that values 
and keys should be configurable per tagset. Also, if someone uses a 
feature for Polish in a grammar file for English, it should be 
automatically disallowed. We can enforce this configurability using XML 
namespaces.

At the same time, I don't think that mapping is ever required. What we 
need is one-time parsing of POS tags into key-value pairs, which 
basically boils down to storing some values in class members. Then a 
standard getter would be enough, and that is really computationally cheap.

Let me give an example. This is a POS tag for a preposition that 
requires accusative:

prep:acc

We would store the following values:

pos = preposition

case = accusative

A slightly more complex problem is that for some tags, we get 
alternative readings (due to syntactic ambiguity), so we might have an 
adjective that shares its form in accusative and nominative. One very 
easy way to deal with this is to use simple binary operations on 
constants (think of cases as bit flags). So a very easy binary operation 
would be enough to check whether the values are there.

For all I can see, no HashMaps are required at all, just a consistent 
way of understanding the values in class members.
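
A minimal Java sketch of the bit-flag idea, just to show how small it can be. It assumes a simplified two-field tag (pos:case) in which alternative cases are joined with a dot, as in adj:nom.acc; real tags have more fields, and the class name, constants and abbreviations below are hypothetical.

```java
/** Hypothetical sketch: cases stored as bit flags in a single int field. */
final class CaseFlags {
    static final int NOMINATIVE = 1;      // binary 001
    static final int ACCUSATIVE = 1 << 1; // binary 010
    static final int GENITIVE = 1 << 2;   // binary 100

    final String pos;
    final int cases; // OR-ed flags; several bits set means an ambiguous form

    private CaseFlags(String pos, int cases) {
        this.pos = pos;
        this.cases = cases;
    }

    /** Parses the tag once; later checks are plain field reads. */
    static CaseFlags parse(String tag) {
        String[] fields = tag.split(":");
        int cases = 0;
        if (fields.length > 1) {
            for (String abbreviation : fields[1].split("\\.")) {
                cases |= flagFor(abbreviation);
            }
        }
        return new CaseFlags(fields[0], cases);
    }

    private static int flagFor(String abbreviation) {
        switch (abbreviation) {
            case "nom": return NOMINATIVE;
            case "acc": return ACCUSATIVE;
            case "gen": return GENITIVE;
            default: throw new IllegalArgumentException("Unknown case: " + abbreviation);
        }
    }

    /** A single bitwise AND tells whether the reading includes the case. */
    boolean hasCase(int flag) {
        return (cases & flag) != 0;
    }
}
```

For example, CaseFlags.parse("adj:nom.acc") sets two bits, so hasCase(NOMINATIVE) and hasCase(ACCUSATIVE) both hold while hasCase(GENITIVE) does not.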

Regards,
Marcin



readable POS tags

2014-03-25 Thread Daniel Naber
Hi,

I've written an overview of how we could use readable POS tags in LT:

http://wiki.languagetool.org/readable-part-of-speech-tags

The core part, however - what these new POS tags actually look like - 
is still missing. Any input on the overview and ideas about that core 
part is welcome.

Regards
  Daniel

