Big in size, etc

2014-01-29 Thread Kumara Bhikkhu

Hope this is not too greedy.

<rule id="BIG_IN_SIZE" name="Big in size, etc">
  <pattern>
    <token postag_regexp="yes" postag="JJ|JJR|JJS"><exception regexp="yes">similar|alone</exception><exception postag="VBN"/></token>
    <token>in</token>
    <token regexp="yes">size|duration|color|colour|number|shape|height|nature|length|weight</token>
  </pattern>
  <message>A more concise phrase may lose no meaning and sound more powerful.</message>
  <suggestion>\1</suggestion>
  <short>Possible redundancy</short>
  <example type="correct">The man is big.</example>
  <example type="correct">His speech was briefest.</example>
  <example type="correct">He's absorbed in thought.</example>
  <example correction="big" type="incorrect">The man is <marker>big in size</marker>.</example>
  <example correction="briefest" type="incorrect">His speech was <marker>briefest in duration</marker>.</example>
  <example correction="redder" type="incorrect">My car is <marker>redder in color</marker>.</example>
  <example correction="few" type="incorrect">Her friends were <marker>few in number</marker>.</example>
</rule>

--
WatchGuard Dimension instantly turns raw network data into actionable 
security intelligence. It gives you real-time visual feedback on key
security issues and trends.  Skip the complicated setup - simply import
a virtual appliance and go from zero to informed in seconds.
http://pubads.g.doubleclick.net/gampad/clk?id=123612991iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


They got it right; they got it wrong.

2014-01-29 Thread Kumara Bhikkhu
False alarm: "They got it right; they got it wrong."

"right" and "wrong" are wrongly flagged.

kb




Re: How to use the lemmatizer

2014-01-29 Thread Richard Eckart de Castilho
Hi,

thanks for the feedback. I'm really only interested in getting convenient 
access to the dictionaries, so that I can use them for lemmatization. For this 
particular task, I'm not using any other functionality from LanguageTool, 
including the grammatical rules.

So here is what I do:

- run a probabilistic POS tagger
- feed the tokens of my text to the languagetool tagger to get all dictionary 
entries
- find a match between the POS tag created by the probabilistic tagger and the 
returned dictionary entries
- if there is a match, use the respective lemma

Matching is the most annoying part, because the tagset used by the 
probabilistic tagger may not be the same as the one used in the LanguageTool 
dictionary. So for now I try three matching approaches:

- checking if the POS tag from the tagger and the one from the dictionary are 
exactly the same
- checking if the POS tag from the tagger is the same as the first element of 
the dictionary tag (splitting by ':')
- using mapping tables to map both the tag from the POS tagger and the tag 
from the dictionary to a coarse-grained scheme of word classes, and seeing if 
they match there
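
The three matching approaches above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code actually in use: the class LemmaSelector, the Entry type, and the contents of the mapping tables are hypothetical, the dictionary readings are passed in as plain data instead of being fetched from the LanguageTool tagger, and the sketch assumes the approaches are tried in the order listed.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch: pick a lemma by matching an external tagger's POS tag against dictionary readings. */
final class LemmaSelector {

    /** One dictionary reading of a token: a lemma and the dictionary's own POS tag. */
    static final class Entry {
        final String lemma;
        final String dictTag;

        Entry(String lemma, String dictTag) {
            this.lemma = lemma;
            this.dictTag = dictTag;
        }
    }

    // Hypothetical mapping tables from each tagset to a shared coarse word class.
    private final Map<String, String> taggerToCoarse;
    private final Map<String, String> dictToCoarse;

    LemmaSelector(Map<String, String> taggerToCoarse, Map<String, String> dictToCoarse) {
        this.taggerToCoarse = taggerToCoarse;
        this.dictToCoarse = dictToCoarse;
    }

    /** Returns the part of a dictionary tag before the first ':' (the whole tag if there is none). */
    static String firstElement(String dictTag) {
        int colon = dictTag.indexOf(':');
        return colon < 0 ? dictTag : dictTag.substring(0, colon);
    }

    /** Tries the three approaches from strictest to loosest and returns the first lemma found. */
    Optional<String> selectLemma(String taggerTag, List<Entry> entries) {
        // Approach 1: tagger tag and dictionary tag are exactly the same.
        for (Entry e : entries) {
            if (e.dictTag.equals(taggerTag)) {
                return Optional.of(e.lemma);
            }
        }
        // Approach 2: tagger tag equals the first ':'-separated element of the dictionary tag.
        for (Entry e : entries) {
            if (firstElement(e.dictTag).equals(taggerTag)) {
                return Optional.of(e.lemma);
            }
        }
        // Approach 3: both tags map to the same coarse word class.
        String coarse = taggerToCoarse.get(taggerTag);
        if (coarse != null) {
            for (Entry e : entries) {
                if (coarse.equals(dictToCoarse.get(e.dictTag))) {
                    return Optional.of(e.lemma);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up tags and readings, purely for demonstration.
        LemmaSelector selector = new LemmaSelector(Map.of("VERB", "V"), Map.of("VBD", "V"));
        List<Entry> readings = List.of(new Entry("walk", "VBD"), new Entry("walk", "NN:sg"));
        System.out.println(selector.selectLemma("NN", readings).orElse("(no match)"));
    }
}
```

Trying the strictest comparison first means a loose coarse-class match is only used when nothing more specific is available.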

Seems to work quite ok.

Cheers,

-- Richard

On 27.01.2014, at 22:38, Marcin Miłkowski list-addr...@wp.pl wrote:

> Hello,
>
> On 2014-01-27 15:44, Richard Eckart de Castilho wrote:
>> Hello everybody,
>>
>> I may be totally wrong, but I believe the lemmatizers in LanguageTool are
>> implemented based on dictionaries. I suppose a dictionary entry would be
>> made up of a form, a lemma, and a POS tag.
>>
>> Assuming this is correct, is there a simple way to do a lookup in such a
>> dictionary?
>>
>> Also, is there a way to find out which tagsets are used by these
>> dictionaries (or maybe there is even some standard in LanguageTool, e.g.
>> verbs are always V and nouns are always N or something like that)?
>>
>> I would like a method that accepts an inflected form and a POS tag and that
>> returns a single lemma.
>>
>> Currently, I am doing this, but it seems a bit awkward.
>>
>> List<AnalyzedTokenReadings> rawTaggedTokens =
>>     lang.getTagger().tag(tokenText);
>> AnalyzedSentence as = new AnalyzedSentence(
>>     rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
>> as = lang.getDisambiguator().disambiguate(as);
>> String best = getMostFrequentLemma(as.getTokens()[i]);
>>
>> In particular, I would like to use a different POS tagger. I have various
>> statistical POS taggers at my disposal that produce a single POS per token -
>> and that is what I want. The LanguageTool POS tagger produces multiple
>> unranked POS tags per token.
>
> Beware that statistical POS taggers will necessarily obfuscate
> non-grammatical material, as they try to guess the correct tags. This
> makes them quite useless for writing rules. We've been there, tried
> that. I haven't yet found a decent English POS tagger, for example, that
> would be useful.
>
> Note however that if you have frequency info, you can add it to your
> tagger dictionary. And we indeed can do so using typing frequency lists,
> so you'd be able to assign the most frequent lemma if you need, I guess.
> The procedure is described here:
>
> http://wiki.languagetool.org/hunspell-support
>
> See under "including frequency data".
>
> Regards,
> Marcin




GSoC 2014

2014-01-29 Thread Daniel Naber
Hi,

there's a Google Summer of Code again this year. If we decide to apply 
as a mentoring organization, the most important thing is to update this 
page:

http://wiki.languagetool.org/missing-features

Please update it anyway and add your ideas, or just post them here if you 
don't have a Wikidot account. So should we apply?

The GSoC timeline is at 
http://www.google-melange.com/gsoc/events/google/gsoc2014.

Regards
  Daniel




Re: They got it right; they got it wrong.

2014-01-29 Thread Marcin Miłkowski
On 2014-01-29 10:50, Kumara Bhikkhu wrote:
> False alarm: "They got it right; they got it wrong."
>
> "right" and "wrong" are wrongly flagged.

Yes, I fixed it some time ago. I made really serious changes in 
disambiguation so please use the daily builds:

https://www.languagetool.org/download/snapshots/?C=M;O=D

Best,
Marcin



Re: Try this

2014-01-29 Thread Marcin Miłkowski
On 2014-01-29 03:56, Kumara Bhikkhu wrote:
> Marcin Miłkowski wrote thus at 08:06 PM 28-01-14:
>> Try:
>> <token inflected="yes">conduct</token>
>> instead.
>
> Thanks for the tip. I was wondering about it.
>
>> I tried your rule using our rule editor:
>> http://community.languagetool.org/ruleEditor/expert
>> and we have lots of matches, but some of them are quite useless, for
>> example:
>>
>> Mozart's Davide penitente (1785), his Piano Concerto KV 482 (1785), the
>> Clarinet Quintet (1789) and the 40th Symphony (1788) had been premiered
>> on the suggestion of Salieri, who supposedly conducted a performance of
>> it in 1791. (wikipedia)
>
> It's relevant actually. It can be revised as "who
> supposedly performed it in 1791". Unnecessary
> nominalisation is so common these days that most
> people put up with it to the point of ignoring
> it.

I'm afraid you're wrong. Salieri couldn't perform the 40th Symphony by 
himself. He was the conductor of the performance, so the only correct way 
to say this is that he conducted a performance of the symphony, rather 
than that he performed it. It makes no sense to think that a single 
person performs a symphony! ;)

> Students often use it to make their writing
> sound more pompous, which is counterproductive.

I agree, this is pompous in many cases.



>> You can use the following queries on corpus.byu.edu corpora:
>> conduct a|an|the|no [n*] of|into
>> conduct a|an|the|no [n*] * of|into
>> conduct a|an|the|no [n*] * * of|into
>> conduct a|an|the|no [n*] * * * of|into
>
> Yes, I'm using that (minus "the", which suggests
> that the noun refers to a specific thing
> mentioned earlier, thus the nominalisation is necessary).

Yeah, you're right. As you can see there, we should also add relevant 
exceptions to your rule ('deal' - from "a great deal", 'range', etc.). 
So could you please look at the corpora results again?

Best,
Marcin





Re: Big in size, etc

2014-01-29 Thread Marcin Miłkowski
On 2014-01-29 10:35, Kumara Bhikkhu wrote:
> Hope this is not too greedy.
>
> <rule id="BIG_IN_SIZE" name="Big in size, etc">
>   <pattern>
>     <token postag_regexp="yes" postag="JJ|JJR|JJS"><exception regexp="yes">similar|alone</exception><exception postag="VBN"/></token>
>     <token>in</token>
>     <token regexp="yes">size|duration|color|colour|number|shape|height|nature|length|weight</token>
>   </pattern>
>   <message>A more concise phrase may lose no meaning and sound more powerful.</message>
>   <suggestion>\1</suggestion>
>   <short>Possible redundancy</short>
>   <example type="correct">The man is big.</example>
>   <example type="correct">His speech was briefest.</example>
>   <example type="correct">He's absorbed in thought.</example>
>   <example correction="big" type="incorrect">The man is <marker>big in size</marker>.</example>
>   <example correction="briefest" type="incorrect">His speech was <marker>briefest in duration</marker>.</example>
>   <example correction="redder" type="incorrect">My car is <marker>redder in color</marker>.</example>
>   <example correction="few" type="incorrect">Her friends were <marker>few in number</marker>.</example>
> </rule>

Thanks!

Yes, this is much better but we also match:

rich in color

few in number

possible in Nature,

corresponding in number

religious in nature

efficient in shape

autocratic in nature,

equal in size

beautiful in color,

next in size

Out of 16 matches on the Brown corpus, the useful ones were "brown in 
color", "tawny in color", and "two in number". I added the adjectives above 
as additional exceptions and we'll see how it performs. I'm not sure 
about 'few'. We'll probably see more matches in the nightly diff today.
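
To make the effect of such exceptions concrete, here is a rough stand-alone sketch of the three-token pattern in plain Java. It does not use LanguageTool at all: the Tok class, the hand-assigned POS tags, and the contents of EXCEPTIONS are assumptions for illustration only. Which adjectives were actually added to the rule is not stated here, so the set below ("similar" and "alone" from the rule plus a few of the adjectives listed above) is a guess. The VBN exception of the real rule is not modelled, because each token in the sketch carries a single tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of an "adjective + in + size/color/..." matcher with an adjective exception list. */
final class BigInSizeSketch {

    /** A token with a hand-assigned POS tag (a real setup would get tags from a tagger). */
    static final class Tok {
        final String word;
        final String tag;

        Tok(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    static final Set<String> ADJECTIVE_TAGS = Set.of("JJ", "JJR", "JJS");

    static final Set<String> NOUNS = Set.of("size", "duration", "color", "colour",
            "number", "shape", "height", "nature", "length", "weight");

    // Assumed exception list; the adjectives really added to the rule may differ.
    static final Set<String> EXCEPTIONS = Set.of("similar", "alone", "rich", "possible",
            "corresponding", "efficient", "equal", "beautiful", "next");

    /** Returns every "ADJ in NOUN" phrase in the sentence whose adjective is not an exception. */
    static List<String> findMatches(List<Tok> sentence) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i + 2 < sentence.size(); i++) {
            Tok adj = sentence.get(i);
            Tok in = sentence.get(i + 1);
            Tok noun = sentence.get(i + 2);
            boolean adjectiveOk = ADJECTIVE_TAGS.contains(adj.tag)
                    && !EXCEPTIONS.contains(adj.word.toLowerCase(Locale.ROOT));
            // Words are compared case-insensitively, so "possible in Nature" would also be seen.
            if (adjectiveOk
                    && in.word.equalsIgnoreCase("in")
                    && NOUNS.contains(noun.word.toLowerCase(Locale.ROOT))) {
                matches.add(adj.word + " in " + noun.word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Tok> sentence = List.of(new Tok("It", "PRP"), new Tok("is", "VBZ"),
                new Tok("brown", "JJ"), new Tok("in", "IN"), new Tok("color", "NN"));
        System.out.println(findMatches(sentence));
    }
}
```

Each adjective added to the exception set removes one family of false alarms, at the cost of a longer list to maintain.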

Regards,
Marcin



homepage usability test

2014-01-29 Thread Daniel Naber
Hi,

I ran a usability test on our homepage using http://rapidusertests.com. 
They let users use the website and record that as a video. It's in 
German, but if you're interested in the results, let me know and I'll 
send you a link.

Users' complaints/remarks were:
-lack of an About page
-errors are lost when switching the full-screen mode and back
-you need to re-check the text after correcting an error manually
-sometimes they don't get that the demo text is a demo and thus contains 
errors on purpose

Regards
  Daniel





Re: Big in size, etc

2014-01-29 Thread Kumara Bhikkhu

Of what you listed, these are rightly flagged:
   * few in number
   * religious in nature
   * autocratic in nature,
Removing the 2nd and 3rd tokens wouldn't cause any loss of meaning.

kb



Re: homepage usability test

2014-01-29 Thread Kumara Bhikkhu
Kumara Bhikkhu wrote thus at 12:52 PM 30-01-14:
Daniel Naber wrote thus at 06:34 AM 30-01-14:
-sometimes they don't get that the demo text is a demo and thus contains
errors on purpose

I too had the same experience, though not for the demo text on the 
old homepage.

I think revising the text with
 Check this text to see an
 Or paste your own text here.
would minimise the possibility.

kb 

