On 2014-03-24 18:14, Dave Pawson wrote:
> On 24 March 2014 16:29, Marcin Miłkowski <list-addr...@wp.pl> wrote:
>
>> Well, there is a (partially broken) emacs plugin:
>>
>> http://www.emacswiki.org/emacs/langtool.el
>>
>> I'm not really into emacs lisp, so I wasn't able to make it run
>> flawlessly but you might want to use it. Should be easier than grep, as
>> this parses LT output directly.
>
> No thanks, I author XML in emacs, never process it.

I thought it might be helpful. This does *not* process XML or anything: 
it just finds the next error location and displays a message. At least 
this is what it did for me after I managed to run it.

>
>
>>>> For both use scenarios, we need to retain proper error locations, so we
>>>> should not use XSLT for conversion unless XSLT creates an intermediary
>>>> format with positions hard-coded as attributes, and then we would have a
>>>> Java parser for the intermediary format. This might have an advantage of
>>>> being able to write up an XSLT for just any XML format easily instead of
>>>> creating separate Java parsers.
>>>
>>> I'm -1 on that. XML is white space agnostic (one of its benefits for me)
>>> so line numbers have less meaning?
>>
>> Error locations are not only line numbers but also column numbers. This
>> really helps software to underline errors as you type.
>
> I want to see it, not intermediate software. XML is line / ws agnostic
> so it is of little help really?

We're out of sync. The location is used instead of grep to highlight 
error positions in the file. Note that you might have the same sequence 
of words in your file eight times, but only one occurrence might be 
incorrect. LT would highlight only that one, while grep would find all 
eight, so this saves time and avoids confusion. LT cares about the 
context-dependence of errors, so this is not a science-fiction scenario. 
For example, "that that" may be an error in English, but it also might 
be completely fine in context. LT tries to suppress false positives for 
this.

So whatever your format, location *is* important to avoid confusion and 
save the user's time.
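To make this concrete, here is a tiny sketch (a hypothetical helper, not 
code from LT or from any plugin) of how an editor integration would turn 
a character offset into the line/column pair that LT reports, so it can 
highlight exactly one occurrence:

```java
// Sketch: map a character offset in a file's text back to a 1-based
// line/column pair, the way an editor plugin would place a highlight.
// Hypothetical illustration only, not LanguageTool's actual code.
public class Offsets {

    static int[] lineColumn(String text, int offset) {
        int line = 1, col = 1;
        for (int i = 0; i < offset && i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
                col = 1;
            } else {
                col++;
            }
        }
        return new int[] { line, col };
    }

    public static void main(String[] args) {
        String text = "first line\nthat that is here\n";
        // Highlight the one flagged occurrence, not every grep hit.
        int[] pos = lineColumn(text, text.indexOf("that that"));
        System.out.println(pos[0] + ":" + pos[1]); // prints 2:1
    }
}
```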

>
>
>>
>>>
>>> Processing 'any' XML (to me) would be advantageous. Here the
>>> requirement would be simply to skip over elements/attributes (and
>>> comments, PI's perhaps?).
>>> then simply switch off the white space rule, since it is not
>>> applicable? Ditto the smart quote
>>> rule?
>>
>> Smart quote rule is fine if your output is for printing purposes. It
>> just depends on the language.
>
> XML is v.rarely used for presentation, without transformation first,
> so smart quotes are of little / no use.

I wrote some of my logic slides in XHTML, so...

>
>
>>>>> ?? I don't think I am using -b (I am not on my main machine, I will 
>>>>> check).
>>>>> Does the rule 'reset' at end of line? That sounds wrong for plain text?
>>>>
>>>> It depends on how your plain text file looks. Some use two end of line
>>>> markers for the end of paragraph, some only one. We have these two 
>>>> settings.
>>>
>>> I  think we are out of sync here? I am currently processing the XML file
>>> without stripping markup.
>>
>> You talk about plain text, I reply about plain text. Not about XML. For
>> plain text, there are reasons to look at end of line markers.
>
> Agreed. I have not, as yet, produced plain text from docbook XML,
> hence all my comments refer to processing XML.
>
>>
>>
>>>     Checking, I am not using the -b parameter.
>>> by shell script is
>>>
>>> #!/bin/bash
>>> langtools=/apps/langtools
>>> disRules="WHITESPACE_RULE"
>>> java -jar ${langtools}/languagetool-commandline.jar --language EN-GB \
>>>   -c utf-8 --disable $disRules "$@"
>>
>> I could not reproduce the error you mention without -b, but again, maybe
>> you have two EOLs in your file.
>
> I have lots of \n in the file, none of which are relevant?

Yep, for this format we should discard \n altogether during sentence 
segmentation. We can handle it perfectly well inside a sentence.
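As an illustration of what "discard \n altogether" would mean for 
XML-derived text (my assumption of the preprocessing step, not LT's 
actual pipeline): collapse every run of whitespace, including newlines, 
into a single space before segmentation, so a line break can never be 
mistaken for a paragraph break:

```java
// Sketch (assumed preprocessing, not LanguageTool's actual code):
// collapse all whitespace runs -- spaces, tabs, and \n -- to a single
// space, so line wrapping in the source never ends a sentence/paragraph.
public class Collapse {

    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String xmlText = "A sentence\nwrapped over\nthree lines.";
        System.out.println(collapseWhitespace(xmlText));
        // prints: A sentence wrapped over three lines.
    }
}
```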

>
>
>>
>>>
>>>
>>>>
>>>> However, for XML input it may be the case that end of line markers
>>>> should be completely ignored during text segmentation. Actually, we
>>>> almost could ignore these as the text is segmented independently from
>>>> the rules. But I frankly don't know whether EOLs have any use in docbook
>>>> or not. They don't have any in xhtml...
>>>
>>> No, whitespace is (mainly) ignored in XML, nl,TAB, sp etc.
>>
>> Unless of course we have xml:space="preserve".
>
>
> That's the 'mainly' caveat <grin/>
>
>>
>>>
>>>>
>>>> I say we "almost could" because there's code that we additionally run
>>>> for end of lines, and we could simply skip it, but only in the next
>>>> release it's possible to add the option to the command-line (and other
>>>> places) because we're in the feature freeze period now.
>>>
>>> Understood. If I can help please shout.
>>
>> After the release, I'll add the option to suppress EOL segmentation
>> altogether.
>
>
> Thanks, that would be a help.

I think we're converging on a clear spec for a sane generic XML filter:

- discard \n during segmentation, unless inside xml:space="preserve";
- expand the standard entities (&amp;, &lt;, &gt;, &quot;, &apos;);
- maybe disable the whitespace rule by default.
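The first two points could be sketched with a plain SAX handler (this is 
my rough assumption of an implementation, not an existing LT class; a 
real filter would also have to track positions and handle whitespace 
split across callbacks):

```java
// Rough sketch of the generic XML filter discussed above (assumed
// implementation, not LanguageTool code): extract character data,
// collapsing whitespace unless inside xml:space="preserve". SAX
// expands the standard entities (&amp; etc.) for us.
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextFilter extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    // One flag per open element: are we inside xml:space="preserve"?
    private final Deque<Boolean> preserve = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        String sp = atts.getValue("xml:space");
        boolean inherited = !preserve.isEmpty() && preserve.peek();
        preserve.push(sp != null ? "preserve".equals(sp) : inherited);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        preserve.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        boolean keep = !preserve.isEmpty() && preserve.peek();
        // Simplistic: collapses per callback; SAX may split text runs.
        out.append(keep ? text : text.replaceAll("\\s+", " "));
    }

    public static String extract(String xml) throws Exception {
        XmlTextFilter handler = new XmlTextFilter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<p>one\ntwo &amp; three</p>"));
        // prints: one two & three
    }
}
```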

regards
m.

>
> regards
>
>
>


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
