On 2014-03-23 21:30, Dave Pawson wrote:
>> Right. Remember, however, that integrating corrections will not be
>> trivial then. What I mean is that LT displays the position of the
>> mistake (also in its XML output) which can be used to highlight the
>> error. If you remove any content with a stylesheet, then the initial
>> position may be skewed, and highlights will show in random places
>> because LT won't see the markup. This is why a stylesheet is not really
>> the way to write an AnnotatedText parser for us. We rather need to parse
>> docbook with some special Java code, which might be simple anyway.
>
> Agreed. But as an example I have a 500Kword document, one main file
> and 40 xincluded files. So line numbers in the original are 'wrong' in most
> error reports?

No, they should not be wrong, but our XML cleaning code is very old, so 
I cannot be sure.

Anyway, xincluded files won't be checked in text mode, of course.

>      For syntax errors I normally note the text then grep in the files to
> find the original source of the error?

This isn't the most user-friendly way of doing things, right? I think we 
should get a plugin for your docbook editing software, whatever that is 
(we already have a plugin for vim, if that's your choice). If this is a 
plain XML editor in Java, it should be fairly easy to do.

Alternatively, we could add some special error-related elements to 
docbook itself, which is a bit trickier (we'd need to combine our XML 
output with the docbook file in a special stylesheet), but if the 
format supports that kind of additional markup, it might turn out to be 
easy. I don't know whether docbook has this kind of annotation; XLIFF 
definitely does.

For both scenarios we need to retain proper error locations, so we 
should not use XSLT for the conversion, unless the XSLT produces an 
intermediate format with the original positions hard-coded as 
attributes, which we would then read with a Java parser. The advantage 
of that approach would be that we could easily write an XSLT for just 
about any XML format instead of creating separate Java parsers.
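
To make this more concrete, here is a very rough sketch of the kind of
Java parser I mean (it assumes our AnnotatedTextBuilder API from
org.languagetool.markup; entity and CDATA handling are left out for
brevity). The only point is that tags are added as markup rather than
removed, so the original character offsets survive:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.languagetool.markup.AnnotatedText;
import org.languagetool.markup.AnnotatedTextBuilder;

public class DocBookAnnotatedText {

  // Split raw XML into tag spans and text spans so that the original
  // character offsets are preserved for error reporting.
  public static AnnotatedText fromFile(String path) throws Exception {
    String xml = new String(Files.readAllBytes(Paths.get(path)), "UTF-8");
    AnnotatedTextBuilder builder = new AnnotatedTextBuilder();
    int pos = 0;
    while (pos < xml.length()) {
      int tagStart = xml.indexOf('<', pos);
      if (tagStart < 0) {                  // only text left
        builder.addText(xml.substring(pos));
        break;
      }
      if (tagStart > pos) {                // text before the next tag
        builder.addText(xml.substring(pos, tagStart));
      }
      int tagEnd = xml.indexOf('>', tagStart);
      builder.addMarkup(xml.substring(tagStart, tagEnd + 1));  // the tag
      pos = tagEnd + 1;
    }
    return builder.build();
  }
}

The resulting AnnotatedText could then go through the normal check, and
the positions in the rule matches would point back into the original
docbook file.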

>
>>
>> Right. This is just because the tag is split with an end-of-line marker.
>> You're apparently using -b parameter which breaks at a single
>> end-of-line marker, but this is wrong for your files.
>
> ?? I don't think I am using -b (I am not on my main machine, I will check).
> Does the rule 'reset' at end of line? That sounds wrong for plain text?

It depends on what your plain text file looks like. Some use two 
end-of-line markers to mark the end of a paragraph, some only one. We 
have these two settings.
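
Just to illustrate the difference (this is not our actual segmentation
code, only a toy example):

public class EolDemo {
  public static void main(String[] args) {
    String text = "A sentence wrapped\nacross two lines.\n\nNext paragraph.";

    // -b behaviour: every single end-of-line marker ends a paragraph,
    // so the wrapped sentence is cut into two pieces.
    String[] singleEol = text.split("\n");

    // default behaviour: a paragraph ends only at a blank line
    // (two markers), so the wrapped sentence stays in one piece.
    String[] doubleEol = text.split("\n\n");

    System.out.println(singleEol.length + " vs. " + doubleEol.length + " chunks");
  }
}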

However, for XML input it may be that end-of-line markers should be 
completely ignored during text segmentation. Actually, we almost could 
ignore them, as the text is segmented independently of the rules. But 
frankly, I don't know whether EOLs have any significance in docbook or 
not. They don't have any in xhtml...

I say we "almost could" because there is some extra code that we run 
for end-of-line markers, and we could simply skip it; however, adding 
that option to the command line (and other places) won't be possible 
until the next release, because we're in the feature freeze period now.

Regards,
  Marcin

