On 2014-03-23 21:30, Dave Pawson wrote:
>> Right. Remember, however, that integrating corrections will not be
>> trivial then. What I mean is that LT displays the position of the
>> mistake (also in its XML output) which can be used to highlight the
>> error. If you remove any content with a stylesheet, then the initial
>> position may be skewed, and highlights will show in random places
>> because LT won't see the markup. This is why a stylesheet is not really
>> the way to write an AnnotatedText parser for us. We rather need to parse
>> docbook with some special Java code, which might be simple anyway.
>
> Agreed. But as an example I have a 500Kword document, one main file
> and 40 xincluded files. So line numbers in the original are 'wrong' in most
> error reports?

No, they should not be wrong, but our XML cleaning code is very old, so I cannot be sure. Anyway, xincluded files won't be checked in text mode, of course.

> For syntax errors I normally note the text then grep in the files to
> find the original source of the error?

This isn't the most user-friendly way of doing things, right? I think we should get a plugin for your docbook editing software, whatever that is (we already have a plugin for vim, if that's your choice). If it is a plain XML editor written in Java, this should be fairly easy to do.

Alternatively, we can add some special error-related elements to docbook itself, which is a bit more tricky (we'd need to use our XML output together with the docbook file in a special stylesheet), but if the format supports this kind of additional markup, then it might be easy. I don't know if docbook has this kind of annotation; XLIFF definitely does.

For both use scenarios, we need to retain proper error locations, so we should not use XSLT for conversion unless the XSLT creates an intermediate format with positions hard-coded as attributes, and then we would have a Java parser for the intermediate format. This might have the advantage that we could easily write an XSLT for just about any XML format instead of creating separate Java parsers.

>> Right. This is just because the tag is split with an end-of-line marker.
>> You're apparently using -b parameter which breaks at a single
>> end-of-line marker, but this is wrong for your files.
>
> ?? I don't think I am using -b (I am not on my main machine, I will check).
> Does the rule 'reset' at end of line? That sounds wrong for plain text?

It depends on how your plain text file looks. Some use two end-of-line markers for the end of a paragraph, some only one. We have these two settings. However, for XML input it may be the case that end-of-line markers should be completely ignored during text segmentation.
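
To make the two settings concrete, here is a rough Python sketch of the difference. This is only an illustration and not LT's actual code; the function name `split_paragraphs` and its `single_break` parameter are made up for this example.

```python
import re


def split_paragraphs(text, single_break=False):
    """Split plain text into paragraphs (illustrative only, not LT code).

    single_break=True:  every end-of-line marker ends a paragraph.
    single_break=False: only a blank line (two or more markers) ends a
                        paragraph; a lone marker is treated as a space.
    """
    text = text.replace("\r\n", "\n")  # normalize Windows line endings
    if single_break:
        parts = text.split("\n")
    else:
        parts = [p.replace("\n", " ") for p in re.split(r"\n{2,}", text)]
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


sample = "A tag split\nover two lines.\n\nSecond paragraph."
print(split_paragraphs(sample, single_break=True))
print(split_paragraphs(sample, single_break=False))
```

With the single-marker setting, the hard-wrapped first sentence is cut into two separate paragraphs, which is the kind of split that makes a rule misfire on a tag broken across lines. With the two-marker setting it stays in one piece.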

Actually, we almost could ignore end-of-line markers, as the text is segmented independently of the rules. But I frankly don't know whether EOLs have any use in docbook or not. They don't have any in xhtml...

I say we "almost could" because there's code that we additionally run for end-of-line markers, and we could simply skip it. However, an option for this can only be added to the command line (and other places) in the next release, because we're in the feature freeze period now.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel