Daniel Naber <list2...@danielnaber.de> wrote:

> Hi,
>
> we have three languages with grammar files that are more
> than 1 MB large (German, French, Catalan). The German
> grammar.xml has more than 24,000 lines. This size makes
> editing the files difficult. I have some ideas on how
> to improve the situation and I'm looking for other ideas
> and comments:
>
> Step 1 - the easy one
>
> We can make the syntax a bit more compact and readable
> by changing some elements:
>
> <marker> => <m>
> <suggestion> => <s>
> <example type="correct"> => <right>
> <example type="incorrect"> => <wrong>

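To see how mechanical Step 1 is, here is a rough sketch in Python
(not part of LanguageTool; the shorten function and the sample rule
are made up for this illustration). It applies the four renames with
the standard xml.etree.ElementTree module and prints the size before
and after. ElementTree drops XML comments by default, so this is
only meant for measuring, not for converting real grammar files.

```python
import xml.etree.ElementTree as ET

# The four Step 1 renames from the quoted proposal.
SIMPLE_RENAMES = {"marker": "m", "suggestion": "s"}
EXAMPLE_RENAMES = {"correct": "right", "incorrect": "wrong"}


def shorten(xml_text: str) -> str:
    """Return xml_text with the Step 1 element renames applied."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in SIMPLE_RENAMES:
            elem.tag = SIMPLE_RENAMES[elem.tag]
        elif elem.tag == "example" and elem.get("type") in EXAMPLE_RENAMES:
            # The type attribute is redundant once it is part of the tag name.
            elem.tag = EXAMPLE_RENAMES[elem.attrib.pop("type")]
    return ET.tostring(root, encoding="unicode")


# Hypothetical rule, written only for this illustration.
RULE = (
    "<rule><pattern><marker><token>myerror</token></marker></pattern>"
    "<message>Did you mean <suggestion>my error</suggestion>?</message>"
    '<example type="incorrect">A <marker>myerror</marker> here.</example>'
    '<example type="correct">A my error here.</example></rule>'
)

short = shorten(RULE)
print(short)
print(len(RULE), "->", len(short), "characters")
```
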
Step 1 will only marginally reduce size. But shorter tags add less
noise, so the rules are clearer in my opinion. <m> and <s> may look
less readable than <marker> and <suggestion>, but since rule
developers use them all the time, they would quickly become
familiar with them.

With short XML tags, it may even become more readable to write
several tokens on one line:

  <pattern>
    <token>foo</token>
    <token>bar</token>
  </pattern>

... can become...

  <p> <t>foo</t> <t>bar</t> </p>

Maybe that's better. However, tokens often have attributes like
inflected="yes" or regexp="yes", so abbreviating <token> into <t>
does not save much in that case. Maybe we could have
something like...

  <t>   == <token>
  <tr>  == <token regexp="yes">
  <ti>  == <token inflected="yes">
  <tir> == <token inflected="yes" regexp="yes">

... but I don't like it. What about other attributes such as
skip="..." or postag="...", for example?

> Step 2 - less repetition (also easy to implement)
>
> The contents of <message>, <url>, and <short> should be
> inherited from a <rulegroup> element to its <rule> elements.
> This way those elements do not need to be repeated if
> the are the same for all rules of a rulegroup.

Step 2 will also have a marginal impact, I think. But it will
also make rules less clear, since the reader may have to jump
many lines up when trying to understand a rule. I don't like it
so much.

> Step 3 - an XML-free pattern
>
> Add a compact way to describe simple patterns. This is best explained by
> example. What is now this:
>
> <pattern>
>   <token regexp="yes">foo|bar</token>
>   <marker>
>     <token>myerror</token>
>   </marker>
> </pattern>
>
> ...could be written like this:
>
> <p>re:foo|bar _myerror_</p>
>
> Thus you don't need "<token>" at all as a whitespace implies a token
> boundary. The prefix "re:" turns on regular expression matching (the same
> for "pos:" -> POS tag, "pos:re:" -> POS tag regex). "<marker>" is replaced
> by underscores.
> This does not support exceptions and other advanced
> features, but it turns a 6-line rule into a 1-line rule. This new syntax is
> optional, i.e. the old one can still be used.

It looks like a profound change, which is hard to assess from
a short description. It's worth looking at how Lightproof
defines rules in a much shorter way than LT XML rules. Rules
with multiple tokens in Lightproof are often a single line.

But I don't think that reducing the size of XML rules is,
on its own, a valid reason for such a profound change. If both
the XML syntax and a new syntax are supported at the same time,
it would be messy and harder to understand for newcomers.
Better to have one way to do something.

> What do you think about that? Other suggestions for making rule syntax more
> compact?

How about extending the XML syntax in order to allow writing
the same checkers with fewer rules? I proposed one idea a while
ago that would help in that direction, which was to allow
multiple substitutions. It was discussed here:

  http://sourceforge.net/mailarchive/message.php?msg_id=2823573

Multiple substitutions could reduce the number of rules in
several cases.

There was also the idea of introducing <or>...</or> (a bit
similar to the already existing <and>...</and> tags), which in
some cases can help reduce the number of rules, though
admittedly not by much. This is useful when one token must
match either by string *OR* by POS tag. With the current XML
syntax, the only way to achieve that is to have multiple rules.

> Indeed my editor cannot properly edit the files anymore.
> But the redundancy also makes rules harder to read and to write

A text editor that cannot handle a 1 MB XML file is not a
real text editor :-) Time to look at other editors?
I have no speed problem editing > 1 MB XML files with Vim.
The full French rules/fr/rules.xml (1.3 MB) fits on a single
screen when all categories are folded, which gives a nice
quick overview of the structure of the entire file, as shown
here:

  http://dominique.pelle.free.fr/pic/grammar-folded-vim.png

Regards
-- Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
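
P.S. To make the Step 3 proposal a bit easier to assess, here is a
minimal sketch (in Python, chosen only for illustration) of how such
a compact pattern could be expanded into the current XML. This is
not LanguageTool code. The names expand_pattern and token_xml are
made up, letting a marker span several tokens ("_foo bar_") is my
assumption, and mapping "pos:re:" to postag_regexp="yes" is my guess
at the intended attribute. Exceptions, skip and the other advanced
features are not handled, as in the proposal.

```python
from xml.sax.saxutils import escape, quoteattr


def token_xml(spec: str) -> str:
    """Turn one whitespace-separated item into a <token> element."""
    if spec.startswith("pos:re:"):
        # Assumption: POS tag regex maps to postag + postag_regexp="yes".
        return '<token postag=%s postag_regexp="yes"/>' % quoteattr(spec[7:])
    if spec.startswith("pos:"):
        return "<token postag=%s/>" % quoteattr(spec[4:])
    if spec.startswith("re:"):
        return '<token regexp="yes">%s</token>' % escape(spec[3:])
    return "<token>%s</token>" % escape(spec)


def expand_pattern(compact: str) -> str:
    """Expand e.g. 're:foo|bar _myerror_' into a <pattern> element."""
    lines = ["<pattern>"]
    in_marker = False
    for word in compact.split():
        # A leading underscore opens the marker.
        if not in_marker and word.startswith("_"):
            word = word[1:]
            lines.append("  <marker>")
            in_marker = True
        # A trailing underscore closes it (possibly on the same word).
        closes = in_marker and word.endswith("_")
        if closes:
            word = word[:-1]
        indent = "    " if in_marker else "  "
        lines.append(indent + token_xml(word))
        if closes:
            lines.append("  </marker>")
            in_marker = False
    if in_marker:
        raise ValueError("marker opened with '_' but never closed")
    lines.append("</pattern>")
    return "\n".join(lines)


print(expand_pattern("re:foo|bar _myerror_"))
```

For the example from the proposal this prints the same 6-line
pattern that was quoted above. Even this toy version already has to
decide what an underscore inside a token means, which is the kind of
detail a short description does not settle.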