trentb...@gmail.com (Trent W. Buck) writes:

>>> I really do think that the "darcs replace" in its current form is too
>>> dangerous to use, because absolutely nothing has the exact lexical
>>> structure assumed by it.
Now that I've thrown down the gauntlet, let me briefly outline my ideas
of how a "real" darcs replace would work.  That is, a semantic
replacement operation that can be commuted with other replace and hunk
patches, producing expected/intuitive results.

This proposal assumes we have infinite developer and CPU hours.  Here
in the real world, I don't expect we'll EVER have the resources to
implement it.

Probably the best reference for this form of patch is the XSLT and
XPath standards, which provide similar (and more powerful)
functionality, but only for XML.

It has to understand encodings.
===============================

It's not acceptable to assume everything is a byte string, because a
replace patch might be created in a JIS locale, be applied in a UTF-16
locale, and be operating on a UTF-8 file.  I can't even attempt to
reason about such a scenario if the source file and the old and new
tokens are all considered unencoded byte strings.

Implications:

- The encoding can be guessed from the locale.

- The user needs to be able to override this guess on the CLI.

- The patch file will need to record the patch's assumed encoding.

- Darcs will need a list of known encodings.  This can probably be
  farmed off to libiconv.

- Something sensible should be done when applying the patch to a file
  that contains an invalid byte sequence (for that encoding).  For
  example, if the patch expects ASCII, what should it do when it
  encounters a byte with the high bit set?

- Something sensible should be done when a patch uses an encoding that
  the running Darcs doesn't know about.  Suppose Darcs 3.0 only knows
  about UTF-8 and ASCII, and is asked to apply a patch, created by
  Darcs 3.1, to a file encoded in ISO 8859-7?

(A toy sketch of how this decoding step might fail loudly appears
after the next section.)

It has to understand lexing.
============================

I gave examples upthread of how treating files as ATOMs separated by
folding WHITESPACE is simply unacceptable even for relatively trivial
file formats like sexprs, mexprs, or CSV.  I'll tentatively propose
XML's default serialization format as an example of an "easy" format
to support.

Implications:

- Guessing the file format can be farmed off to libmagic.

- The user needs to be able to override the guess on the CLI.

- The patch file will need to record the patch's assumed file format.

- Darcs will need a list of lexers.  This might be farmed off
  piecemeal (e.g. libexpat for XML).

- The lexical format of the file needs to be widely agreed upon.  For
  example, you could never have a generic "Scheme" file format,
  because is this a COMMENT, or an OCTOTHORPE, COMMENT and an ATOM?

      #; FNORD

  Contrariwise, a specific Scheme lexer (e.g. R6RS) might be included;
  but what are the implications if your file is intended for a
  compiler/interpreter that implements extensions to that file format?
  We are back to the problem of unexpected results after commutation.

- Similarly, if the file format's standard requires read-time
  evaluation (e.g. CL), it might be impossible to lex in finite time.

- What happens if the file is lexically invalid (e.g. an unclosed
  brace)?  It would be awesome if a patch that introduces lexical
  invalidity were "clumped" with a subsequent patch that makes the
  file valid again, such that the replace patch can commute before or
  after the clump, but not within it.

- How should COMMENT lexemes be handled?  A typical lexer library will
  discard them before Darcs sees the token stream, so probably you
  wouldn't be able to replace text within comments.
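To make the encoding and lexing requirements concrete, here is a
minimal Haskell sketch.  None of it is darcs code: the Replace record,
the Token type, and the function names are all hypothetical, only
UTF-8 is wired up, and the lexer itself is left abstract, since
choosing one is exactly the hard part discussed above.

    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8')

    -- A replace patch that records the encoding it was created under
    -- and speaks in tokens, not raw bytes.
    data Replace = Replace
      { encodingName   :: String   -- e.g. "UTF-8", stored in the patch
      , oldTok, newTok :: T.Text
      }

    data Token = Atom T.Text | Comment T.Text | Punct Char

    -- Step 1: decode with the patch's assumed encoding.  An invalid
    -- byte sequence is a hard, reportable failure, not silent
    -- garbage; so is an encoding this darcs doesn't know about.
    decodeWith :: Replace -> BS.ByteString -> Either String T.Text
    decodeWith p bytes = case encodingName p of
      "UTF-8" -> either (Left . show) Right (decodeUtf8' bytes)
      enc     -> Left ("this darcs does not know encoding " ++ enc)

    -- Step 2: apply the replacement over the token stream produced
    -- by some format-specific lexer, leaving COMMENT lexemes alone.
    applyReplace :: Replace -> [Token] -> [Token]
    applyReplace p = map f
      where f (Atom t) | t == oldTok p = Atom (newTok p)
            f t                        = t

A real implementation would dispatch on encodingName via libiconv and
pick the lexer from the patch's recorded file format; all the hard
questions above (invalid files, format extensions, comments) live in
the part this sketch waves away.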
It has to know the token type being replaced.
=============================================

Given all the above, this ought to be pretty straightforward.
Consider a C file containing

    '+' + 'a'

You want the replace to act on the CHAR + token, but not the OP +
token.
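A toy, purely illustrative lexer for just this fragment (again,
nothing here is darcs code, and the token names are made up) shows the
distinction a type-aware replace has to draw:

    data CTok = CharLit Char | Op String
      deriving (Eq, Show)

    -- Lexes only enough of C for the example: character literals,
    -- the '+' operator, and spaces.
    lexFragment :: String -> [CTok]
    lexFragment []                 = []
    lexFragment (' ':rest)         = lexFragment rest
    lexFragment ('\'':c:'\'':rest) = CharLit c : lexFragment rest
    lexFragment ('+':rest)         = Op "+" : lexFragment rest
    lexFragment (c:_)              = error ("unexpected char: " ++ [c])

    -- Replace CHAR tokens only; OP tokens are left alone.
    replaceCharLit :: Char -> Char -> [CTok] -> [CTok]
    replaceCharLit old new = map f
      where f (CharLit c) | c == old = CharLit new
            f t                      = t

Here lexFragment "'+' + 'a'" yields [CharLit '+', Op "+", CharLit 'a'],
and replaceCharLit '+' 'x' rewrites only the first token; today's
byte- and whitespace-oriented replace cannot make that distinction.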