trentb...@gmail.com (Trent W. Buck) writes:

>>> I really do think that the "darcs replace" in its current form is too
>>> dangerous to use, because absolutely nothing has the exact lexical
>>> structure assumed by it.
Now that I've thrown down the gauntlet, let me briefly outline my ideas
of how a "real" darcs replace would work.  That is, a semantic
replacement operation that can be commuted with other replace and hunk
patches, producing expected/intuitive results.

This proposal assumes we have infinite developer and CPU hours.  Here
in the real world, I don't expect we'll EVER have the resources to
implement it.

Probably the best reference for this form of patch is the XSLT and
XPath standards, which provide similar (and more powerful)
functionality, but only for XML.

It has to understand encodings.
===============================

It's not acceptable to assume everything is a byte string, because a
replace patch might be created in a JIS locale, be applied in a UTF-16
locale, and be operating on a UTF-8 file.  I can't even attempt to
reason about such a scenario if the source file and the old and new
tokens are all considered unencoded byte strings.

Implications:

- The encoding can be guessed from the locale.

- The user needs to be able to override this guess on the CLI.

- The patch file will need to record the patch's assumed encoding.

- Darcs will need a list of known encodings.  This can probably be
  farmed off to libiconv.

- Something sensible should be done when applying the patch to a file
  that contains an invalid byte sequence (for that encoding).  For
  example, if the patch expects ASCII, what should it do when it
  encounters a byte with the high bit set?

- Something sensible should be done when a patch uses an encoding that
  the running Darcs doesn't know about.  Suppose Darcs 3.0 only knows
  about UTF-8 and ASCII, and is asked to apply a patch, created by
  Darcs 3.1, to a file encoded in ISO 8859-7?

(A toy sketch of how this decoding step might fail loudly appears
after the next section.)

It has to understand lexing.
============================

I gave examples upthread of how treating files as ATOMs separated by
folding WHITESPACE is simply unacceptable even for relatively trivial
file formats like sexprs, mexprs, or CSV.  I'll tentatively propose
XML's default serialization format as an example of an "easy" format
to support.

Implications:

- Guessing the file format can be farmed off to libmagic.

- The user needs to be able to override the guess on the CLI.

- The patch file will need to record the patch's assumed file format.

- Darcs will need a list of lexers.  This might be farmed off
  piecemeal (e.g. libexpat for XML).

- The lexical format of the file needs to be widely agreed upon.  For
  example, you could never have a generic "Scheme" file format,
  because is this a COMMENT, or an OCTOTHORPE, COMMENT and an ATOM?

      #; FNORD

  Contrariwise, a specific Scheme lexer (e.g. R6RS) might be included;
  but what are the implications if your file is intended for a
  compiler/interpreter that implements extensions to that file format?
  We are back to the problem of unexpected results after commutation.

- Similarly, if the file format's standard requires read-time
  evaluation (e.g. CL), it might be impossible to lex in finite time.

- What happens if the file is lexically invalid (e.g. an unclosed
  brace)?  It would be awesome if a patch that introduces lexical
  invalidity were "clumped" with a subsequent patch that makes the
  file valid again, such that the replace patch can commute before or
  after the clump, but not within it.

- How should COMMENT lexemes be handled?  A typical lexer library will
  discard them before Darcs sees the token stream, so probably you
  wouldn't be able to replace text within comments.
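To make the encoding and lexing requirements concrete, here is a
minimal Haskell sketch.  None of it is darcs code: the Replace record,
the Token type, and the function names are all hypothetical, only
UTF-8 is wired up, and the lexer itself is left abstract, since
choosing one is exactly the hard part discussed above.

    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8')

    -- A replace patch that records the encoding it was created under
    -- and speaks in tokens, not raw bytes.
    data Replace = Replace
      { encodingName   :: String   -- e.g. "UTF-8", stored in the patch
      , oldTok, newTok :: T.Text
      }

    data Token = Atom T.Text | Comment T.Text | Punct Char

    -- Step 1: decode with the patch's assumed encoding.  An invalid
    -- byte sequence is a hard, reportable failure, not silent
    -- garbage; so is an encoding this darcs doesn't know about.
    decodeWith :: Replace -> BS.ByteString -> Either String T.Text
    decodeWith p bytes = case encodingName p of
      "UTF-8" -> either (Left . show) Right (decodeUtf8' bytes)
      enc     -> Left ("this darcs does not know encoding " ++ enc)

    -- Step 2: apply the replacement over the token stream produced
    -- by some format-specific lexer, leaving COMMENT lexemes alone.
    applyReplace :: Replace -> [Token] -> [Token]
    applyReplace p = map f
      where f (Atom t) | t == oldTok p = Atom (newTok p)
            f t                        = t

A real implementation would dispatch on encodingName via libiconv and
pick the lexer from the patch's recorded file format; all the hard
questions above (invalid files, format extensions, comments) live in
the part this sketch waves away.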
It has to know the token type being replaced.
=============================================

Given all the above, this ought to be pretty straightforward.
Consider a C file containing

    '+' + 'a'

You want the replace to act on the CHAR + token, but not the OP +
token.
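A toy, purely illustrative lexer for just this fragment (again,
nothing here is darcs code, and the token names are made up) shows the
distinction a type-aware replace has to draw:

    data CTok = CharLit Char | Op String
      deriving (Eq, Show)

    -- Lexes only enough of C for the example: character literals,
    -- the '+' operator, and spaces.
    lexFragment :: String -> [CTok]
    lexFragment []                 = []
    lexFragment (' ':rest)         = lexFragment rest
    lexFragment ('\'':c:'\'':rest) = CharLit c : lexFragment rest
    lexFragment ('+':rest)         = Op "+" : lexFragment rest
    lexFragment (c:_)              = error ("unexpected char: " ++ [c])

    -- Replace CHAR tokens only; OP tokens are left alone.
    replaceCharLit :: Char -> Char -> [CTok] -> [CTok]
    replaceCharLit old new = map f
      where f (CharLit c) | c == old = CharLit new
            f t                      = t

Here lexFragment "'+' + 'a'" yields [CharLit '+', Op "+", CharLit 'a'],
and replaceCharLit '+' 'x' rewrites only the first token; today's
byte- and whitespace-oriented replace cannot make that distinction.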