[Bug other/86904] New: Column numbers ignore tab characters

dmalcolm at gcc dot gnu.org Thu, 09 Aug 2018 13:09:02 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86904


            Bug ID: 86904
           Summary: Column numbers ignore tab characters
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: diagnostic
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dmalcolm at gcc dot gnu.org
  Target Milestone: ---

As noted in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19165#c21 :

/* Both gcc and emacs number source *lines* starting at 1, but
   they have differing conventions for *columns*.

   GCC uses a 1-based convention for source columns,
   whereas Emacs's M-x column-number-mode uses a 0-based convention.

   For example, an error in the initial, left-hand
   column of source line 3 is reported by GCC as:

      some-file.c:3:1: error: ...etc...

   On navigating to the location of that error in Emacs
   (e.g. via "next-error"),
   the locus is reported in the Mode Line
   (assuming M-x column-number-mode) as:

     some-file.c   10%   (3, 0)

   i.e. "3:1:" in GCC corresponds to "(3, 0)" in Emacs.  */

In terms of 0 vs 1, GCC complies with the GNU standards here:
https://www.gnu.org/prep/standards/html_node/Errors.html

However, our "column numbers" are simply a 1-based byte count, so a tab
character currently advances the column by just 1.

(see also PR 49973, which covers the case of multibyte characters).

It turns out that we convert tab characters to *single* space characters when
printing source code.

This behavior has been present since Manu first implemented
-fdiagnostics-show-caret in r186305 (aka
5a9830842f69ebb059061e26f8b0699cbd85121e, PR 24985), where the logic
(then in diagnostic.c's diagnostic_show_locus) was:
      char c = *line == '\t' ? ' ' : *line;
      pp_character (context->printer, c);

(that logic is now in diagnostic-show-locus.c in layout::print_source_line)

This is arguably a bug, but it's intimately linked to the way in which we track
"column numbers": as noted above, they are currently a 1-based byte count (I
believe), so a tab character counts as 1.

Encodings that aren't 1 byte per character (e.g. non-ASCII Unicode characters)
raise similar issues, which are being tracked in PR 49973.

Presumably, when we print source lines containing tab characters, we should
emit a number of spaces to reach a tab stop.
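
A minimal sketch of that expansion, for illustration only: expand_tabs is a
hypothetical helper, not existing GCC code, and it assumes a caller-supplied
tab width and that every non-tab byte occupies exactly one display cell (so
it does not address the multibyte case from PR 49973).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the NUL-terminated SRC into DST (capacity
   DST_SIZE), replacing each tab with enough spaces to reach the next
   multiple of TAB_WIDTH.  Returns the number of characters written,
   excluding the terminator.  Output is truncated if DST is too small.  */
size_t
expand_tabs (const char *src, char *dst, size_t dst_size, int tab_width)
{
  assert (dst_size > 0 && tab_width > 0);
  size_t width = (size_t) tab_width;
  size_t out = 0;  /* 0-based display column of the next output cell.  */

  for (; *src != '\0' && out + 1 < dst_size; src++)
    {
      if (*src == '\t')
        {
          /* Pad with spaces up to the next tab stop.  */
          size_t next_stop = (out / width + 1) * width;
          while (out < next_stop && out + 1 < dst_size)
            dst[out++] = ' ';
        }
      else
        dst[out++] = *src;
    }
  dst[out] = '\0';
  return out;
}
```

With a tab width of 8, "\tindent" comes out as eight spaces followed by
"indent", rather than the single space printed today.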

Consider a diagnostic with a multiline range covering the
following source (and beyond):

      indent: 6 (tabs: 0, spaces: 6)
       indent: 7 (tabs: 0, spaces: 7)
        indent: 8 (tabs: 1, spaces: 0)
         indent: 9 (tabs: 1, spaces: 1)

i.e.:

  "      indent: 6 (tabs: 0, spaces: 6)\n"
  "       indent: 7 (tabs: 0, spaces: 7)\n"
  "\tindent: 8 (tabs: 1, spaces: 0)\n"
  "\t indent: 9 (tabs: 1, spaces: 1)\n"

Currently diagnostic_show_locus prints:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  indent: 8 (tabs: 1, spaces: 0)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   indent: 9 (tabs: 1, spaces: 1)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e.:

  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "  indent: 8 (tabs: 1, spaces: 0)\n"
  "  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "   indent: 9 (tabs: 1, spaces: 1)\n"
  "   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

which misrepresents the indentation of the user's code.

It should respect tabstops, and print:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         indent: 8 (tabs: 1, spaces: 0)
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          indent: 9 (tabs: 1, spaces: 1)
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e.:

  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "         indent: 8 (tabs: 1, spaces: 0)\n"
  "         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "          indent: 9 (tabs: 1, spaces: 1)\n"
  "          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

We should also handle erroneous leading spaces before a tab, so that e.g.

  "  \tfoo"

should be printed as if it were:

  "\tfoo"

(given that that's what the user's editor is probably printing it as).

Similarly, we should presumably print "8" for the column number for the 'f' of
"foo".  However, IDEs are expecting GCC's existing behavior, so we should
probably add a command-line option for controlling this.
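
To make the arithmetic concrete, here is a rough sketch of mapping a byte
column to a tab-aware display column.  display_column is a hypothetical name,
the tab width is assumed to be supplied by the caller, and non-tab bytes are
assumed to be one cell wide.  With 8-wide tab stops the 'f' sits 8 cells in
from the margin in both "\tfoo" and "  \tfoo": that is 8 when counting from 0
and 9 under GCC's 1-based convention, which is what this sketch returns.

```c
#include <assert.h>

/* Hypothetical helper: return the 1-based display column of the byte at
   1-based BYTE_COLUMN within LINE, expanding tabs to multiples of
   TAB_WIDTH.  Assumes every non-tab byte is one display cell wide.  */
int
display_column (const char *line, int byte_column, int tab_width)
{
  assert (byte_column >= 1 && tab_width > 0);
  int cells = 0;  /* Display cells used by the bytes before the target.  */

  for (int i = 0; i < byte_column - 1 && line[i] != '\0'; i++)
    {
      if (line[i] == '\t')
        cells = (cells / tab_width + 1) * tab_width;  /* Jump to tab stop.  */
      else
        cells++;
    }
  return cells + 1;
}
```

Leading spaces before a tab need no special case here: the tab simply absorbs
them when it advances to the next stop.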

Adding a left margin with line numbers (as of r263450) doesn't change this bug,
but makes fixing it slightly more complicated.

Maybe:
  -fdiagnostics-x-coord=bytes : count of bytes
  -fdiagnostics-x-coord=characters : count of characters (not
      special-casing tab)
  -fdiagnostics-x-coord=columns : count of columns: as per characters,
      but with tabs doing tabstops
(Currently we use "bytes".  Not sure if we need "characters".)

I'm thinking that internally, we should continue to track byte offsets, but
make the option affect the presentation of the number in diagnostics*.c.
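
Something like the following could do that conversion at presentation time.
This is only a sketch: the enum, the function name and the tab_width
parameter are all made up for illustration, the input is assumed to be
UTF-8, and each character is assumed to be one cell wide (no wide or
combining characters).

```c
#include <assert.h>

/* Hypothetical mirror of the three proposed option values.  */
enum x_coord_unit
{
  X_COORD_BYTES,
  X_COORD_CHARACTERS,
  X_COORD_COLUMNS
};

/* Hypothetical helper: convert the internally tracked 1-based BYTE_COLUMN
   within LINE into the 1-based number to show in a diagnostic.  */
int
present_column (const char *line, int byte_column,
                enum x_coord_unit unit, int tab_width)
{
  assert (byte_column >= 1 && tab_width > 0);
  if (unit == X_COORD_BYTES)
    return byte_column;  /* Existing behavior: report the byte count.  */

  int count = 0;
  for (int i = 0; i < byte_column - 1 && line[i] != '\0'; i++)
    {
      unsigned char c = (unsigned char) line[i];
      if ((c & 0xC0) == 0x80)
        continue;  /* UTF-8 continuation byte: not a new character.  */
      if (c == '\t' && unit == X_COORD_COLUMNS)
        count = (count / tab_width + 1) * tab_width;  /* Next tab stop.  */
      else
        count++;
    }
  return count + 1;
}
```

For the 'x' in a line consisting of a tab, a two-byte UTF-8 character and
then 'x', this gives 4 for bytes, 3 for characters and 10 for columns (with
8-wide tab stops).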

(Should it affect -fdiagnostics-parseable-fixits?
See also the Emacs RFE for parseable fixits:
  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987 )
