https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86904

            Bug ID: 86904
           Summary: Column numbers ignore tab characters
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: diagnostic
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dmalcolm at gcc dot gnu.org
  Target Milestone: ---

As noted in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19165#c21 :

/* Both gcc and emacs number source *lines* starting at 1, but
   they have differing conventions for *columns*.

   GCC uses a 1-based convention for source columns,
   whereas Emacs's M-x column-number-mode uses a
   0-based convention.

   For example, an error in the initial, left-hand
   column of source line 3 is reported by GCC as:

      some-file.c:3:1: error: ...etc...

   On navigating to the location of that error in Emacs
   (e.g. via "next-error"),
   the locus is reported in the Mode Line
   (assuming M-x column-number-mode) as:

     some-file.c   10%   (3, 0)

   i.e. "3:1:" in GCC corresponds to "(3, 0)" in Emacs.  */

In terms of 0 vs 1, GCC complies with the GNU standards here:
https://www.gnu.org/prep/standards/html_node/Errors.html

However our "column numbers" are also simply a 1-based byte-count, so a tab character is treated by us as simply an increment of 1 right now.

(see also PR 49973, which covers the case of multibyte characters).

It turns out that we convert tab characters to *single* space characters when printing source code.

This behavior has been present since Manu first implemented -fdiagnostics-show-caret in r186305 (aka 5a9830842f69ebb059061e26f8b0699cbd85121e, PR 24985), where it was this logic (there in diagnostic.c's diagnostic_show_locus):

      char c = *line == '\t' ? ' ' : *line;
      pp_character (context->printer, c);

(that logic is now in diagnostic-show-locus.c in layout::print_source_line)

This is arguably a bug, but it's intimately linked to the way in which we track "column numbers". Our "column numbers" are currently simply a 1-based byte-count, I believe, so a tab character is treated by us as simply an increment of 1 right now.
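
To make the current behavior concrete, here is a minimal standalone sketch. It is not GCC's code: the names byte_column and render_tabs_as_single_space are hypothetical, and the second function only mirrors the two quoted lines from diagnostic_show_locus.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: GCC-style column for the byte at 0-based OFFSET.
   It is a plain 1-based byte count, so a tab before OFFSET adds exactly 1.  */
int
byte_column (size_t offset)
{
  return (int) offset + 1;
}

/* Hypothetical helper: copy LINE into OUT (capacity OUT_SIZE, truncating
   if needed), replacing each tab with one space, mirroring the quoted
   "*line == '\t' ? ' ' : *line" logic.  */
void
render_tabs_as_single_space (const char *line, char *out, size_t out_size)
{
  assert (out_size > 0);
  size_t i = 0;
  for (; line[i] != '\0' && i + 1 < out_size; i++)
    out[i] = line[i] == '\t' ? ' ' : line[i];
  out[i] = '\0';
}
```

For the line "\tfoo", byte_column (1) gives 2 for the 'f' and the rendered text is " foo". The reported column and the printed source agree with each other, but both sit seven columns to the left of where an editor with 8-column tab stops displays the 'f'.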

There are similar issues with encodings that aren't 1 byte per character (e.g. non-ASCII Unicode characters), which are being tracked in PR 49973.

Presumably, when we print source lines containing tab characters, we should emit a number of spaces to reach a tab stop.

Consider a diagnostic with a multiline range covering the following source (and beyond):

      indent: 6 (tabs: 0, spaces: 6)
       indent: 7 (tabs: 0, spaces: 7)
	indent: 8 (tabs: 1, spaces: 0)
	 indent: 9 (tabs: 1, spaces: 1)

i.e.:

  "      indent: 6 (tabs: 0, spaces: 6)\n"
  "       indent: 7 (tabs: 0, spaces: 7)\n"
  "\tindent: 8 (tabs: 1, spaces: 0)\n"
  "\t indent: 9 (tabs: 1, spaces: 1)\n"

Currently diagnostic_show_locus prints:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  indent: 8 (tabs: 1, spaces: 0)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   indent: 9 (tabs: 1, spaces: 1)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e.:

  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "  indent: 8 (tabs: 1, spaces: 0)\n"
  "  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "   indent: 9 (tabs: 1, spaces: 1)\n"
  "   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

which misrepresents the indentation of the user's code.

It should respect tabstops, and print:

       indent: 6 (tabs: 0, spaces: 6)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        indent: 7 (tabs: 0, spaces: 7)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         indent: 8 (tabs: 1, spaces: 0)
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          indent: 9 (tabs: 1, spaces: 1)
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e.:

  "       indent: 6 (tabs: 0, spaces: 6)\n"
  "       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "        indent: 7 (tabs: 0, spaces: 7)\n"
  "        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "         indent: 8 (tabs: 1, spaces: 0)\n"
  "         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
  "          indent: 9 (tabs: 1, spaces: 1)\n"
  "          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"

We should also handle erroneous leading spaces before a tab, so that e.g.
" \tfoo" should be printed as if it were: "\tfoo" (given that that's what the user's editor is probably printing it as). Similarly, we should presumably print "8" for the column number for the 'f' of "foo". However, IDEs are expecting GCC's existing behavior, so we should probably add a command-line option for controlling this. Adding a left margin with line numbers (as of r263450) doesn't change this bug, but makes fixing it slightly more complicated. Maybe: -fdiagnostics-x-coord=bytes : count of bytes -fdiagnostics-x-coord=characters : count of characters (not special-casing tab) -fdiagnostics-x-coord=columns : count of columns: as per characters, but with tabs doing tabstops (currently we use "bytes" Not sure if we need "characters") I'm thinking that internally, we should continue to track byte offsets, but make the option affect the presentation of the number in diagnostics*.c. (should it affect -fdiagnostics-parseable-fixits ? see also the Emacs RFE for parseable fixits: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987 )