In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/7e2f38b21adec3a814f436b1c0c7e79796f5dcc5?hp=4623556176aef2c3bff081e1456fdf7f5c0695c1>
- Log ----------------------------------------------------------------- commit 7e2f38b21adec3a814f436b1c0c7e79796f5dcc5 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 19:57:53 2017 -0700 Make _byte_dump_string() usable in all of core I found myself needing this function for development debugging, which formerly was only usable from utf8.c. This enhances it to allow a second format type, and makes it core-accessible. M embed.fnc M embed.h M locale.c M proto.h M utf8.c commit 6b58f9be8c82394d776908727ff4e9d582f87f12 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 19:22:59 2017 -0700 toke.c: Make sure things are initialized Commit 3dd4eaeb8ac39e08179145b86aedda36584a3509 fixed a bug wherein the tr/// operator parsing code could be looking at uninitialized data. This happens only because we try to carry on when we find errors, so as to find as many errors as possible in a single run, as a convenience to the person debugging the script being compiled. And we failed to initialize stuff upon getting an error; stuff that was later looked at by tr///. That commit fixed the ticket by making sure the things mentioned there got initialized upon error, but didn't handle the various other places in the loop where the same thing could happen. At the time, I thought it would be easier to instead change the tr/// handling code to know that its inputs were problematic, and to avoid looking at them in that case. This is easily done, and would automatically catch all the cases in the loop, now and any added in the future. But then I thought, maybe tr/// isn't the only operator that could be thrown off by this. It is the most obvious one, to someone who knows how it goes about getting compiled; but there may be other operators that I don't know how they get compiled and have the same or a similar problem. The better solution then would be to extend 3dd4eaeb8ac39e08179145b86aedda36584a3509 to make sure everything gets initialized when there is an error. That is what this current commit does. The previous few commits have refactored things so as to minimize the number of places that need to be handled here, down to three. I kinda doubt that new constructs will be added, at this stage in the language development, that would require the same initialization handling. But, if they were, hopefully those doing it would follow the existing paradigm that this commit and 3dd4eaeb8ac39e08179145b86aedda36584a3509 establish. Another way to handle this would have been to, instead of doing an initialize-and-'continue', to instead jump to a common label at the bottom of the loop which does the initialization. I think it doesn't matter much which, so left it as this. M toke.c commit e90756be0bd7bbe5f7193c0e9e856c56ae09c085 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 19:18:38 2017 -0700 toke.c: Quit now if error at end of input In these two cases, we know we are at the end of the input, and that we have an error. There is no need to try to patch things up so we can continue to parse looking for other errors; there's nothing left to parse. So skip having to deal with patching up. M toke.c commit 4c3e84f3090c9cc68d352a2c80a95c91d25a8fc7 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 19:12:31 2017 -0700 toke.c: Un-special case something By refactoring slightly, we make this code in a switch statement have the same entrance and exit invariants as the other cases, so they all can be handled uniformly at the end of the switch. M toke.c commit f065e1e68bf6a5541c8ceba8c9fcc6e18f51a32b Author: Karl Williamson <[email protected]> Date: Mon Feb 13 19:01:46 2017 -0700 Don't try to compile a pattern known to be in error Regular expression patterns are parsed by the lexer/toker, and then compiled by the regex compiler. It is foolish to try to compile one that the parser has rejected as syntactically bad; assumptions may be violated and segfaults ensue. This commit abandons all parsing immediately if a pattern had errors in it. A better solution would be to flag this pattern as not to be compiled, and continue parsing other things so as to find the most errors in a single attempt, but I don't think it's worth the extra effort. Making this change caused some misleading error messages in the test suite to be replaced by better ones. M t/re/re_tests M toke.c commit d43328d502ac91c4d98e218d0721cd5f3bcd3950 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 18:49:52 2017 -0700 toke.c: Add internal function to abort parsing This is to be called to abort the parsing early, before the required number of errors have been found. It is used when continuing the parse would be either fruitless or we could be looking at garbage. M embed.fnc M embed.h M proto.h M toke.c commit ae389f219fdcdf1c117790fefa36a302c0d16edc Author: Karl Williamson <[email protected]> Date: Mon Feb 13 18:46:30 2017 -0700 toke.c: White-space only Indent after the previous commit enclosed this code in a new block. M toke.c commit 68da2a652769fdc14ed8787783c334789f2614d1 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 18:27:02 2017 -0700 Relax internal function API This changes yyerror_pvn so that its first parameter can be NULL. This indicates no message is to be output, but that parsing is to be abandoned immediately, without waiting for more errors to build up. M embed.fnc M proto.h M toke.c commit c77da5ff879f4bc56ececd521148ca08ab8ce837 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 13:18:38 2017 -0700 Extract code into a function This creates a function in toke.c to output the compilation aborted message, changing perl.c to call that function. This is in preparation for this to be called from a 2nd place M embed.fnc M embed.h M perl.c M pod/perldiag.pod M proto.h M toke.c commit f88c646639aaa9f3f90a801fa3aa5d1ce27fd3c0 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 13:03:32 2017 -0700 toke.c: Rmv no longer necessary UTF-8 checks The previous commit tightened up the checking for well-formed UTF8ness, so that the ones removed here were redundant. The test during a string eval may also no longer be necessary, but since there are many ways to create that string, I'm not confidant enough to remove it. M pod/perldiag.pod M toke.c commit 7f7dbdc9c79b3476c92af87ef8d1f9178f566fc5 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 20:40:57 2017 -0700 Add test for [perl #130675] M t/lib/croak/toke_l1 commit 0b75cb8099a8002299b9fedfe415323b9e21a5e0 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 20:37:47 2017 -0700 t/lib/croak/toke_l1: Cut down test The previous commits hardening toke.c against malformed UTF-8 input have allowed this test case to be cut down substantially M t/lib/croak/toke_l1 commit efa571ab2dff0efcc97f14f98ed11f9296cd27a7 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 12:35:18 2017 -0700 toke.c: Fix bugs where UTF-8 is turned on in mid chunk Previous commits have tightened up the checking of UTF-8 for well-formedness in the input program or string eval. This is done in lex_next_chunk and lex_start. But it doesn't handle the case of use utf8; foo because 'foo' is checked while UTF-8 is still off. This solves that problem by noticing when utf8 is turned on, and then rechecking at the next opportunity. See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/242916 This fixes [perl #130675]. A test will be added in a future commit This catches some errors earlier than they used to be and aborts. so some tests in the suite had to be split into multiple parts. M embed.fnc M embed.h M lib/utf8.t M mg.c M parser.h M proto.h M sv.c M t/lib/croak/toke M t/lib/croak/toke_l1 M t/uni/parser.t M toke.c commit 5d382c252dbb26302398c7fe38021a0f3c06b809 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 12:16:45 2017 -0700 mg.c: PL_hints is unsigned Therefore it's dangerous to presume things fit into an IV. M mg.c commit 0aab20f283bf5c6088780bee353c6a125f50088b Author: Karl Williamson <[email protected]> Date: Mon Feb 13 12:02:54 2017 -0700 toke.c: Add branch prediction The input is far more likely to be well-formed than not. M toke.c commit ef3ff34de0f1e12e7961a83cbf63dcbb91808ba2 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 11:54:06 2017 -0700 toke.c: Fix comments describing S_tokeq The comments about what this function does were incorrect. M toke.c commit 07337b9591822033e937370211f01ca63f3243f1 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 11:51:59 2017 -0700 toke.c: Slight refactor. This moves an automatic variable to closer to the only place it is used; it also adds branch prediction. It is likely that the input will be well-formed. M toke.c commit 218304f90362a79b96bb58e1ef63c0f213451079 Author: Karl Williamson <[email protected]> Date: Mon Feb 13 11:29:35 2017 -0700 toke.c: White space, comments, braces I am adding the braces because in one of the areas, the lack of braces had led to a blead failure. M toke.c commit 7cc9fe9f24f654e4295a82ca84e289a5160cf10c Author: Karl Williamson <[email protected]> Date: Mon Feb 13 11:31:53 2017 -0700 toke.c: Don't compare same bytes twice Before starting this memEQ, we know that the first bytes are the same, so might as well start the compare with the 2nd bytes. M toke.c commit a6909bd2c28e2bd347ad942359c890e1b8ea6d4a Author: Karl Williamson <[email protected]> Date: Mon Feb 13 11:15:07 2017 -0700 toke.c: Move declaration This automatic variable doesn't need such a large scope. M toke.c ----------------------------------------------------------------------- Summary of changes: embed.fnc | 10 +- embed.h | 5 +- lib/utf8.t | 2 +- locale.c | 8 +- mg.c | 12 +- parser.h | 2 + perl.c | 7 +- pod/perldiag.pod | 10 +- proto.h | 13 +- sv.c | 1 + t/lib/croak/toke | 18 ++- t/lib/croak/toke_l1 | Bin 3193 -> 1127 bytes t/re/re_tests | 7 +- t/uni/parser.t | 2 +- toke.c | 368 +++++++++++++++++++++++++++++++--------------------- utf8.c | 41 +++--- 16 files changed, 306 insertions(+), 200 deletions(-) diff --git a/embed.fnc b/embed.fnc index d548f5d5e5..0f63ed0c0f 100644 --- a/embed.fnc +++ b/embed.fnc @@ -859,6 +859,7 @@ pP |I32 |keyword |NN const char *name|I32 len|bool all_keywords s |void |inplace_aassign |NN OP* o #endif Ap |void |leave_scope |I32 base +p |void |notify_parser_that_changed_to_utf8 : Public lexer API AMpd |void |lex_start |NULLOK SV* line|NULLOK PerlIO *rsfp|U32 flags AMpd |bool |lex_bufutf8 @@ -1714,6 +1715,10 @@ ApdD |UV |to_utf8_case |NN const U8 *p \ |NN SV **swashp \ |NN const char *normal| \ NULLOK const char *special +ApM |char * |_byte_dump_string \ + |NN const U8 * s \ + |const STRLEN len \ + |const bool format #if defined(PERL_IN_UTF8_C) inR |bool |does_utf8_overflow|NN const U8 * const s|NN const U8 * e inR |bool |is_utf8_overlong_given_start_byte_ok|NN const U8 * const s|const STRLEN len @@ -1723,7 +1728,6 @@ sMR |char * |unexpected_non_continuation_text \ |STRLEN print_len \ |const STRLEN non_cont_byte_pos \ |const STRLEN expect_len -sM |char * |_byte_dump_string|NN const U8 * s|const STRLEN len s |void |warn_on_first_deprecated_use \ |NN const char * const name \ |NN const char * const alternative \ @@ -1876,8 +1880,10 @@ inR |bool |should_warn_nl|NN const char *pv p |void |write_to_stderr|NN SV* msv : Used in op.c p |int |yyerror |NN const char *const s +p |void |yyquit +p |void |abort_execution|NN const char * const msg|NN const char * const name p |int |yyerror_pv |NN const char *const s|U32 flags -p |int |yyerror_pvn |NN const char *const s|STRLEN len|U32 flags +p |int |yyerror_pvn |NULLOK const char *const s|STRLEN len|U32 flags : Used in perly.y, and by Data::Alias EXp |int |yylex p |void |yyunlex diff --git a/embed.h b/embed.h index 2233a35e80..5b9c46c578 100644 --- a/embed.h +++ b/embed.h @@ -27,6 +27,7 @@ /* Hide global symbols */ #define Gv_AMupdate(a,b) Perl_Gv_AMupdate(aTHX_ a,b) +#define _byte_dump_string(a,b,c) Perl__byte_dump_string(aTHX_ a,b,c) #define _force_out_malformed_utf8_message(a,b,c,d) Perl__force_out_malformed_utf8_message(aTHX_ a,b,c,d) #define _is_in_locale_category(a,b) Perl__is_in_locale_category(aTHX_ a,b) #define _is_uni_FOO(a,b) Perl__is_uni_FOO(aTHX_ a,b) @@ -1164,6 +1165,7 @@ #ifdef PERL_CORE #define Slab_Alloc(a) Perl_Slab_Alloc(aTHX_ a) #define Slab_Free(a) Perl_Slab_Free(aTHX_ a) +#define abort_execution(a,b) Perl_abort_execution(aTHX_ a,b) #define alloc_LOGOP(a,b,c) Perl_alloc_LOGOP(aTHX_ a,b,c) #define allocmy(a,b,c) Perl_allocmy(aTHX_ a,b,c) #define amagic_is_enabled(a) Perl_amagic_is_enabled(aTHX_ a) @@ -1351,6 +1353,7 @@ #define newXS_len_flags(a,b,c,d,e,f,g) Perl_newXS_len_flags(aTHX_ a,b,c,d,e,f,g) #define nextargv(a,b) Perl_nextargv(aTHX_ a,b) #define noperl_die Perl_noperl_die +#define notify_parser_that_changed_to_utf8() Perl_notify_parser_that_changed_to_utf8(aTHX) #define oopsAV(a) Perl_oopsAV(aTHX_ a) #define oopsHV(a) Perl_oopsHV(aTHX_ a) #define op_unscope(a) Perl_op_unscope(aTHX_ a) @@ -1414,6 +1417,7 @@ #define yyerror_pv(a,b) Perl_yyerror_pv(aTHX_ a,b) #define yyerror_pvn(a,b,c) Perl_yyerror_pvn(aTHX_ a,b,c) #define yyparse(a) Perl_yyparse(aTHX_ a) +#define yyquit() Perl_yyquit(aTHX) #define yyunlex() Perl_yyunlex(aTHX) # if !(defined(DEBUGGING)) # if !defined(NV_PRESERVES_UV) @@ -1836,7 +1840,6 @@ #define isa_lookup(a,b,c,d) S_isa_lookup(aTHX_ a,b,c,d) # endif # if defined(PERL_IN_UTF8_C) -#define _byte_dump_string(a,b) S__byte_dump_string(aTHX_ a,b) #define _to_utf8_case(a,b,c,d,e,f,g) S__to_utf8_case(aTHX_ a,b,c,d,e,f,g) #define check_and_deprecate(a,b,c,d,e,f) S_check_and_deprecate(aTHX_ a,b,c,d,e,f) #define check_locale_boundary_crossing(a,b,c,d) S_check_locale_boundary_crossing(aTHX_ a,b,c,d) diff --git a/lib/utf8.t b/lib/utf8.t index e5f9547a8a..d35110baee 100644 --- a/lib/utf8.t +++ b/lib/utf8.t @@ -168,7 +168,7 @@ no utf8; # Ironic, no? use utf8; %a = ("$malformed" =>"sterling"); print 'start'; printf '%x,', ord \$_ foreach keys %a; print "end\n"; BANG - qr/^Malformed UTF-8 character: .*? \(too short; \d bytes? available, need \d\).*start\d+,end$/sm + qr/^Malformed UTF-8 character: .*? \(unexpected non-continuation byte/ ], ); foreach (@tests) { diff --git a/locale.c b/locale.c index 01962ea235..1ba802f1ae 100644 --- a/locale.c +++ b/locale.c @@ -1908,14 +1908,12 @@ Perl__mem_collxfrm(pTHX_ const char *input_string, #ifdef DEBUGGING if (DEBUG_Lv_TEST || debug_initialization) { - Size_t i; print_collxfrm_input_and_return(s, s + len, xlen, utf8); PerlIO_printf(Perl_debug_log, "Its xfrm is:"); - for (i = COLLXFRM_HDR_LEN; i < *xlen + COLLXFRM_HDR_LEN; i++) { - PerlIO_printf(Perl_debug_log, " %02x", (U8) xbuf[i]); - } - PerlIO_printf(Perl_debug_log, "\n"); + PerlIO_printf(Perl_debug_log, "%s\n", + _byte_dump_string((U8 *) xbuf + COLLXFRM_HDR_LEN, + *xlen, 1)); } #endif diff --git a/mg.c b/mg.c index 172127c865..6e648d8c2e 100644 --- a/mg.c +++ b/mg.c @@ -944,7 +944,7 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg) } break; case '\010': /* ^H */ - sv_setiv(sv, (IV)PL_hints); + sv_setuv(sv, PL_hints); break; case '\011': /* ^I */ /* NOT \t in EBCDIC */ sv_setpv(sv, PL_inplace); /* Will undefine sv if PL_inplace is NULL */ @@ -2737,7 +2737,15 @@ Perl_magic_set(pTHX_ SV *sv, MAGIC *mg) PL_maxsysfd = SvIV(sv); break; case '\010': /* ^H */ - PL_hints = SvIV(sv); + { + U32 save_hints = PL_hints; + PL_hints = SvUV(sv); + + /* If wasn't UTF-8, and now is, notify the parser */ + if ((PL_hints & HINT_UTF8) && ! (save_hints & HINT_UTF8)) { + notify_parser_that_changed_to_utf8(); + } + } break; case '\011': /* ^I */ /* NOT \t in EBCDIC */ Safefree(PL_inplace); diff --git a/parser.h b/parser.h index ad148c26eb..4187e0a93d 100644 --- a/parser.h +++ b/parser.h @@ -115,6 +115,8 @@ typedef struct yy_parser { IV sig_optelems; /* number of optional signature elems seen */ char sig_slurpy; /* the sigil of the slurpy var (or null) */ + bool recheck_utf8_validity; + PERL_BITFIELD16 in_pod:1; /* lexer is within a =pod section */ PERL_BITFIELD16 filtered:1; /* source filters in evalbytes */ PERL_BITFIELD16 saw_infix_sigil:1; /* saw & or * or % operator */ diff --git a/perl.c b/perl.c index 09eb2f4f62..98bf356652 100644 --- a/perl.c +++ b/perl.c @@ -2374,12 +2374,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) SETERRNO(0,SS_NORMAL); if (yyparse(GRAMPROG) || PL_parser->error_count) { - if (PL_minus_c) - Perl_croak(aTHX_ "%s had compilation errors.\n", PL_origfilename); - else { - Perl_croak(aTHX_ "Execution of %s aborted due to compilation errors.\n", - PL_origfilename); - } + abort_execution("", PL_origfilename); } CopLINE_set(PL_curcop, 0); SET_CURSTASH(PL_defstash); diff --git a/pod/perldiag.pod b/pod/perldiag.pod index c2408f047d..99da61ec1f 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -2193,7 +2193,7 @@ variable and glob that. (F) The C<exec> function is not implemented on some systems, e.g., Symbian OS. See L<perlport>. -=item Execution of %s aborted due to compilation errors. +=item %sExecution of %s aborted due to compilation errors. (F) The final summary message when a Perl compilation fails. @@ -3414,14 +3414,6 @@ message. See also L<Encode/"Handling Malformed Data">. -=item Malformed UTF-8 character immediately after '%s' - -(F) You said C<use utf8>, but the program file doesn't comply with UTF-8 -encoding rules. The message prints out the properly encoded characters -just before the first bad one. If C<utf8> warnings are enabled, a -warning is generated that gives more details about the type of -malformation. - =item Malformed UTF-8 returned by \N{%s} immediately after '%s' (F) The charnames handler returned malformed UTF-8. diff --git a/proto.h b/proto.h index 7ec784981a..c61980e585 100644 --- a/proto.h +++ b/proto.h @@ -41,6 +41,9 @@ PERL_CALLCONV void* Perl_Slab_Alloc(pTHX_ size_t sz) PERL_CALLCONV void Perl_Slab_Free(pTHX_ void *op); #define PERL_ARGS_ASSERT_SLAB_FREE \ assert(op) +PERL_CALLCONV char * Perl__byte_dump_string(pTHX_ const U8 * s, const STRLEN len, const bool format); +#define PERL_ARGS_ASSERT__BYTE_DUMP_STRING \ + assert(s) PERL_CALLCONV void Perl__force_out_malformed_utf8_message(pTHX_ const U8 *const p, const U8 * const e, const U32 flags, const bool die_here); #define PERL_ARGS_ASSERT__FORCE_OUT_MALFORMED_UTF8_MESSAGE \ assert(p); assert(e) @@ -115,6 +118,9 @@ PERL_CALLCONV UV Perl__to_utf8_upper_flags(pTHX_ const U8 *p, const U8 *e, U8* u #define PERL_ARGS_ASSERT__TO_UTF8_UPPER_FLAGS \ assert(p); assert(ustrp); assert(file) PERL_CALLCONV void Perl__warn_problematic_locale(void); +PERL_CALLCONV void Perl_abort_execution(pTHX_ const char * const msg, const char * const name); +#define PERL_ARGS_ASSERT_ABORT_EXECUTION \ + assert(msg); assert(name) PERL_CALLCONV LOGOP* Perl_alloc_LOGOP(pTHX_ I32 type, OP *first, OP *other); PERL_CALLCONV PADOFFSET Perl_allocmy(pTHX_ const char *const name, const STRLEN len, const U32 flags); #define PERL_ARGS_ASSERT_ALLOCMY \ @@ -2331,6 +2337,7 @@ PERL_CALLCONV_NO_RET void Perl_noperl_die(const char* pat, ...) assert(pat) PERL_CALLCONV int Perl_nothreadhook(pTHX); +PERL_CALLCONV void Perl_notify_parser_that_changed_to_utf8(pTHX); PERL_CALLCONV OP* Perl_oopsAV(pTHX_ OP* o) __attribute__warn_unused_result__; #define PERL_ARGS_ASSERT_OOPSAV \ @@ -3695,10 +3702,9 @@ PERL_CALLCONV int Perl_yyerror_pv(pTHX_ const char *const s, U32 flags); #define PERL_ARGS_ASSERT_YYERROR_PV \ assert(s) PERL_CALLCONV int Perl_yyerror_pvn(pTHX_ const char *const s, STRLEN len, U32 flags); -#define PERL_ARGS_ASSERT_YYERROR_PVN \ - assert(s) PERL_CALLCONV int Perl_yylex(pTHX); PERL_CALLCONV int Perl_yyparse(pTHX_ int gramtype); +PERL_CALLCONV void Perl_yyquit(pTHX); PERL_CALLCONV void Perl_yyunlex(pTHX); #if !(defined(DEBUGGING)) # if !defined(NV_PRESERVES_UV) @@ -5621,9 +5627,6 @@ STATIC bool S_isa_lookup(pTHX_ HV *stash, const char * const name, STRLEN len, U assert(stash); assert(name) #endif #if defined(PERL_IN_UTF8_C) -STATIC char * S__byte_dump_string(pTHX_ const U8 * s, const STRLEN len); -#define PERL_ARGS_ASSERT__BYTE_DUMP_STRING \ - assert(s) STATIC UV S__to_utf8_case(pTHX_ const UV uv1, const U8 *p, U8* ustrp, STRLEN *lenp, SV **swashp, const char *normal, const char *special); #define PERL_ARGS_ASSERT__TO_UTF8_CASE \ assert(p); assert(ustrp); assert(swashp); assert(normal) diff --git a/sv.c b/sv.c index 472d69c4e5..e0c327a350 100644 --- a/sv.c +++ b/sv.c @@ -13204,6 +13204,7 @@ Perl_parser_dup(pTHX_ const yy_parser *const proto, CLONE_PARAMS *const param) parser->sig_elems = proto->sig_elems; parser->sig_optelems= proto->sig_optelems; parser->sig_slurpy = proto->sig_slurpy; + parser->recheck_utf8_validity = proto->recheck_utf8_validity; parser->linestr = sv_dup_inc(proto->linestr, param); { diff --git a/t/lib/croak/toke b/t/lib/croak/toke index f1817b3a7e..40354955e5 100644 --- a/t/lib/croak/toke +++ b/t/lib/croak/toke @@ -180,9 +180,16 @@ Execution of - aborted due to compilation errors. # NAME Regexp constant overloading when *^H is undefined use overload; BEGIN { overload::constant qr => sub {}; undef *^H } -/a/, m'a' +/a/ EXPECT Constant(qq) unknown at - line 3, within pattern +Execution of - aborted due to compilation errors. +######## +# NAME Regexp constant overloading when *^H is undefined +use overload; +BEGIN { overload::constant qr => sub {}; undef *^H } +m'a' +EXPECT Constant(q) unknown at - line 3, within pattern Execution of - aborted due to compilation errors. ######## @@ -232,9 +239,16 @@ Execution of - aborted due to compilation errors. # NAME Regexp constant overloading returning undef use overload; BEGIN { overload::constant qr => sub {} } -/a/, m'a' +/a/ EXPECT Constant(qq): Call to &{$^H{qr}} did not return a defined value at - line 3, within pattern +Execution of - aborted due to compilation errors. +######## +# NAME Regexp constant overloading returning undef +use overload; +BEGIN { overload::constant qr => sub {} } +m'a' +EXPECT Constant(q): Call to &{$^H{qr}} did not return a defined value at - line 3, within pattern Execution of - aborted due to compilation errors. ######## diff --git a/t/lib/croak/toke_l1 b/t/lib/croak/toke_l1 index bb85b025d5..6d656beb7c 100644 Binary files a/t/lib/croak/toke_l1 and b/t/lib/croak/toke_l1 differ diff --git a/t/re/re_tests b/t/re/re_tests index f210202657..410fceadac 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -1464,13 +1464,10 @@ abc\N abc\n n # Verify get errors. For these, we need // or else puts it in single quotes, # and bypasses the lexer. /\N{U+}/ - c - Invalid hexadecimal number -# Below currently gives a misleading message -/[\N{U+}]/ - Sc - Unmatched -/[\N{U+}]/ - sc - Syntax error in (?[...]) +/[\N{U+}]/ - c - Invalid hexadecimal number /abc\N{def/ - c - Missing right brace /\N{U+4AG3}/ - c - Invalid hexadecimal number -/[\N{U+4AG3}]/ - Sc - Unmatched -/[\N{U+4AG3}]/ - sc - Syntax error in (?[...]) +/[\N{U+4AG3}]/ - c - Invalid hexadecimal number # And verify that in single quotes which bypasses the lexer, the regex compiler # figures it out. diff --git a/t/uni/parser.t b/t/uni/parser.t index 624fdd04df..2c68fb0473 100644 --- a/t/uni/parser.t +++ b/t/uni/parser.t @@ -197,7 +197,7 @@ like( $@, qr/Bad name after Fï½ï½'/, 'Bad name after Fï½ï½\'' ); ? "\x{74}\x{41}" : "\x{c0}\x{a0}"; CORE::evalbytes "use charnames ':full'; use utf8; my \$x = \"\\N{abc$malformed_to_be}\""; - like( $@, qr/Malformed UTF-8 character immediately after '\\N\{abc' at .* within string/, 'Malformed UTF-8 input to \N{}'); + like( $@, qr/Malformed UTF-8 character \(fatal\) at /, 'Malformed UTF-8 input to \N{}'); } # RT# 124216: Perl_sv_clear: Assertion diff --git a/toke.c b/toke.c index d529f03b72..b9ea8486c1 100644 --- a/toke.c +++ b/toke.c @@ -701,7 +701,6 @@ Perl_lex_start(pTHX_ SV *line, PerlIO *rsfp, U32 flags) { const char *s = NULL; yy_parser *parser, *oparser; - const U8* first_bad_char_loc; if (flags && flags & ~LEX_START_FLAGS) Perl_croak(aTHX_ "Lexing code internal error (%s)", "lex_start"); @@ -728,6 +727,7 @@ Perl_lex_start(pTHX_ SV *line, PerlIO *rsfp, U32 flags) parser->lex_state = LEX_NORMAL; parser->expect = XSTATE; parser->rsfp = rsfp; + parser->recheck_utf8_validity = FALSE; parser->rsfp_filters = !(flags & LEX_START_SAME_FILTER) || !oparser ? NULL @@ -744,11 +744,14 @@ Perl_lex_start(pTHX_ SV *line, PerlIO *rsfp, U32 flags) if (line) { STRLEN len; + const U8* first_bad_char_loc; + s = SvPV_const(line, len); - if (SvUTF8(line) && ! is_utf8_string_loc((U8 *) s, - SvCUR(line), - &first_bad_char_loc)) + if ( SvUTF8(line) + && UNLIKELY(! is_utf8_string_loc((U8 *) s, + SvCUR(line), + &first_bad_char_loc))) { _force_out_malformed_utf8_message(first_bad_char_loc, (U8 *) s + SvCUR(line), @@ -765,6 +768,7 @@ Perl_lex_start(pTHX_ SV *line, PerlIO *rsfp, U32 flags) } else { parser->linestr = newSVpvn("\n;", rsfp ? 1 : 2); } + parser->oldoldbufptr = parser->oldbufptr = parser->bufptr = @@ -1053,12 +1057,7 @@ Perl_lex_stuff_pvn(pTHX_ const char *pv, STRLEN len, U32 flags) } else if (UTF8_IS_NEXT_CHAR_DOWNGRADEABLE(p, e)) { p++; highhalf++; - } else if (! UTF8_IS_INVARIANT(c)) { - _force_out_malformed_utf8_message((U8 *) p, (U8 *) e, - 0, - 1 /* 1 means die */ ); - NOT_REACHED; /* NOTREACHED */ - } + } else assert(UTF8_IS_INVARIANT(c)); } if (!highhalf) goto plain_copy; @@ -1272,6 +1271,24 @@ Perl_lex_discard_to(pTHX_ char *ptr) PL_parser->last_lop -= discard_len; } +void +Perl_notify_parser_that_changed_to_utf8(pTHX) +{ + /* Called when $^H is changed to indicate that HINT_UTF8 has changed from + * off to on. At compile time, this has the effect of entering a 'use + * utf8' section. This means that any input was not previously checked for + * UTF-8 (because it was off), but now we do need to check it, or our + * assumptions about the input being sane could be wrong, and we could + * segfault. This routine just sets a flag so that the next time we look + * at the input we do the well-formed UTF-8 check. If we aren't in the + * proper phase, there may not be a parser object, but if there is, setting + * the flag is harmless */ + + if (PL_parser) { + PL_parser->recheck_utf8_validity = TRUE; + } +} + /* =for apidoc Amx|bool|lex_next_chunk|U32 flags @@ -1307,7 +1324,6 @@ Perl_lex_next_chunk(pTHX_ U32 flags) STRLEN linestart_pos, last_uni_pos, last_lop_pos; bool got_some_for_debugger = 0; bool got_some; - const U8* first_bad_char_loc; if (flags & ~(LEX_KEEP_PREVIOUS|LEX_FAKE_EOF|LEX_NO_TERM)) Perl_croak(aTHX_ "Lexing code internal error (%s)", "lex_next_chunk"); @@ -1374,15 +1390,19 @@ Perl_lex_next_chunk(pTHX_ U32 flags) PL_parser->bufend = buf + new_bufend_pos; PL_parser->bufptr = buf + bufptr_pos; - if (UTF && ! is_utf8_string_loc((U8 *) PL_parser->bufptr, - PL_parser->bufend - PL_parser->bufptr, - &first_bad_char_loc)) - { - _force_out_malformed_utf8_message(first_bad_char_loc, - (U8 *) PL_parser->bufend, - 0, - 1 /* 1 means die */ ); - NOT_REACHED; /* NOTREACHED */ + if (UTF) { + const U8* first_bad_char_loc; + if (UNLIKELY(! is_utf8_string_loc( + (U8 *) PL_parser->bufptr, + PL_parser->bufend - PL_parser->bufptr, + &first_bad_char_loc))) + { + _force_out_malformed_utf8_message(first_bad_char_loc, + (U8 *) PL_parser->bufend, + 0, + 1 /* 1 means die */ ); + NOT_REACHED; /* NOTREACHED */ + } } PL_parser->oldbufptr = buf + oldbufptr_pos; @@ -2263,10 +2283,9 @@ S_force_strict_version(pTHX_ char *s) /* * S_tokeq - * Tokenize a quoted string passed in as an SV. It finds the next - * chunk, up to end of string or a backslash. It may make a new - * SV containing that chunk (if HINT_NEW_STRING is on). It also - * turns \\ into \. + * Turns any \\ into \ in a quoted string passed in in 'sv', returning 'sv', + * modified as necessary. However, if HINT_NEW_STRING is on, 'sv' is + * unchanged, and a new SV containing the modified input is returned. */ STATIC SV * @@ -2571,7 +2590,6 @@ S_get_and_check_backslash_N_name(pTHX_ const char* s, const char* const e) SV *cv; SV *rv; HV *stash; - const U8* first_bad_char_loc; const char* backslash_ptr = s - 3; /* Points to the <\> of \N{... */ PERL_ARGS_ASSERT_GET_AND_CHECK_BACKSLASH_N_NAME; @@ -2581,21 +2599,6 @@ S_get_and_check_backslash_N_name(pTHX_ const char* s, const char* const e) return res; } - if (UTF && ! is_utf8_string_loc((U8 *) backslash_ptr, - e - backslash_ptr, - &first_bad_char_loc)) - { - _force_out_malformed_utf8_message(first_bad_char_loc, - (U8 *) PL_parser->bufend, - 0, - 0 /* 0 means don't die */ ); - yyerror_pv(Perl_form(aTHX_ - "Malformed UTF-8 character immediately after '%.*s'", - (int) (first_bad_char_loc - (U8 *) backslash_ptr), backslash_ptr), - SVf_UTF8); - return NULL; - } - res = new_constant( NULL, 0, "charnames", res, NULL, backslash_ptr, /* include the <}> */ e - backslash_ptr + 1); @@ -2720,7 +2723,9 @@ S_get_and_check_backslash_N_name(pTHX_ const char* s, const char* const e) const U8* first_bad_char_loc; STRLEN len; const char* const str = SvPV_const(res, len); - if (! is_utf8_string_loc((U8 *) str, len, &first_bad_char_loc)) { + if (UNLIKELY(! is_utf8_string_loc((U8 *) str, len, + &first_bad_char_loc))) + { _force_out_malformed_utf8_message(first_bad_char_loc, (U8 *) PL_parser->bufend, 0, @@ -3024,7 +3029,6 @@ S_scan_const(pTHX_ char *start) bool convert_unicode; IV real_range_max = 0; #endif - /* Get the code point values of the range ends. */ if (has_utf8) { /* We know the utf8 is valid, because we just constructed @@ -3050,7 +3054,11 @@ S_scan_const(pTHX_ char *start) * that code point is already in the output, twice. We can * just back up over the second instance and avoid all the rest * of the work. But if it is a variant character, it's been - * counted twice, so decrement */ + * counted twice, so decrement. (This unlikely scenario is + * special cased, like the one for a range of 2 code points + * below, only because the main-line code below needs a range + * of 3 or more to work without special casing. Might as well + * get it out of the way now.) */ if (UNLIKELY(range_max == range_min)) { d = max_ptr; if (! has_utf8 && ! UVCHR_IS_INVARIANT(range_max)) { @@ -3633,6 +3641,7 @@ S_scan_const(pTHX_ char *start) s++; if (*s != '{') { yyerror("Missing braces on \\N{}"); + *d++ = '\0'; continue; } s++; @@ -3644,7 +3653,7 @@ S_scan_const(pTHX_ char *start) } else { yyerror("Missing right brace on \\N{} or unescaped left brace after \\N"); } - continue; + yyquit(); /* Have exhausted the input. */ } /* Here it looks like a named character */ @@ -3663,6 +3672,7 @@ S_scan_const(pTHX_ char *start) "Invalid hexadecimal number in \\N{U+...}" ); s = e + 1; + *d++ = '\0'; continue; } while (++s < e) { @@ -3861,6 +3871,7 @@ S_scan_const(pTHX_ char *start) " in transliteration operator", /* +1 to include the "}" */ (int) (e + 1 - start), start)); + *d++ = '\0'; goto end_backslash_N; } @@ -3926,15 +3937,16 @@ S_scan_const(pTHX_ char *start) case 'c': s++; if (s < send) { - *d++ = grok_bslash_c(*s++, 1); + *d++ = grok_bslash_c(*s, 1); } else { yyerror("Missing control char name in \\c"); + yyquit(); /* Are at end of input, no sense continuing */ } #ifdef EBCDIC non_portable_endpoint++; #endif - continue; + break; /* printf-style backslashes, formfeeds, newlines, etc */ case 'b': @@ -4752,6 +4764,20 @@ Perl_yylex(pTHX) GV *gv = NULL; GV **gvp = NULL; + if (UNLIKELY(PL_parser->recheck_utf8_validity)) { + const U8* first_bad_char_loc; + if (UTF && UNLIKELY(! is_utf8_string_loc((U8 *) PL_bufptr, + PL_bufend - PL_bufptr, + &first_bad_char_loc))) + { + _force_out_malformed_utf8_message(first_bad_char_loc, + (U8 *) PL_bufend, + 0, + 1 /* 1 means die */ ); + NOT_REACHED; /* NOTREACHED */ + } + PL_parser->recheck_utf8_validity = FALSE; + } DEBUG_T( { SV* tmp = newSVpvs(""); PerlIO_printf(Perl_debug_log, "### %" IVdf ":LEX_%s/X%s %s\n", @@ -4999,7 +5025,16 @@ Perl_yylex(pTHX) s = PL_bufend; } else { + int save_error_count = PL_error_count; + s = scan_const(PL_bufptr); + + /* Quit if this was a pattern and there were errors. This prevents + * us from trying to regex compile a broken pattern, which could + * lead to segfaults, etc. */ + if (PL_lex_inpat && PL_error_count > save_error_count) { + yyquit(); + } if (*s == '\\') PL_lex_state = LEX_INTERPCASEMOD; else @@ -5108,12 +5143,6 @@ Perl_yylex(pTHX) switch (*s) { default: if (UTF) { - if (! isUTF8_CHAR((U8 *) s, (U8 *) PL_bufend)) { - _force_out_malformed_utf8_message((U8 *) s, (U8 *) PL_bufend, - 0, - 1 /* 1 means die */ ); - NOT_REACHED; /* NOTREACHED */ - } if (isIDFIRST_utf8_safe(s, PL_bufend)) { goto keylookup; } @@ -10542,10 +10571,14 @@ S_scan_str(pTHX_ char *start, int keep_bracketed_quoted, int keep_delims, int re } /* terminate when run out of buffer (the for() condition), or have found the terminator */ - else if (*s == term) { - if (termlen == 1) + else if (*s == term) { /* First byte of terminator matches */ + if (termlen == 1) /* If is the only byte, are done */ break; - if (s+termlen <= PL_bufend && memEQ(s, (char*)termstr, termlen)) + + /* If the remainder of the terminator matches, also are + * done, after checking that is a separate grapheme */ + if ( s + termlen <= PL_bufend + && memEQ(s + 1, (char*)termstr + 1, termlen - 1)) { if ( check_grapheme && UNLIKELY(! _is_grapheme((U8 *) start, @@ -10559,8 +10592,10 @@ S_scan_str(pTHX_ char *start, int keep_bracketed_quoted, int keep_delims, int re break; } } - else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF) + else if (!has_utf8 && !UTF8_IS_INVARIANT((U8)*s) && UTF) { has_utf8 = TRUE; + } + *to = *s; } } @@ -11428,6 +11463,29 @@ S_yywarn(pTHX_ const char *const s, U32 flags) return 0; } +void +Perl_abort_execution(pTHX_ const char * const msg, const char * const name) +{ + PERL_ARGS_ASSERT_ABORT_EXECUTION; + + if (PL_minus_c) + Perl_croak(aTHX_ "%s%s had compilation errors.\n", msg, name); + else { + Perl_croak(aTHX_ + "%sExecution of %s aborted due to compilation errors.\n", msg, name); + } + NOT_REACHED; /* NOTREACHED */ +} + +void +Perl_yyquit(pTHX) +{ + /* Called, after at least one error has been found, to abort the parse now, + * instead of trying to forge ahead */ + + yyerror_pvn(NULL, 0, 0); +} + int Perl_yyerror(pTHX_ const char *const s) { @@ -11451,100 +11509,120 @@ Perl_yyerror_pvn(pTHX_ const char *const s, STRLEN len, U32 flags) SV * const where_sv = newSVpvs_flags("", SVs_TEMP); int yychar = PL_parser->yychar; - PERL_ARGS_ASSERT_YYERROR_PVN; - - if (!yychar || (yychar == ';' && !PL_rsfp)) - sv_catpvs(where_sv, "at EOF"); - else if ( PL_oldoldbufptr - && PL_bufptr > PL_oldoldbufptr - && PL_bufptr - PL_oldoldbufptr < 200 - && PL_oldoldbufptr != PL_oldbufptr - && PL_oldbufptr != PL_bufptr) - { - /* - Only for NetWare: - The code below is removed for NetWare because it abends/crashes on NetWare - when the script has error such as not having the closing quotes like: - if ($var eq "value) - Checking of white spaces is anyway done in NetWare code. - */ + /* Output error message 's' with length 'len'. 'flags' are SV flags that + * apply. If the number of errors found is large enough, it abandons + * parsing. If 's' is NULL, there is no message, and it abandons + * processing unconditionally */ + + if (s != NULL) { + if (!yychar || (yychar == ';' && !PL_rsfp)) + sv_catpvs(where_sv, "at EOF"); + else if ( PL_oldoldbufptr + && PL_bufptr > PL_oldoldbufptr + && PL_bufptr - PL_oldoldbufptr < 200 + && PL_oldoldbufptr != PL_oldbufptr + && PL_oldbufptr != PL_bufptr) + { + /* + Only for NetWare: + The code below is removed for NetWare because it + abends/crashes on NetWare when the script has error such as + not having the closing quotes like: + if ($var eq "value) + Checking of white spaces is anyway done in NetWare code. + */ #ifndef NETWARE - while (isSPACE(*PL_oldoldbufptr)) - PL_oldoldbufptr++; + while (isSPACE(*PL_oldoldbufptr)) + PL_oldoldbufptr++; #endif - context = PL_oldoldbufptr; - contlen = PL_bufptr - PL_oldoldbufptr; - } - else if ( PL_oldbufptr - && PL_bufptr > PL_oldbufptr - && PL_bufptr - PL_oldbufptr < 200 - && PL_oldbufptr != PL_bufptr) { - /* - Only for NetWare: - The code below is removed for NetWare because it abends/crashes on NetWare - when the script has error such as not having the closing quotes like: - if ($var eq "value) - Checking of white spaces is anyway done in NetWare code. - */ + context = PL_oldoldbufptr; + contlen = PL_bufptr - PL_oldoldbufptr; + } + else if ( PL_oldbufptr + && PL_bufptr > PL_oldbufptr + && PL_bufptr - PL_oldbufptr < 200 + && PL_oldbufptr != PL_bufptr) { + /* + Only for NetWare: + The code below is removed for NetWare because it + abends/crashes on NetWare when the script has error such as + not having the closing quotes like: + if ($var eq "value) + Checking of white spaces is anyway done in NetWare code. + */ #ifndef NETWARE - while (isSPACE(*PL_oldbufptr)) - PL_oldbufptr++; + while (isSPACE(*PL_oldbufptr)) + PL_oldbufptr++; #endif - context = PL_oldbufptr; - contlen = PL_bufptr - PL_oldbufptr; - } - else if (yychar > 255) - sv_catpvs(where_sv, "next token ???"); - else if (yychar == YYEMPTY) { - if (PL_lex_state == LEX_NORMAL) - sv_catpvs(where_sv, "at end of line"); - else if (PL_lex_inpat) - sv_catpvs(where_sv, "within pattern"); - else - sv_catpvs(where_sv, "within string"); - } - else { - sv_catpvs(where_sv, "next char "); - if (yychar < 32) - Perl_sv_catpvf(aTHX_ where_sv, "^%c", toCTRL(yychar)); - else if (isPRINT_LC(yychar)) { - const char string = yychar; - sv_catpvn(where_sv, &string, 1); - } - else - Perl_sv_catpvf(aTHX_ where_sv, "\\%03o", yychar & 255); - } - msg = newSVpvn_flags(s, len, (flags & SVf_UTF8) | SVs_TEMP); - Perl_sv_catpvf(aTHX_ msg, " at %s line %" IVdf ", ", - OutCopFILE(PL_curcop), - (IV)(PL_parser->preambling == NOLINE - ? CopLINE(PL_curcop) - : PL_parser->preambling)); - if (context) - Perl_sv_catpvf(aTHX_ msg, "near \"%" UTF8f "\"\n", - UTF8fARG(UTF, contlen, context)); - else - Perl_sv_catpvf(aTHX_ msg, "%" SVf "\n", SVfARG(where_sv)); - if (PL_multi_start < PL_multi_end && (U32)(CopLINE(PL_curcop) - PL_multi_end) <= 1) { - Perl_sv_catpvf(aTHX_ msg, - " (Might be a runaway multi-line %c%c string starting on line %" IVdf ")\n", - (int)PL_multi_open,(int)PL_multi_close,(IV)PL_multi_start); - PL_multi_end = 0; - } - if (PL_in_eval & EVAL_WARNONLY) { - PL_in_eval &= ~EVAL_WARNONLY; - Perl_ck_warner_d(aTHX_ packWARN(WARN_SYNTAX), "%" SVf, SVfARG(msg)); + context = PL_oldbufptr; + contlen = PL_bufptr - PL_oldbufptr; + } + else if (yychar > 255) + sv_catpvs(where_sv, "next token ???"); + else if (yychar == YYEMPTY) { + if (PL_lex_state == LEX_NORMAL) + sv_catpvs(where_sv, "at end of line"); + else if (PL_lex_inpat) + sv_catpvs(where_sv, "within pattern"); + else + sv_catpvs(where_sv, "within string"); + } + else { + sv_catpvs(where_sv, "next char "); + if (yychar < 32) + Perl_sv_catpvf(aTHX_ where_sv, "^%c", toCTRL(yychar)); + else if (isPRINT_LC(yychar)) { + const char string = yychar; + sv_catpvn(where_sv, &string, 1); + } + else + Perl_sv_catpvf(aTHX_ where_sv, "\\%03o", yychar & 255); + } + msg = newSVpvn_flags(s, len, (flags & SVf_UTF8) | SVs_TEMP); + Perl_sv_catpvf(aTHX_ msg, " at %s line %" IVdf ", ", + OutCopFILE(PL_curcop), + (IV)(PL_parser->preambling == NOLINE + ? CopLINE(PL_curcop) + : PL_parser->preambling)); + if (context) + Perl_sv_catpvf(aTHX_ msg, "near \"%" UTF8f "\"\n", + UTF8fARG(UTF, contlen, context)); + else + Perl_sv_catpvf(aTHX_ msg, "%" SVf "\n", SVfARG(where_sv)); + if ( PL_multi_start < PL_multi_end + && (U32)(CopLINE(PL_curcop) - PL_multi_end) <= 1) + { + Perl_sv_catpvf(aTHX_ msg, + " (Might be a runaway multi-line %c%c string starting on" + " line %" IVdf ")\n", + (int)PL_multi_open,(int)PL_multi_close,(IV)PL_multi_start); + PL_multi_end = 0; + } + if (PL_in_eval & EVAL_WARNONLY) { + PL_in_eval &= ~EVAL_WARNONLY; + Perl_ck_warner_d(aTHX_ packWARN(WARN_SYNTAX), "%" SVf, SVfARG(msg)); + } + else { + qerror(msg); + } } - else - qerror(msg); - if (PL_error_count >= 10) { - SV * errsv; - if (PL_in_eval && ((errsv = ERRSV), SvCUR(errsv))) - Perl_croak(aTHX_ "%" SVf "%s has too many errors.\n", - SVfARG(errsv), OutCopFILE(PL_curcop)); - else - Perl_croak(aTHX_ "%s has too many errors.\n", - OutCopFILE(PL_curcop)); + if (s == NULL || PL_error_count >= 10) { + const char * msg = ""; + const char * const name = OutCopFILE(PL_curcop); + + if (PL_in_eval) { + SV * errsv = ERRSV; + if (SvCUR(errsv)) { + msg = Perl_form(aTHX_ "%" SVf, SVfARG(errsv)); + } + } + + if (s == NULL) { + abort_execution(msg, name); + } + else { + Perl_croak(aTHX_ "%s%s has too many errors.\n", msg, name); + } } PL_in_my = 0; PL_in_my_stash = NULL; diff --git a/utf8.c b/utf8.c index bec68a5883..89c8413f7c 100644 --- a/utf8.c +++ b/utf8.c @@ -754,11 +754,15 @@ Perl__is_utf8_char_helper(const U8 * const s, const U8 * e, const U32 flags) return UTF8SKIP(s); } -STATIC char * -S__byte_dump_string(pTHX_ const U8 * s, const STRLEN len) +char * +Perl__byte_dump_string(pTHX_ const U8 * s, const STRLEN len, const bool format) { /* Returns a mortalized C string that is a displayable copy of the 'len' - * bytes starting at 's', each in a \xXY format. */ + * bytes starting at 's'. 'format' gives how to display each byte. + * Currently, there are only two formats, so it is currently a bool: + * 0 \xab + * 1 ab (that is a space between two hex digit bytes) + */ const STRLEN output_len = 4 * len + 1; /* 4 bytes per each input, plus a trailing NUL */ @@ -776,8 +780,13 @@ S__byte_dump_string(pTHX_ const U8 * s, const STRLEN len) const unsigned high_nibble = (*s & 0xF0) >> 4; const unsigned low_nibble = (*s & 0x0F); - *d++ = '\\'; - *d++ = 'x'; + if (format) { + *d++ = ' '; + } + else { + *d++ = '\\'; + *d++ = 'x'; + } if (high_nibble < 10) { *d++ = high_nibble + '0'; @@ -827,7 +836,7 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s, return Perl_form(aTHX_ "%s: %s (unexpected non-continuation byte 0x%02x," " %s after start byte 0x%02x; need %d bytes, got %d)", malformed_text, - _byte_dump_string(s, print_len), + _byte_dump_string(s, print_len, 0), *(s + non_cont_byte_pos), where, *s, @@ -1401,7 +1410,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, if (pack_warn) { message = Perl_form(aTHX_ "%s: %s (overflows)", malformed_text, - _byte_dump_string(s0, send - s0)); + _byte_dump_string(s0, send - s0, 0)); } } } @@ -1437,7 +1446,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, "%s: %s (unexpected continuation byte 0x%02x," " with no preceding start byte)", malformed_text, - _byte_dump_string(s0, 1), *s0); + _byte_dump_string(s0, 1, 0), *s0); } } } @@ -1452,7 +1461,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, message = Perl_form(aTHX_ "%s: %s (too short; %d byte%s available, need %d)", malformed_text, - _byte_dump_string(s0, send - s0), + _byte_dump_string(s0, send - s0, 0), (int)avail_len, avail_len == 1 ? "" : "s", (int)expectlen); @@ -1516,8 +1525,8 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, " should be represented with a" " different, shorter sequence)", malformed_text, - _byte_dump_string(s0, send - s0), - _byte_dump_string(s0, curlen)); + _byte_dump_string(s0, send - s0, 0), + _byte_dump_string(s0, curlen, 0)); } else { U8 tmpbuf[UTF8_MAXBYTES+1]; @@ -1527,8 +1536,8 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, "%s: %s (overlong; instead use %s to represent" " U+%0*" UVXf ")", malformed_text, - _byte_dump_string(s0, send - s0), - _byte_dump_string(tmpbuf, e - tmpbuf), + _byte_dump_string(s0, send - s0, 0), + _byte_dump_string(tmpbuf, e - tmpbuf, 0), ((uv < 256) ? 2 : 4), /* Field width of 2 for small code points */ uv); @@ -1553,7 +1562,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, message = Perl_form(aTHX_ "UTF-16 surrogate (any UTF-8 sequence that" " starts with \"%s\" is for a surrogate)", - _byte_dump_string(s0, curlen)); + _byte_dump_string(s0, curlen, 0)); } else { message = Perl_form(aTHX_ @@ -1583,7 +1592,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, "Any UTF-8 sequence that starts with" " \"%s\" is for a non-Unicode code point," " may not be portable", - _byte_dump_string(s0, curlen)); + _byte_dump_string(s0, curlen, 0)); } else { message = Perl_form(aTHX_ @@ -1622,7 +1631,7 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s, "Any UTF-8 sequence that starts with" " \"%s\" is for a non-Unicode code" " point, and is not portable", - _byte_dump_string(s0, curlen)); + _byte_dump_string(s0, curlen, 0)); } else { message = Perl_form(aTHX_ -- Perl5 Master Repository
