Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/14/2016 12:47 AM, Eric Wong wrote: Samuel GROOTwrote: On 06/09/2016 02:21 AM, Eric Wong wrote: Samuel GROOT wrote: Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. Should we handle \n\r at end of line as well? "\n\r" can never happen with local $/ = "\n" If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at the beginning of each line. We could trim them with: s/^\r//; s/\r?\n$//; But is it worth adding `s/^\r//;` to handle that extremely rare case? I doubt it. Having a "\r" in the wrong place is likely a bug in whatever program that generated the email. It should be exposed so whoever generated that email has a chance to fix it on their end rather than being quietly hidden. s/\r?\n$// is fine then. Thanks. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Samuel GROOTwrote: > On 06/09/2016 02:21 AM, Eric Wong wrote: > >Samuel GROOT wrote: > >>Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. > >>Should we handle \n\r at end of line as well? > > > >"\n\r" can never happen with local $/ = "\n" > > If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at > the beginning of each line. > > We could trim them with: > > s/^\r//; > s/\r?\n$//; > > But is it worth adding `s/^\r//;` to handle that extremely rare case? I doubt it. Having a "\r" in the wrong place is likely a bug in whatever program that generated the email. It should be exposed so whoever generated that email has a chance to fix it on their end rather than being quietly hidden. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/09/2016 02:21 AM, Eric Wong wrote: Samuel GROOTwrote: Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. Should we handle \n\r at end of line as well? "\n\r" can never happen with local $/ = "\n" If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at the beginning of each line. We could trim them with: s/^\r//; s/\r?\n$//; But is it worth adding `s/^\r//;` to handle that extremely rare case? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/09/2016 08:51 AM, Eric Sunshine wrote: On Wed, Jun 8, 2016 at 7:54 PM, Samuel GROOTwrote: On 06/08/2016 10:17 PM, Junio C Hamano wrote: Eric Sunshine writes: An embedded CR probably shouldn't happen, but I'm not convinced that folding it out is a good idea. I would think that you'd want to preserve the header's value verbatim. If anything, I'd expect to see the regex tightened to: s/\r?\n$//; Yes, that would be more sensible than silently removing \r in the middle which _is_ a sign of something funny going on. s/\r?\n$// looks fine. Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. [1] * http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm You could certainly use s/\x0d?\x0a$// rather than s/\r?\n$// to be really robust, but I doubt it matters much today. The reason for using hex codes is that \r and \n will resolve to CR and LF in the local character encoding, and those may not be 0x0d and 0x0a, respectively. I believe the canonical example given in the Camel book was EBCIDIC in which \r is 0x0d, but \n is 0x25, not 0x0a as it is in ASCII. My question was more about the `\n\r` case handled by Email::Simple, does it make sense to handle it? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On Wed, Jun 8, 2016 at 7:54 PM, Samuel GROOTwrote: > On 06/08/2016 10:17 PM, Junio C Hamano wrote: >> Eric Sunshine writes: >>> An embedded CR probably shouldn't happen, but I'm not convinced that >>> folding it out is a good idea. I would think that you'd want to >>> preserve the header's value verbatim. If anything, I'd expect to see >>> the regex tightened to: >>> >>> s/\r?\n$//; >> >> Yes, that would be more sensible than silently removing \r in the >> middle which _is_ a sign of something funny going on. > > s/\r?\n$// looks fine. > > Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. > [1] * > http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm You could certainly use s/\x0d?\x0a$// rather than s/\r?\n$// to be really robust, but I doubt it matters much today. The reason for using hex codes is that \r and \n will resolve to CR and LF in the local character encoding, and those may not be 0x0d and 0x0a, respectively. I believe the canonical example given in the Camel book was EBCIDIC in which \r is 0x0d, but \n is 0x25, not 0x0a as it is in ASCII. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Samuel GROOTwrote: > Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. > Should we handle \n\r at end of line as well? "\n\r" can never happen with local $/ = "\n" > [1] * > http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/08/2016 10:17 PM, Junio C Hamano wrote: Eric Sunshinewrites: An embedded CR probably shouldn't happen, but I'm not convinced that folding it out is a good idea. I would think that you'd want to preserve the header's value verbatim. If anything, I'd expect to see the regex tightened to: s/\r?\n$//; Yes, that would be more sensible than silently removing \r in the middle which _is_ a sign of something funny going on. Alternately, consider using 'chop' or 'chomp'. Even if you use chomp(), you'd still need to worry about possible \r at the end, no? 'chomp' is what we used before, but with *.eml files (microsoft's file format, with CRLF), '\n' were removed but '\r' remained, that's why we used s/\r\n|\r|\n//. s/\r?\n$// looks fine. Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. Should we handle \n\r at end of line as well? [1] * http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Samuel GROOTwrote: > +++ b/perl/Git.pm > +sub parse_email { > + my %mail = (); > + my $fh = shift; > + my $last_header; > + > + # Unfold and parse multiline header fields When you libify, I suggest you localize $/ since $/ may be set to something other than "\n" by a caller and change the behavior of <$fh> and $fh->getline. local $/ = "\n"; > + while (<$fh>) { > + last if /^\s*$/; > + s/\r\n|\n|\r//; And, as Eric Sunshine stated: s/\r?\n$//; Explicitly localizing $/ means you wouldn't have to worry about multiple \n showing up in the line, either. And chomp/chop wouldn't work, here. Otherwise I like the move to Git.pm, thanks. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Eric Sunshinewrites: > An embedded CR probably shouldn't happen, but I'm not convinced that > folding it out is a good idea. I would think that you'd want to > preserve the header's value verbatim. If anything, I'd expect to see > the regex tightened to: > > s/\r?\n$//; Yes, that would be more sensible than silently removing \r in the middle which _is_ a sign of something funny going on. > Alternately, consider using 'chop' or 'chomp'. Even if you use chomp(), you'd still need to worry about possible \r at the end, no? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On Wed, Jun 8, 2016 at 3:30 PM, Samuel GROOTwrote: > On 06/08/2016 08:12 PM, Eric Sunshine wrote: >> On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamano wrote: >>> Samuel GROOT writes: >>> The pattern is not anchored at the right end of the string; >>> intended? Is it worth worrying about a lone '\r'? >> >> Thanks, I think you covered pretty much everything I was going to say. >> I'd just add that if the matching is going to be kept loose like this >> (rather than anchoring it), then s/[\r\n]+//g might be easier to read, >> but it's a minor point. > > Indeed s/[\r\n]+//g is way better, it works even if there's a CR in the > middle of the line. An embedded CR probably shouldn't happen, but I'm not convinced that folding it out is a good idea. I would think that you'd want to preserve the header's value verbatim. If anything, I'd expect to see the regex tightened to: s/\r?\n$//; Alternately, consider using 'chop' or 'chomp'. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/08/2016 09:31 PM, Junio C Hamano wrote: Samuel GROOTwrites: I think it's the best way to do it indeed. Furthermore, we did trim CRs and LFs in header fields, but not in the message, making the subroutine inconsistent. Should we rename the subroutine to `parse_header` or leave it as it is? If it lives inside git-send-email, then parse_header is sufficient as everybody would know it is about e-mail without being told. If it is in Git.pm, then parse_email_header would be more appropriate. It currently lives in Git.pm, following Eric Wong's advice to have a more packaged code. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/08/2016 07:58 PM, Junio C Hamano wrote: Samuel GROOTwrites: +sub parse_email { + my %mail = (); + my $fh = shift; + my $last_header; + # Unfold and parse multiline header fields + while (<$fh>) { + last if /^\s*$/; You stop at the end of fields before the message body starts. + s/\r\n|\n|\r//; The pattern is not anchored at the right end of the string; intended? Is it worth worrying about a lone '\r'? A lone '\r' shouldn't happen, but we are never too careful. It's fixed with what Eric suggested. + if (/^([^\s:]+):[\s]+(.*)$/) { + $last_header = lc($1); + @{$mail{$last_header}} = () + unless defined $mail{$last_header}; + push @{$mail{$last_header}}, $2; + } elsif (/^\s+\S/ and defined $last_header) { + s/^\s+/ /; + push @{$mail{$last_header}}, $_; Even though the comment said "unfold", you do not really do the unfolding here and the caller can (if it wants to) figure out where one logical header was folded in the original into multiple physical lines, because you are returning an arrayref. However, that means the caller still cannot tell what the original was if you are given: X-header: a b c X-header: d as you would return { 'X-header' => ["a b", "c", "d")] } In that sense, it may be better to do a real unfolding here, so that it would return { 'X-header' => ["a b c", "d"] } from here instead? I.e. instead of "push @{...}, $_", append $_ to the last element of that array? I will do that. It makes more sense regarding Subject split into several lines, for example. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Samuel GROOTwrites: > I think it's the best way to do it indeed. Furthermore, we did trim > CRs and LFs in header fields, but not in the message, making the > subroutine inconsistent. > > Should we rename the subroutine to `parse_header` or leave it as it is? If it lives inside git-send-email, then parse_header is sufficient as everybody would know it is about e-mail without being told. If it is in Git.pm, then parse_email_header would be more appropriate. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/08/2016 08:12 PM, Eric Sunshine wrote: On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamanowrote: Samuel GROOT writes: +sub parse_email { + my %mail = (); + my $fh = shift; + my $last_header; + # Unfold and parse multiline header fields + while (<$fh>) { + last if /^\s*$/; You stop at the end of fields before the message body starts. + s/\r\n|\n|\r//; The pattern is not anchored at the right end of the string; intended? Is it worth worrying about a lone '\r'? Thanks, I think you covered pretty much everything I was going to say. I'd just add that if the matching is going to be kept loose like this (rather than anchoring it), then s/[\r\n]+//g might be easier to read, but it's a minor point. Indeed s/[\r\n]+//g is way better, it works even if there's a CR in the middle of the line. + if (/^([^\s:]+):[\s]+(.*)$/) { + $last_header = lc($1); + @{$mail{$last_header}} = () + unless defined $mail{$last_header}; + push @{$mail{$last_header}}, $2; + } elsif (/^\s+\S/ and defined $last_header) { + s/^\s+/ /; + push @{$mail{$last_header}}, $_; Even though the comment said "unfold", you do not really do the unfolding here and the caller can (if it wants to) figure out where one logical header was folded in the original into multiple physical lines, because you are returning an arrayref. Also, the comment about folding lines should be moved down the part of the code which is actually (supposed to be) doing the folding rather than having the comment at the top of the loop. Will do in next re-roll. + # Separate body from header + $mail{"body"} = [(<$fh>)]; + + return \%mail; The name of the local thing is not observable from the caller, but because this is "parse-email-header" and returns "header fields" without reading the "mail", perhaps call it %header instead? If there is (for some reason) a mail header named 'body', then this assignment of the body portion of the message will overwrite it. Perhaps this function should instead return multiple values: the header hash, and the message body. I will drop the body part in re-roll. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On 06/08/2016 08:32 PM, Junio C Hamano wrote: Eric Sunshinewrites: + # Separate body from header + $mail{"body"} = [(<$fh>)]; + + return \%mail; The name of the local thing is not observable from the caller, but because this is "parse-email-header" and returns "header fields" without reading the "mail", perhaps call it %header instead? If there is (for some reason) a mail header named 'body', then this assignment of the body portion of the message will overwrite it. Perhaps this function should instead return multiple values: the header hash, and the message body. Ah, I missed that it is attempting to return the body, too. Because the function takes an open filehandle, I think it is better to leave it to the callers. A caller that is only interested in headers can just close $fh after this helper returns without reading body that it is not interested in, and a caller that wants to read the body can do the slurping itself. I think it's the best way to do it indeed. Furthermore, we did trim CRs and LFs in header fields, but not in the message, making the subroutine inconsistent. Should we rename the subroutine to `parse_header` or leave it as it is? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Eric Sunshinewrites: >>> + # Separate body from header >>> + $mail{"body"} = [(<$fh>)]; >>> + >>> + return \%mail; >> >> The name of the local thing is not observable from the caller, but >> because this is "parse-email-header" and returns "header fields" >> without reading the "mail", perhaps call it %header instead? > > If there is (for some reason) a mail header named 'body', then this > assignment of the body portion of the message will overwrite it. > Perhaps this function should instead return multiple values: the > header hash, and the message body. Ah, I missed that it is attempting to return the body, too. Because the function takes an open filehandle, I think it is better to leave it to the callers. A caller that is only interested in headers can just close $fh after this helper returns without reading body that it is not interested in, and a caller that wants to read the body can do the slurping itself. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamanowrote: > Samuel GROOT writes: >> +sub parse_email { >> + my %mail = (); >> + my $fh = shift; >> + my $last_header; > >> + # Unfold and parse multiline header fields >> + while (<$fh>) { >> + last if /^\s*$/; > > You stop at the end of fields before the message body starts. > >> + s/\r\n|\n|\r//; > > The pattern is not anchored at the right end of the string; > intended? Is it worth worrying about a lone '\r'? Thanks, I think you covered pretty much everything I was going to say. I'd just add that if the matching is going to be kept loose like this (rather than anchoring it), then s/[\r\n]+//g might be easier to read, but it's a minor point. >> + if (/^([^\s:]+):[\s]+(.*)$/) { >> + $last_header = lc($1); >> + @{$mail{$last_header}} = () >> + unless defined $mail{$last_header}; >> + push @{$mail{$last_header}}, $2; > >> + } elsif (/^\s+\S/ and defined $last_header) { >> + s/^\s+/ /; >> + push @{$mail{$last_header}}, $_; > > Even though the comment said "unfold", you do not really do the > unfolding here and the caller can (if it wants to) figure out where > one logical header was folded in the original into multiple physical > lines, because you are returning an arrayref. Also, the comment about folding lines should be moved down the part of the code which is actually (supposed to be) doing the folding rather than having the comment at the top of the loop. > However, that means the caller still cannot tell what the original > was if you are given: > > X-header: a b > c > X-header: d > > as you would return { 'X-header' => ["a b", "c", "d")] } > > In that sense, it may be better to do a real unfolding here, so that > it would return { 'X-header' => ["a b c", "d"] } from here instead? > > I.e. instead of "push @{...}, $_", append $_ to the last element of > that array? Right. >> + # Separate body from header >> + $mail{"body"} = [(<$fh>)]; >> + >> + return \%mail; > > The name of the local thing is not observable from the caller, but > because this is "parse-email-header" and returns "header fields" > without reading the "mail", perhaps call it %header instead? If there is (for some reason) a mail header named 'body', then this assignment of the body portion of the message will overwrite it. Perhaps this function should instead return multiple values: the header hash, and the message body. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 4/6] send-email: create email parser subroutine
Samuel GROOTwrites: > +sub parse_email { > + my %mail = (); > + my $fh = shift; > + my $last_header; > + # Unfold and parse multiline header fields > + while (<$fh>) { > + last if /^\s*$/; You stop at the end of fields before the message body starts. > + s/\r\n|\n|\r//; The pattern is not anchored at the right end of the string; intended? Is it worth worrying about a lone '\r'? > + if (/^([^\s:]+):[\s]+(.*)$/) { > + $last_header = lc($1); > + @{$mail{$last_header}} = () > + unless defined $mail{$last_header}; > + push @{$mail{$last_header}}, $2; > + } elsif (/^\s+\S/ and defined $last_header) { > + s/^\s+/ /; > + push @{$mail{$last_header}}, $_; Even though the comment said "unfold", you do not really do the unfolding here and the caller can (if it wants to) figure out where one logical header was folded in the original into multiple physical lines, because you are returning an arrayref. However, that means the caller still cannot tell what the original was if you are given: X-header: a b c X-header: d as you would return { 'X-header' => ["a b", "c", "d")] } In that sense, it may be better to do a real unfolding here, so that it would return { 'X-header' => ["a b c", "d"] } from here instead? I.e. instead of "push @{...}, $_", append $_ to the last element of that array? > + } else { > + die("Mail format undefined!\n"); What does that mean? It would probably help if you included the line that the code did not understand in the message. > + } > + } > + > + # Separate body from header > + $mail{"body"} = [(<$fh>)]; > + > + return \%mail; The name of the local thing is not observable from the caller, but because this is "parse-email-header" and returns "header fields" without reading the "mail", perhaps call it %header instead? > +} -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html