Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-14 Thread Samuel GROOT

On 06/14/2016 12:47 AM, Eric Wong wrote:

Samuel GROOT  wrote:

On 06/09/2016 02:21 AM, Eric Wong wrote:

Samuel GROOT  wrote:

Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
Should we handle \n\r at end of line as well?


"\n\r" can never happen with local $/ = "\n"


If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at
the beginning of each line.

We could trim them with:

  s/^\r//;
  s/\r?\n$//;

But is it worth adding `s/^\r//;` to handle that extremely rare case?


I doubt it.  Having a "\r" in the wrong place is likely a bug in
whatever program that generated the email.  It should be exposed
so whoever generated that email has a chance to fix it on their
end rather than being quietly hidden.


s/\r?\n$// is fine then.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-13 Thread Eric Wong
Samuel GROOT  wrote:
> On 06/09/2016 02:21 AM, Eric Wong wrote:
> >Samuel GROOT  wrote:
> >>Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
> >>Should we handle \n\r at end of line as well?
> >
> >"\n\r" can never happen with local $/ = "\n"
> 
> If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at
> the beginning of each line.
> 
> We could trim them with:
> 
>   s/^\r//;
>   s/\r?\n$//;
> 
> But is it worth adding `s/^\r//;` to handle that extremely rare case?

I doubt it.  Having a "\r" in the wrong place is likely a bug in
whatever program that generated the email.  It should be exposed
so whoever generated that email has a chance to fix it on their
end rather than being quietly hidden.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-13 Thread Samuel GROOT

On 06/09/2016 02:21 AM, Eric Wong wrote:

Samuel GROOT  wrote:

Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
Should we handle \n\r at end of line as well?


"\n\r" can never happen with local $/ = "\n"


If the email file contains "\n\r", setting $/ = "\n" will leave "\r" at
the beginning of each line.

We could trim them with:

  s/^\r//;
  s/\r?\n$//;

But is it worth adding `s/^\r//;` to handle that extremely rare case?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-13 Thread Samuel GROOT



On 06/09/2016 08:51 AM, Eric Sunshine wrote:

On Wed, Jun 8, 2016 at 7:54 PM, Samuel GROOT
 wrote:

On 06/08/2016 10:17 PM, Junio C Hamano wrote:

Eric Sunshine  writes:

An embedded CR probably shouldn't happen, but I'm not convinced that
folding it out is a good idea. I would think that you'd want to
preserve the header's value verbatim. If anything, I'd expect to see
the regex tightened to:

s/\r?\n$//;


Yes, that would be more sensible than silently removing \r in the
middle which _is_ a sign of something funny going on.


s/\r?\n$// looks fine.

Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
[1] * http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm


You could certainly use s/\x0d?\x0a$// rather than s/\r?\n$// to be
really robust, but I doubt it matters much today. The reason for using
hex codes is that \r and \n will resolve to CR and LF in the local
character encoding, and those may not be 0x0d and 0x0a, respectively.
I believe the canonical example given in the Camel book was EBCIDIC in
which \r is 0x0d, but \n is 0x25, not 0x0a as it is in ASCII.


My question was more about the `\n\r` case handled by Email::Simple, 
does it make sense to handle it?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-09 Thread Eric Sunshine
On Wed, Jun 8, 2016 at 7:54 PM, Samuel GROOT
 wrote:
> On 06/08/2016 10:17 PM, Junio C Hamano wrote:
>> Eric Sunshine  writes:
>>> An embedded CR probably shouldn't happen, but I'm not convinced that
>>> folding it out is a good idea. I would think that you'd want to
>>> preserve the header's value verbatim. If anything, I'd expect to see
>>> the regex tightened to:
>>>
>>> s/\r?\n$//;
>>
>> Yes, that would be more sensible than silently removing \r in the
>> middle which _is_ a sign of something funny going on.
>
> s/\r?\n$// looks fine.
>
> Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
> [1] * 
> http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm

You could certainly use s/\x0d?\x0a$// rather than s/\r?\n$// to be
really robust, but I doubt it matters much today. The reason for using
hex codes is that \r and \n will resolve to CR and LF in the local
character encoding, and those may not be 0x0d and 0x0a, respectively.
I believe the canonical example given in the Camel book was EBCIDIC in
which \r is 0x0d, but \n is 0x25, not 0x0a as it is in ASCII.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Eric Wong
Samuel GROOT  wrote:
> Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1].
> Should we handle \n\r at end of line as well?

"\n\r" can never happen with local $/ = "\n"
 
> [1] * 
> http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Samuel GROOT

On 06/08/2016 10:17 PM, Junio C Hamano wrote:

Eric Sunshine  writes:


An embedded CR probably shouldn't happen, but I'm not convinced that
folding it out is a good idea. I would think that you'd want to
preserve the header's value verbatim. If anything, I'd expect to see
the regex tightened to:

s/\r?\n$//;


Yes, that would be more sensible than silently removing \r in the
middle which _is_ a sign of something funny going on.

Alternately, consider using 'chop' or 'chomp'.


Even if you use chomp(), you'd still need to worry about possible \r
at the end, no?


'chomp' is what we used before, but with *.eml files (microsoft's file 
format, with CRLF), '\n' were removed but '\r' remained, that's why we 
used s/\r\n|\r|\n//.


s/\r?\n$// looks fine.

Email::Simple library uses qr/\x0a\x0d|\x0d\x0a|\x0a|\x0d/ [1]. Should 
we handle \n\r at end of line as well?


[1] * 
http://cpansearch.perl.org/src/RJBS/Email-Simple-2.210/lib/Email/Simple.pm

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Eric Wong
Samuel GROOT  wrote:
> +++ b/perl/Git.pm

> +sub parse_email {
> + my %mail = ();
> + my $fh = shift;
> + my $last_header;
> +
> + # Unfold and parse multiline header fields

When you libify, I suggest you localize $/ since $/
may be set to something other than "\n" by a caller
and change the behavior of <$fh> and $fh->getline.

local $/ = "\n";

> + while (<$fh>) {
> + last if /^\s*$/;
> + s/\r\n|\n|\r//;

And, as Eric Sunshine stated:

s/\r?\n$//;

Explicitly localizing $/ means you wouldn't have to worry about
multiple \n showing up in the line, either.
And chomp/chop wouldn't work, here.

Otherwise I like the move to Git.pm, thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Junio C Hamano
Eric Sunshine  writes:

> An embedded CR probably shouldn't happen, but I'm not convinced that
> folding it out is a good idea. I would think that you'd want to
> preserve the header's value verbatim. If anything, I'd expect to see
> the regex tightened to:
>
> s/\r?\n$//;

Yes, that would be more sensible than silently removing \r in the
middle which _is_ a sign of something funny going on.

> Alternately, consider using 'chop' or 'chomp'.

Even if you use chomp(), you'd still need to worry about possible \r
at the end, no?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Eric Sunshine
On Wed, Jun 8, 2016 at 3:30 PM, Samuel GROOT
 wrote:
> On 06/08/2016 08:12 PM, Eric Sunshine wrote:
>> On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamano  wrote:
>>> Samuel GROOT  writes:
>>> The pattern is not anchored at the right end of the string;
>>> intended?  Is it worth worrying about a lone '\r'?
>>
>> Thanks, I think you covered pretty much everything I was going to say.
>> I'd just add that if the matching is going to be kept loose like this
>> (rather than anchoring it), then s/[\r\n]+//g might be easier to read,
>> but it's a minor point.
>
> Indeed s/[\r\n]+//g is way better, it works even if there's a CR in the
> middle of the line.

An embedded CR probably shouldn't happen, but I'm not convinced that
folding it out is a good idea. I would think that you'd want to
preserve the header's value verbatim. If anything, I'd expect to see
the regex tightened to:

s/\r?\n$//;

Alternately, consider using 'chop' or 'chomp'.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Samuel GROOT

On 06/08/2016 09:31 PM, Junio C Hamano wrote:

Samuel GROOT  writes:


I think it's the best way to do it indeed. Furthermore, we did trim
CRs and LFs in header fields, but not in the message, making the
subroutine inconsistent.

Should we rename the subroutine to `parse_header` or leave it as it is?


If it lives inside git-send-email, then parse_header is sufficient
as everybody would know it is about e-mail without being told.  If
it is in Git.pm, then parse_email_header would be more appropriate.


It currently lives in Git.pm, following Eric Wong's advice to have a 
more packaged code.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Samuel GROOT

On 06/08/2016 07:58 PM, Junio C Hamano wrote:

Samuel GROOT  writes:

+sub parse_email {
+   my %mail = ();
+   my $fh = shift;
+   my $last_header;



+   # Unfold and parse multiline header fields
+   while (<$fh>) {
+   last if /^\s*$/;


You stop at the end of fields before the message body starts.


+   s/\r\n|\n|\r//;


The pattern is not anchored at the right end of the string;
intended?  Is it worth worrying about a lone '\r'?


A lone '\r' shouldn't happen, but we are never too careful. It's fixed 
with what Eric suggested.



+   if (/^([^\s:]+):[\s]+(.*)$/) {
+   $last_header = lc($1);
+   @{$mail{$last_header}} = ()
+   unless defined $mail{$last_header};
+   push @{$mail{$last_header}}, $2;



+   } elsif (/^\s+\S/ and defined $last_header) {
+   s/^\s+/ /;
+   push @{$mail{$last_header}}, $_;


Even though the comment said "unfold", you do not really do the
unfolding here and the caller can (if it wants to) figure out where
one logical header was folded in the original into multiple physical
lines, because you are returning an arrayref.

However, that means the caller still cannot tell what the original
was if you are given:

X-header: a b
c
X-header: d

as you would return { 'X-header' => ["a b", "c", "d")] }

In that sense, it may be better to do a real unfolding here, so that
it would return { 'X-header' => ["a b c", "d"] } from here instead?

I.e. instead of "push @{...}, $_", append $_ to the last element of
that array?


I will do that. It makes more sense regarding Subject split into several 
lines, for example.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Junio C Hamano
Samuel GROOT  writes:

> I think it's the best way to do it indeed. Furthermore, we did trim
> CRs and LFs in header fields, but not in the message, making the
> subroutine inconsistent.
>
> Should we rename the subroutine to `parse_header` or leave it as it is?

If it lives inside git-send-email, then parse_header is sufficient
as everybody would know it is about e-mail without being told.  If
it is in Git.pm, then parse_email_header would be more appropriate.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Samuel GROOT

On 06/08/2016 08:12 PM, Eric Sunshine wrote:

On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamano  wrote:

Samuel GROOT  writes:

+sub parse_email {
+ my %mail = ();
+ my $fh = shift;
+ my $last_header;



+ # Unfold and parse multiline header fields
+ while (<$fh>) {
+ last if /^\s*$/;


You stop at the end of fields before the message body starts.


+ s/\r\n|\n|\r//;


The pattern is not anchored at the right end of the string;
intended?  Is it worth worrying about a lone '\r'?


Thanks, I think you covered pretty much everything I was going to say.
I'd just add that if the matching is going to be kept loose like this
(rather than anchoring it), then s/[\r\n]+//g might be easier to read,
but it's a minor point.


Indeed s/[\r\n]+//g is way better, it works even if there's a CR in the 
middle of the line.



+ if (/^([^\s:]+):[\s]+(.*)$/) {
+ $last_header = lc($1);
+ @{$mail{$last_header}} = ()
+ unless defined $mail{$last_header};
+ push @{$mail{$last_header}}, $2;



+ } elsif (/^\s+\S/ and defined $last_header) {
+ s/^\s+/ /;
+ push @{$mail{$last_header}}, $_;


Even though the comment said "unfold", you do not really do the
unfolding here and the caller can (if it wants to) figure out where
one logical header was folded in the original into multiple physical
lines, because you are returning an arrayref.


Also, the comment about folding lines should be moved down the part of
the code which is actually (supposed to be) doing the folding rather
than having the comment at the top of the loop.


Will do in next re-roll.


+ # Separate body from header
+ $mail{"body"} = [(<$fh>)];
+
+ return \%mail;


The name of the local thing is not observable from the caller, but
because this is "parse-email-header" and returns "header fields"
without reading the "mail", perhaps call it %header instead?


If there is (for some reason) a mail header named 'body', then this
assignment of the body portion of the message will overwrite it.
Perhaps this function should instead return multiple values: the
header hash, and the message body.


I will drop the body part in re-roll.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Samuel GROOT

On 06/08/2016 08:32 PM, Junio C Hamano wrote:

Eric Sunshine  writes:

+ # Separate body from header
+ $mail{"body"} = [(<$fh>)];
+
+ return \%mail;


The name of the local thing is not observable from the caller, but
because this is "parse-email-header" and returns "header fields"
without reading the "mail", perhaps call it %header instead?


If there is (for some reason) a mail header named 'body', then this
assignment of the body portion of the message will overwrite it.
Perhaps this function should instead return multiple values: the
header hash, and the message body.


Ah, I missed that it is attempting to return the body, too.

Because the function takes an open filehandle, I think it is better
to leave it to the callers.  A caller that is only interested in
headers can just close $fh after this helper returns without reading
body that it is not interested in, and a caller that wants to read
the body can do the slurping itself.


I think it's the best way to do it indeed. Furthermore, we did trim CRs 
and LFs in header fields, but not in the message, making the subroutine 
inconsistent.


Should we rename the subroutine to `parse_header` or leave it as it is?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Junio C Hamano
Eric Sunshine  writes:

>>> + # Separate body from header
>>> + $mail{"body"} = [(<$fh>)];
>>> +
>>> + return \%mail;
>>
>> The name of the local thing is not observable from the caller, but
>> because this is "parse-email-header" and returns "header fields"
>> without reading the "mail", perhaps call it %header instead?
>
> If there is (for some reason) a mail header named 'body', then this
> assignment of the body portion of the message will overwrite it.
> Perhaps this function should instead return multiple values: the
> header hash, and the message body.

Ah, I missed that it is attempting to return the body, too.

Because the function takes an open filehandle, I think it is better
to leave it to the callers.  A caller that is only interested in
headers can just close $fh after this helper returns without reading
body that it is not interested in, and a caller that wants to read
the body can do the slurping itself.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Eric Sunshine
On Wed, Jun 8, 2016 at 1:58 PM, Junio C Hamano  wrote:
> Samuel GROOT  writes:
>> +sub parse_email {
>> + my %mail = ();
>> + my $fh = shift;
>> + my $last_header;
>
>> + # Unfold and parse multiline header fields
>> + while (<$fh>) {
>> + last if /^\s*$/;
>
> You stop at the end of fields before the message body starts.
>
>> + s/\r\n|\n|\r//;
>
> The pattern is not anchored at the right end of the string;
> intended?  Is it worth worrying about a lone '\r'?

Thanks, I think you covered pretty much everything I was going to say.
I'd just add that if the matching is going to be kept loose like this
(rather than anchoring it), then s/[\r\n]+//g might be easier to read,
but it's a minor point.

>> + if (/^([^\s:]+):[\s]+(.*)$/) {
>> + $last_header = lc($1);
>> + @{$mail{$last_header}} = ()
>> + unless defined $mail{$last_header};
>> + push @{$mail{$last_header}}, $2;
>
>> + } elsif (/^\s+\S/ and defined $last_header) {
>> + s/^\s+/ /;
>> + push @{$mail{$last_header}}, $_;
>
> Even though the comment said "unfold", you do not really do the
> unfolding here and the caller can (if it wants to) figure out where
> one logical header was folded in the original into multiple physical
> lines, because you are returning an arrayref.

Also, the comment about folding lines should be moved down the part of
the code which is actually (supposed to be) doing the folding rather
than having the comment at the top of the loop.

> However, that means the caller still cannot tell what the original
> was if you are given:
>
> X-header: a b
> c
> X-header: d
>
> as you would return { 'X-header' => ["a b", "c", "d")] }
>
> In that sense, it may be better to do a real unfolding here, so that
> it would return { 'X-header' => ["a b c", "d"] } from here instead?
>
> I.e. instead of "push @{...}, $_", append $_ to the last element of
> that array?

Right.

>> + # Separate body from header
>> + $mail{"body"} = [(<$fh>)];
>> +
>> + return \%mail;
>
> The name of the local thing is not observable from the caller, but
> because this is "parse-email-header" and returns "header fields"
> without reading the "mail", perhaps call it %header instead?

If there is (for some reason) a mail header named 'body', then this
assignment of the body portion of the message will overwrite it.
Perhaps this function should instead return multiple values: the
header hash, and the message body.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] send-email: create email parser subroutine

2016-06-08 Thread Junio C Hamano
Samuel GROOT  writes:

> +sub parse_email {
> + my %mail = ();
> + my $fh = shift;
> + my $last_header;

> + # Unfold and parse multiline header fields
> + while (<$fh>) {
> + last if /^\s*$/;

You stop at the end of fields before the message body starts.

> + s/\r\n|\n|\r//;

The pattern is not anchored at the right end of the string;
intended?  Is it worth worrying about a lone '\r'?

> + if (/^([^\s:]+):[\s]+(.*)$/) {
> + $last_header = lc($1);
> + @{$mail{$last_header}} = ()
> + unless defined $mail{$last_header};
> + push @{$mail{$last_header}}, $2;

> + } elsif (/^\s+\S/ and defined $last_header) {
> + s/^\s+/ /;
> + push @{$mail{$last_header}}, $_;

Even though the comment said "unfold", you do not really do the
unfolding here and the caller can (if it wants to) figure out where
one logical header was folded in the original into multiple physical
lines, because you are returning an arrayref.

However, that means the caller still cannot tell what the original
was if you are given:

X-header: a b
c
X-header: d

as you would return { 'X-header' => ["a b", "c", "d")] }

In that sense, it may be better to do a real unfolding here, so that
it would return { 'X-header' => ["a b c", "d"] } from here instead?

I.e. instead of "push @{...}, $_", append $_ to the last element of
that array?

> + } else {
> + die("Mail format undefined!\n");

What does that mean?  It would probably help if you included the
line that the code did not understand in the message.


> + }
> + }
> +
> + # Separate body from header
> + $mail{"body"} = [(<$fh>)];
> +
> + return \%mail;

The name of the local thing is not observable from the caller, but
because this is "parse-email-header" and returns "header fields"
without reading the "mail", perhaps call it %header instead?

> +}
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html