Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-14 Thread David E. Wheeler
On Jan 13, 2015, at 10:11 PM, Karl Williamson pub...@khwilliamson.com wrote:

 Nobody has explained to me why we accept uppercase when all existing pod 
 commands (I believe) are entirely lowercase.  And shouldn't the digits only 
 be in the final position?  I don't know the answers, but am just pointing out 
 potential problems.

I think it is an historical thing. No harm in it, really. I would not be 
surprised if there were uppercase Pod commands in the wild.

David





Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-13 Thread David E. Wheeler
On Jan 12, 2015, at 11:42 AM, David E. Wheeler da...@justatheory.com wrote:

 Honest, since the current regex matches stuff that is not in fact Pod, I 
 think it is reasonable to tighten up the regex to
 
/\A=([a-zA-Z]+[0-9]*)\b/

That one, it turns out, was no less liberal than the previous regex. I added a 
test matching the pattern Randy identified, and it failed with this regex, too. 
So I instead copied the regex from later in the file, which *is* sufficiently 
more strict, and brings them into line, to boot. The change is here:

  https://github.com/theory/pod-simple/commit/31942ec

Look good? If so, I will update perlpodspec to match it and send it off to p5p.

Best,

David
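
For readers following along, here is a minimal sketch of why the intermediate pattern was no less liberal. It assumes the offending line looks like the =F\0 example from the original report; the sample bytes and variable names are illustrative and not taken from Pod::Simple or its test suite.

```perl
use strict;
use warnings;

# Assumed sample: a line from a binary blob, "=F" followed by a NUL byte.
my $line = "=F\0\x01\x02";

# The intermediate proposal: \b is satisfied between "F" (a word character)
# and the NUL byte (a non-word character), so the binary line still matches.
my $proposed = qr/\A=([a-zA-Z]+[0-9]*)\b/;

# The stricter pattern quoted from later in BlackBox.pm: the command name
# must be followed by whitespace or the end of the line.
my $strict = qr/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s;

for my $case ( [ proposed => $proposed ], [ strict => $strict ] ) {
    my ( $name, $re ) = @$case;
    printf "%-8s %s\n", $name, $line =~ $re ? "matches" : "does not match";
}
```

The word boundary only requires that the next character not be a word character, which a NUL byte satisfies; requiring whitespace or end of line is what actually rejects the binary.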





Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-08 Thread Marek Rouchal
Right. Now I remember old threads where people argued that POD parsers 
should do exactly the same as the Perl parser. IIRC the conclusion was 
that using something like PPI to handle pathological cases such as multiline 
strings or here-docs would be overkill, so POD is whatever starts with a (valid) 
POD directive. The only thing that could perhaps be changed is to skip a 
__DATA__ section (but keep parsing, since there may be POD after __END__!)

I see a potential way of resolving this, but it looks like quite a big effort: 
the Perl parser could store all text it skips as POD in a structure similar 
to __DATA__, so that POD parsing utilities could use a pseudo-filehandle like 
this (reading POD for the current script):

while (<main::__POD__>) {
...
}

and for other files there could be a special open() discipline to return only 
POD using the same parser. What do you think?

-Marek


Sent from my Samsung Galaxy smartphone.


-------- Original message --------
From: Randy Stauner rwstau...@cpan.org 
Date: 08.01.2015  19:26  (GMT+01:00) 
To: David E. Wheeler da...@justatheory.com 
Cc: Marek Rouchal ma...@rouchal.net, Karl Williamson 
pub...@khwilliamson.com, pod-people@perl.org 
Subject: Re: Pod::Simple can treat binary as pod due to liberal/inconsistent 
regexp patterns 



Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-08 Thread Ricardo Signes
* David E. Wheeler da...@justatheory.com [2015-01-08T00:38:04]
 I agree that’s too liberal. I suggest
 
 /\A=([a-zA-Z]+\d*)\b/

<trolling?>
Surely you want [0-9] instead of \d, lest we end up with =head୩ !
</trolling?>

-- 
rjbs
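
A minimal sketch of the point above, assuming the line has been decoded to a character string. The sample command name is hypothetical; the escape \x{0B69} is the Oriya digit three shown in the message.

```perl
use strict;
use warnings;

# "=head" followed by U+0B69 ORIYA DIGIT THREE, as a character string.
my $line = "=head\x{0B69}";

# \d matches any Unicode decimal digit here, so the digit joins the name.
my ($with_d) = $line =~ /\A=([a-zA-Z]+\d*)\b/;

# [0-9] accepts ASCII digits only. The Oriya digit is still a word
# character, so there is no \b after "head" and the whole match fails.
my ($with_ascii) = $line =~ /\A=([a-zA-Z]+[0-9]*)\b/;

printf "\\d:    %s\n",
    defined $with_d ? "matched, length " . length($with_d) : "no match";
printf "[0-9]: %s\n", defined $with_ascii ? "matched" : "no match";
```

Perl's /a regex modifier would also restrict \d to ASCII digits, but an explicit [0-9] states the intent without depending on a flag.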




Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-08 Thread Marek Rouchal
IIRC the first liberal rx is to detect start of POD just like the Perl 
(language) parser does, i.e. it pauses parsing for instructions until the next 
=cut
I think POD parsers should do the same. If the matched pod-start sequence does 
not match any of the known commands, it's an error condition, and we should 
discuss what to do then, like 
- throw exception 
- print error and/or call error callback
- warn and treat the content as a plain text paragraph

-Marek


Sent from my Samsung Galaxy smartphone.


-------- Original message --------
From: David E. Wheeler da...@justatheory.com 
Date: 08.01.2015  06:39  (GMT+01:00) 
To: Karl Williamson pub...@khwilliamson.com 
Cc: Randy Stauner rwstau...@cpan.org, pod-people@perl.org 
Subject: Re: Pod::Simple can treat binary as pod due to liberal/inconsistent 
regexp patterns 



Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-08 Thread David E. Wheeler
On Jan 7, 2015, at 10:18 PM, Marek Rouchal ma...@rouchal.net wrote:

 IIRC the first liberal rx is to detect start of POD just like the Perl 
 (language) parser does, i.e. it pauses parsing for instructions until the 
 next =cut

Oh. Can someone dig into the Perl parser and confirm this?

 I think POD parsers should do the same.

My suspicion is that, even if that’s true, the parser ignores everything in a 
__DATA__ or __END__ block.

Anyway, even if Perl is more lenient, that doesn’t mean a Pod parser needs to 
be. What is and is not valid Pod is quite well-defined in perlpodspec, so I 
suspect that we can afford to be a bit stricter.

 If the matched pod-start sequence does not match any of the known commands, 
 it's an error condition, and we should discuss what to do then, like 
 - throw exception 
 - print error and/or call error callback
 - warn and treat the content as a plain text paragraph

It might be valid Perl.

my $foo = q{
=sîî
};

So I think it would be better just to be stricter in what we consider to be Pod.

Best,

David





Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-08 Thread Karl Williamson

On 01/08/2015 11:17 AM, Randy Stauner wrote:

 IIRC the first liberal rx is to detect start of POD just like the Perl 
(language) parser does, i.e. it pauses parsing for instructions until the next =cut

Oh. Can someone dig into the Perl parser and confirm this?

 I think POD parsers should do the same.

My suspicion is that, even if that’s true, the Parser ignores
everything in a __DATA__ or __END__ block.


Here is an example I worked up when writing tests for metacpan.
Everything after __DATA__ is data, but the pod parser will also find pod
there if it is present:
https://gist.github.com/rwstauner/98f97e6cd64c972d9b71



I don't understand the parser very well, but if someone wants a crack at 
it, here is the only portion of it that sets the in-pod flag.  The 
context is that the first character on the line is an =, and tmp holds 
the character that follows that =.  I think 's' points to the input 
starting at tmp, so that tmp == *s:


    if (PL_expect == XSTATE && isALPHA(tmp) &&
            (s == PL_linestart+1 || s[-2] == '\n') )
    {
        if ((PL_in_eval && !PL_rsfp && !PL_parser->filtered)
            || PL_lex_state != LEX_NORMAL) {
            d = PL_bufend;
            while (s < d) {
                if (*s++ == '\n') {
                    incline(s);
                    if (strnEQ(s,"=cut",4)) {
                        s = strchr(s,'\n');
                        if (s)
                            s++;
                        else
                            s = d;
                        incline(s);
                        goto retry;
                    }
                }
            }
            goto retry;
        }
        s = PL_bufend;
        PL_parser->in_pod = 1;
        goto retry;
    }



Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-07 Thread David E. Wheeler
On Jan 7, 2015, at 11:30 AM, Karl Williamson pub...@khwilliamson.com wrote:

 I asked David about the inconsistency and he asked that I bring it up here.
 
 Shouldn't the more strict regexp be used in both places?
 
 I think so.  Looking at the regexes though, I didn't know that directives 
 could be capitals, and I thought that digits had to always be the last 
 character (or characters ?) in a directive.  It seems to me that both regexes 
 should be tightened.

perlpodspec says:

 Pod content is contained in Pod blocks. A Pod block starts with a line
 that matches m/\A=[a-zA-Z]/, and continues up to the next line that
 matches m/\A=cut/ or up to the end of the file if there is no
 m/\A=cut/ line.
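
The rule quoted above can be sketched in a few lines of Perl. This is only an illustration of the perlpodspec wording, not how Pod::Simple is implemented; pod_blocks is a hypothetical helper, and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the Pod blocks found in a list of lines,
# using only the two patterns quoted from perlpodspec.
sub pod_blocks {
    my @lines = @_;
    my ( @blocks, $current );
    for my $line (@lines) {
        if ( !defined $current ) {
            # Outside Pod: a block starts at a line matching m/\A=[a-zA-Z]/.
            $current = [$line] if $line =~ m/\A=[a-zA-Z]/;
        }
        else {
            push @$current, $line;
            # Inside Pod: the block runs up to the next m/\A=cut/ line.
            if ( $line =~ m/\A=cut/ ) {
                push @blocks, $current;
                undef $current;
            }
        }
    }
    # No closing =cut: the block continues to the end of the file.
    push @blocks, $current if defined $current;
    return @blocks;
}

my @blocks = pod_blocks(
    "my \$x = 1;\n",
    "=head1 NAME\n",
    "\n",
    "Example\n",
    "=cut\n",
    "print \$x;\n",
    "=F\0binary\n",    # also starts a block under this liberal rule
    "more bytes\n",
);
printf "%d blocks\n", scalar @blocks;
```

Because the start pattern only checks for "=" plus one ASCII letter, the made-up binary line opens a second block that runs to the end of the input.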

I agree that’s too liberal. I suggest

/\A=([a-zA-Z]+\d*)\b/

 On the first pass the parser marks the line as pod (presumably matching
 a directive)
 but on the second pass the line doesn't match any patterns and it all
 falls through as a paragraph.
 
 This inconsistency allows binary data to be treated as a pod document.
 Is there a recommended way to parse the pod out of a document that might
 have binary data in it?
 
 I don't know about this.

It seems to me that if the second match does not think it is Pod, then it 
should not be a paragraph (unless it was already in a pod section from a 
previous declaration). I suspect that if we tighten up the first regex as I 
suggest here, and sync the second with it, we should be okay. Thoughts?

Best,

David





Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-07 Thread Karl Williamson

On 01/06/2015 07:55 AM, Randy Stauner wrote:

This came up in discussing a metacpan bug
(https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855)...

A perl module can technically have perl code, pod, and even spans of
binary (in a data token, or maybe even a here doc).

To my surprise, the pod parser matched a line like =F\0 in the binary
blob and began treating the document as pod.

The matching is inconsistent though:
A very liberal regexp matched the binary and triggered the start of the
document:

if($line =~ m/^=([a-zA-Z]+)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158

Later on the line is re-processed to see what kind of pod it is and no
longer matches the more strict regexp:

if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243

So in a document that had no pod, the pod parser returned a bunch of
binary blobs
because it matched a very loose regexp, started the document, and then
found no actual pod (so basically everything afterwards is treated as a
pod paragraph).

I asked David about the inconsistency and he asked that I bring it up here.

Shouldn't the more strict regexp be used in both places?


I think so.  Looking at the regexes though, I didn't know that 
directives could be capitals, and I thought that digits had to always be 
the last character (or characters ?) in a directive.  It seems to me 
that both regexes should be tightened.




On the first pass the parser marks the line as pod (presumably matching
a directive)
but on the second pass the line doesn't match any patterns and it all
falls through as a paragraph.

This inconsistency allows binary data to be treated as a pod document.
Is there a recommended way to parse the pod out of a document that might
have binary data in it?


I don't know about this.



Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns

2015-01-06 Thread Randy Stauner
This came up in discussing a metacpan bug (
https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855)...

A perl module can technically have perl code, pod, and even spans of binary
(in a data token, or maybe even a here doc).

To my surprise, the pod parser matched a line like =F\0 in the binary
blob and began treating the document as pod.

The matching is inconsistent though:
A very liberal regexp matched the binary and triggered the start of the
document:

if($line =~ m/^=([a-zA-Z]+)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158

Later on the line is re-processed to see what kind of pod it is and no
longer matches the more strict regexp:

if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243

So in a document that had no pod, the pod parser returned a bunch of binary
blobs
because it matched a very loose regexp, started the document, and then
found no actual pod (so basically everything afterwards is treated as a pod
paragraph).

I asked David about the inconsistency and he asked that I bring it up here.

Shouldn't the more strict regexp be used in both places?
On the first pass the parser marks the line as pod (presumably matching a
directive)
but on the second pass the line doesn't match any patterns and it all falls
through as a paragraph.

This inconsistency allows binary data to be treated as a pod document.
Is there a recommended way to parse the pod out of a document that might
have binary data in it?
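
The mismatch described above can be reproduced with a small sketch. Only the two patterns are taken from BlackBox.pm; the loop is a deliberately simplified stand-in for the parser's control flow, and the input lines are assumed sample data.

```perl
use strict;
use warnings;

# Assumed input: a file with no Pod whose data section contains "=F\0".
my @lines = ( "package Foo;\n", "__DATA__\n", "=F\0\x1f\x8b\n", "\x00\x01\n" );

my $in_pod = 0;
my @events;
for my $line (@lines) {
    # First check: the liberal pattern decides that Pod has started.
    if ( !$in_pod ) {
        next unless $line =~ m/^=([a-zA-Z]+)/s;
        $in_pod = 1;
    }
    # Second check: the stricter pattern classifies the line.
    if ( $line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s ) {
        push @events, "command $1";
    }
    else {
        push @events, "paragraph";    # binary falls through as text
    }
}
print "$_\n" for @events;
```

In this sketch no line is ever classified as a command, yet both binary lines are emitted as paragraphs, which is the behaviour reported above.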