Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On Jan 13, 2015, at 10:11 PM, Karl Williamson pub...@khwilliamson.com wrote:

> Nobody has explained to me why we accept uppercase when all existing pod commands (I believe) are entirely lowercase. And shouldn't the digits only be in the final position? I don't know the answers, but am just pointing out potential problems.

I think it is an historical thing. No harm in it, really. I would not be surprised if there were uppercase Pod commands in the wild.

David
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On Jan 12, 2015, at 11:42 AM, David E. Wheeler da...@justatheory.com wrote:

> Honestly, since the current regex matches stuff that is not in fact Pod, I think it is reasonable to tighten up the regex to /\A=([a-zA-Z]+[0-9]*)\b/

That one, it turns out, was no less liberal than the previous regex. I added a test matching the pattern Randy identified, and it failed with this regex, too. So I instead copied the regex from later in the file, which *is* sufficiently strict, and which brings the two into line, to boot. The change is here:

https://github.com/theory/pod-simple/commit/31942ec

Look good? If so, I will update perlpodspec to match it and send it off to p5p.

Best,

David
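A minimal sketch of why the \b version was no less liberal. It assumes a binary line that begins with the =F\0 sequence from Randy's report; the trailing bytes and the sub names are made up for illustration. \b is satisfied between a letter and a NUL byte, whereas the stricter pattern from later in BlackBox.pm requires whitespace or end of line after the command name.

```perl
use strict;
use warnings;

# Hypothetical binary line: "=F\0" is from the report, the rest is made up.
my $binary_line = "=F\0\x01\xff";

# The pattern proposed earlier in the thread (with [0-9] as intended).
sub matches_word_boundary {
    my ($line) = @_;
    return $line =~ /\A=([a-zA-Z]+[0-9]*)\b/ ? 1 : 0;
}

# The stricter pattern quoted from later in Pod::Simple::BlackBox.
sub matches_strict {
    my ($line) = @_;
    return $line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s ? 1 : 0;
}

for my $line ($binary_line, "=head1 NAME\n") {
    printf "%-4s boundary=%d strict=%d\n",
        ($line =~ /\A=head/ ? 'pod' : 'blob'),
        matches_word_boundary($line),
        matches_strict($line);
}
```

The blob line passes the \b pattern but not the strict one, while a real =head1 line passes both.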
AW: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
Right. Now I remember old threads where people would argue that POD parsers should do exactly the same as the Perl parser - and IIRC the conclusion was that using something like PPI to handle pathological cases like multiline strings or here docs would be overkill, so POD is what starts with a (valid) POD directive. The only thing that perhaps could be changed is to skip a __DATA__ section (but keep parsing, since there may be POD after __END__!)

I see a potential way of resolving this, but it looks like quite a big effort: the Perl parser could store all text it skips as POD in a structure similar to __DATA__, so that POD parsing utilities could use a pseudo-filehandle like this (reading POD for the current script):

    while (<main::__POD__>) { ... }

and for other files there could be a special open() discipline to return only POD using the same parser.

What do you think?

-Marek

Original message
From: Randy Stauner rwstau...@cpan.org
Date: 08.01.2015 19:26 (GMT+01:00)
To: David E. Wheeler da...@justatheory.com
Cc: Marek Rouchal ma...@rouchal.net, Karl Williamson pub...@khwilliamson.com, pod-people@perl.org
Subject: Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
* David E. Wheeler da...@justatheory.com [2015-01-08T00:38:04]

> I agree that’s too liberal. I suggest /\A=([a-zA-Z]+\d*)\b/

<trolling?>
Surely you want [0-9] instead of \d, lest we end up with =head୩ !
</trolling?>

--
rjbs
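The point about \d can be checked with a short sketch. It assumes the character in the message is U+0B69 ORIYA DIGIT THREE, and the sub names are made up for illustration. On a character string, \d matches any Unicode decimal digit, while [0-9] matches only the ten ASCII digits.

```perl
use strict;
use warnings;

# "=head" followed by U+0B69 ORIYA DIGIT THREE (assumed to be the character
# in the message), written as an escape so the example does not depend on
# the source file's encoding.
my $line = "=head\x{0B69}";

# \d matches any Unicode decimal digit on a character string.
sub command_with_backslash_d {
    my ($text) = @_;
    return $text =~ /\A=([a-zA-Z]+\d*)\b/ ? $1 : undef;
}

# [0-9] only matches the ten ASCII digits.
sub command_with_ascii_digits {
    my ($text) = @_;
    return $text =~ /\A=([a-zA-Z]+[0-9]*)\b/ ? $1 : undef;
}

my $loose = command_with_backslash_d($line);
my $tight = command_with_ascii_digits($line);

binmode STDOUT, ':encoding(UTF-8)';
print "\\d:    ", (defined $loose ? "matched '$loose'" : "no match"), "\n";
print "[0-9]: ", (defined $tight ? "matched '$tight'" : "no match"), "\n";
```

With [0-9] the whole match fails here, because there is no word boundary between the "d" and the following Unicode digit. On Perl 5.14 and later, the /a modifier is another way to restrict \d to ASCII.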
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
IIRC the first liberal rx is there to detect the start of POD just like the Perl (language) parser does, i.e. it pauses parsing for instructions until the next =cut.

I think POD parsers should do the same. If the matched pod-start sequence does not match any of the known commands, it's an error condition, and we should discuss what to do then, like

- throw exception
- print error and/or call error callback
- warn and treat the content as a plain text paragraph

-Marek

Original message
From: David E. Wheeler da...@justatheory.com
Date: 08.01.2015 06:39 (GMT+01:00)
To: Karl Williamson pub...@khwilliamson.com
Cc: Randy Stauner rwstau...@cpan.org, pod-people@perl.org
Subject: Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
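To make the three options in Marek's message concrete, here is a rough sketch. Everything in it is hypothetical: the sub name, the $policy values, the returned hash shapes, and the hard-coded command list (taken from perlpodspec as of this thread) are assumptions for illustration, not how Pod::Simple is implemented.

```perl
use strict;
use warnings;

# Commands listed in perlpodspec at the time of this thread. Hard-coding them
# is an assumption of this sketch, not how Pod::Simple stores them.
my %KNOWN = map { $_ => 1 } qw(
    pod cut head1 head2 head3 head4 over item back begin end for encoding
);

# $policy is a hypothetical knob: 'die', 'callback', or 'warn'.
sub handle_command_line {
    my ($line, $policy, $on_error) = @_;

    # Liberal pod-start match; anything else is passed through as text.
    my ($command) = $line =~ /\A=([a-zA-Z][a-zA-Z0-9]*)/
        or return { type => 'text', content => $line };
    return { type => 'command', name => $command } if $KNOWN{$command};

    my $message = "Unknown Pod command '=$command'";

    # Option 1: throw an exception.
    die "$message\n" if $policy eq 'die';

    # Option 2: report through an error callback.
    if ($policy eq 'callback') {
        $on_error->($message) if $on_error;
        return { type => 'error', message => $message };
    }

    # Option 3: warn and treat the content as a plain text paragraph.
    warn "$message\n";
    return { type => 'text', content => $line };
}
```

For example, handle_command_line("=F\0", 'callback', sub { print "error: $_[0]\n" }) reports the unknown command instead of silently starting a pod document.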
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On Jan 7, 2015, at 10:18 PM, Marek Rouchal ma...@rouchal.net wrote:

> IIRC the first liberal rx is to detect start of POD just like the Perl (language) parser does, i.e. it pauses parsing for instructions until the next =cut

Oh. Can someone dig into the Perl parser and confirm this?

> I think POD parsers should do the same.

My suspicion is that, even if that’s true, the parser ignores everything in a __DATA__ or __END__ block. Anyway, even if Perl is more lenient, that doesn’t mean a Pod parser needs to be. What is and is not valid Pod is quite well-defined in perlpodspec, so I suspect that we can afford to be a bit stricter.

> If the matched pod-start sequence does not match any of the known commands, it's an error condition, and we should discuss what to do then, like
> - throw exception
> - print error and/or call error callback
> - warn and treat the content as a plain text paragraph

It might be valid Perl.

    my $foo = q{
    =sîî
    };

So I think it would be better just to be stricter in what we consider to be Pod.

Best,

David
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On 01/08/2015 11:17 AM, Randy Stauner wrote:

>>> IIRC the first liberal rx is to detect start of POD just like the Perl (language) parser does, i.e. it pauses parsing for instructions until the next =cut
>>
>> Oh. Can someone dig into the Perl parser and confirm this?
>>
>>> I think POD parsers should do the same.
>>
>> My suspicion is that, even if that’s true, the Parser ignores everything in a __DATA__ or __END__ block.
>
> Here is an example I worked up when writing tests for metacpan: Everything after __DATA__ is data, but the pod parser will also find pod if it's there
>
> https://gist.github.com/rwstauner/98f97e6cd64c972d9b71

I don't understand the parser very well, but if someone wants a crack at it, here is the only portion of it that sets the state to being in pod. The context is that the first character on the line is an =, and tmp holds the character that follows that =. I think 's' points to the input starting at tmp, so that tmp == *s:

    if (PL_expect == XSTATE && isALPHA(tmp) &&
        (s == PL_linestart+1 || s[-2] == '\n') )
    {
        if ((PL_in_eval && !PL_rsfp && !PL_parser->filtered)
            || PL_lex_state != LEX_NORMAL) {
            d = PL_bufend;
            while (s < d) {
                if (*s++ == '\n') {
                    incline(s);
                    if (strnEQ(s,"=cut",4)) {
                        s = strchr(s,'\n');
                        if (s)
                            s++;
                        else
                            s = d;
                        incline(s);
                        goto retry;
                    }
                }
            }
            goto retry;
        }
        s = PL_bufend;
        PL_parser->in_pod = 1;
        goto retry;
    }
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On Jan 7, 2015, at 11:30 AM, Karl Williamson pub...@khwilliamson.com wrote:

>> I asked David about the inconsistency and he asked that I bring it up here. Shouldn't the more strict regexp be used in both places?
>
> I think so. Looking at the regexes though, I didn't know that directives could be capitals, and I thought that digits had to always be the last character (or characters ?) in a directive. It seems to me that both regexes should be tightened.

perlpodspec says:

    Pod content is contained in Pod blocks. A Pod block starts with a line that matches m/\A=[a-zA-Z]/, and continues up to the next line that matches m/\A=cut/ or up to the end of the file if there is no m/\A=cut/ line.

I agree that’s too liberal. I suggest /\A=([a-zA-Z]+\d*)\b/

>> On the first pass the parser marks the line as pod (presumably matching a directive) but on the second pass the line doesn't match any patterns and it all falls through as a paragraph. This inconsistency allows binary data to be treated as a pod document. Is there a recommended way to parse the pod out of a document that might have binary data in it?
>
> I don't know about this.

It seems to me that if the second match does not think it is Pod, then it should not be a paragraph (unless it was already in a pod section from a previous declaration). I suspect that if we tighten up the first regex as I suggest here, and sync the second with it, we should be okay.

Thoughts?

Best,

David
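A literal reading of the perlpodspec wording quoted in this message can be sketched as below. It uses only the two patterns from the quote; the sub name and the sample input are made up, and this is not Pod::Simple's code. It shows how a binary line such as =F\0 opens a Pod block under that definition.

```perl
use strict;
use warnings;

# Return the Pod blocks in $source as a list of strings, using only the two
# patterns quoted from perlpodspec. The =cut line is kept with its block.
sub pod_blocks {
    my ($source) = @_;
    my (@blocks, $current);
    for my $line (split /^/m, $source) {
        if (defined $current) {
            $current .= $line;
            if ($line =~ m/\A=cut/) {
                push @blocks, $current;
                undef $current;
            }
        }
        elsif ($line =~ m/\A=[a-zA-Z]/) {
            $current = $line;    # a new block starts here
        }
    }
    push @blocks, $current if defined $current;    # block runs to end of file
    return @blocks;
}

# Hypothetical input: no real pod, but a data section with "=F\0" in it.
my $source = "print 1;\n__DATA__\n=F\0\x01\n\x02\x03\n";
my @blocks = pod_blocks($source);
printf "%d block(s), %d byte(s) in the first\n",
    scalar @blocks, length $blocks[0];
```

Everything from the =F\0 line to the end of the input comes back as a single "Pod" block, which is the behaviour the thread is trying to avoid.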
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On 01/06/2015 07:55 AM, Randy Stauner wrote:

> This came up in discussing a metacpan bug (https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855)...
>
> A perl module can technically have perl code, pod, and even spans of binary (in a data token, or maybe even a here doc). To my surprise, the pod parser matched a line like =F\0 in the binary blob and began treating the document as pod.
>
> The matching is inconsistent though: A very liberal regexp matched the binary and triggered the start of the document:
>
>     if($line =~ m/^=([a-zA-Z]+)/s) {
>
> https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158
>
> Later on the line is re-processed to see what kind of pod it is and no longer matches the more strict regexp:
>
>     if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {
>
> https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243
>
> So in a document that had no pod, the pod parser returned a bunch of binary blobs because it matched a very loose regexp, started the document, and then found no actual pod (so basically everything afterwards is treated as a pod paragraph).
>
> I asked David about the inconsistency and he asked that I bring it up here. Shouldn't the more strict regexp be used in both places?

I think so. Looking at the regexes though, I didn't know that directives could be capitals, and I thought that digits had to always be the last character (or characters ?) in a directive. It seems to me that both regexes should be tightened.

> On the first pass the parser marks the line as pod (presumably matching a directive) but on the second pass the line doesn't match any patterns and it all falls through as a paragraph. This inconsistency allows binary data to be treated as a pod document. Is there a recommended way to parse the pod out of a document that might have binary data in it?

I don't know about this.
Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
This came up in discussing a metacpan bug (https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855)...

A perl module can technically have perl code, pod, and even spans of binary (in a data token, or maybe even a here doc). To my surprise, the pod parser matched a line like =F\0 in the binary blob and began treating the document as pod.

The matching is inconsistent though: A very liberal regexp matched the binary and triggered the start of the document:

    if($line =~ m/^=([a-zA-Z]+)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158

Later on the line is re-processed to see what kind of pod it is and no longer matches the more strict regexp:

    if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243

So in a document that had no pod, the pod parser returned a bunch of binary blobs because it matched a very loose regexp, started the document, and then found no actual pod (so basically everything afterwards is treated as a pod paragraph).

I asked David about the inconsistency and he asked that I bring it up here. Shouldn't the more strict regexp be used in both places?

On the first pass the parser marks the line as pod (presumably matching a directive) but on the second pass the line doesn't match any patterns and it all falls through as a paragraph. This inconsistency allows binary data to be treated as a pod document. Is there a recommended way to parse the pod out of a document that might have binary data in it?
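The two-pass mismatch described above can be reproduced with a simplified sketch. It uses the two patterns quoted from BlackBox.pm, but the surrounding loop, the sub name, the labels ('code', 'command', 'paragraph'), and the sample bytes are invented for illustration. It ignores =cut and everything else the real parser does.

```perl
use strict;
use warnings;

# Two-pass classification using the two patterns quoted from BlackBox.pm.
# The labels returned here are invented for this sketch; they are not
# Pod::Simple's internal names.
sub classify_lines {
    my (@lines) = @_;
    my $in_pod = 0;
    my @result;
    for my $line (@lines) {
        # First pass: the liberal pattern decides whether pod has started.
        $in_pod = 1 if !$in_pod && $line =~ m/^=([a-zA-Z]+)/s;

        if (!$in_pod) {
            push @result, 'code';
        }
        # Second pass: the stricter pattern decides what kind of pod it is.
        elsif ($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {
            push @result, 'command';
        }
        else {
            push @result, 'paragraph';
        }
    }
    return @result;
}

# Hypothetical input: one line of code, then binary that starts with "=F\0".
my @labels = classify_lines("package Foo;\n", "=F\0\x01\x02\n", "\x03\x04\n");
print join(', ', @labels), "\n";
```

The =F\0 line flips the parser into pod mode on the first pass, fails the second pattern, and falls through as a paragraph along with the binary that follows it.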