Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On Jan 7, 2015, at 11:30 AM, Karl Williamson pub...@khwilliamson.com wrote:

I asked David about the inconsistency and he asked that I bring it up here. Shouldn't the more strict regexp be used in both places?

I think so. Looking at the regexes though, I didn't know that directives could be capitals, and I thought that digits had to always be the last character (or characters ?) in a directive. It seems to me that both regexes should be tightened.

perlpodspec says:

Pod content is contained in Pod blocks. A Pod block starts with a line that matches m/\A=[a-zA-Z]/, and continues up to the next line that matches m/\A=cut/ or up to the end of the file if there is no m/\A=cut/ line.

I agree that’s too liberal. I suggest /\A=([a-zA-Z]+\d*)\b/

On the first pass the parser marks the line as pod (presumably matching a directive) but on the second pass the line doesn't match any patterns and it all falls through as a paragraph. This inconsistency allows binary data to be treated as a pod document.

Is there a recommended way to parse the pod out of a document that might have binary data in it?

I don't know about this. It seems to me that if the second match does not think it is Pod, then it should not be a paragraph (unless it was already in a pod section from a previous declaration).

I suspect that if we tighten up the first regex as I suggest here, and sync the second with it, we should be okay.

Thoughts?

Best,

David
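For anyone who wants to compare the perlpodspec pattern with the one suggested above, here is a minimal Perl sketch. Only the two patterns come from the message; the sample lines and the helper name check_line are made up for illustration.

```perl
use strict;
use warnings;

# Patterns as quoted in the message above.
my $spec_re     = qr/\A=[a-zA-Z]/;            # from perlpodspec
my $proposed_re = qr/\A=([a-zA-Z]+\d*)\b/;    # the suggested tightening

# Hypothetical helper: returns (matches_spec, matches_proposed) as 0/1.
sub check_line {
    my ($line) = @_;
    return (
        ($line =~ $spec_re     ? 1 : 0),
        ($line =~ $proposed_re ? 1 : 0),
    );
}

# Hypothetical sample lines, including a digit mid-name and a NUL byte.
for my $line ("=head1 NAME", "=cut", "=head1x", "=F\0", "=1abc") {
    my ($spec, $proposed) = check_line($line);
    (my $shown = $line) =~ s/\0/\\0/g;    # make the NUL byte printable
    printf "%-12s spec=%d proposed=%d\n", $shown, $spec, $proposed;
}
```

One thing the sketch brings out: \b only requires a word/non-word boundary, and a NUL byte is a non-word character, so a line like =F\0 still satisfies the suggested pattern, while =head1x does not. Whether that matters depends on what the second-pass pattern ends up requiring after the directive name.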
Re: Pod::Simple can treat binary as pod due to liberal/inconsistent regexp patterns
On 01/06/2015 07:55 AM, Randy Stauner wrote:

This came up in discussing a metacpan bug (https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855)...

A perl module can technically have perl code, pod, and even spans of binary (in a data token, or maybe even a here doc). To my surprise, the pod parser matched a line like =F\0 in the binary blob and began treating the document as pod.

The matching is inconsistent though:

A very liberal regexp matched the binary and triggered the start of the document:

    if($line =~ m/^=([a-zA-Z]+)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158

Later on the line is re-processed to see what kind of pod it is and no longer matches the more strict regexp:

    if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243

So in a document that had no pod, the pod parser returned a bunch of binary blobs because it matched a very loose regexp, started the document, and then found no actual pod (so basically everything afterwards is treated as a pod paragraph).

I asked David about the inconsistency and he asked that I bring it up here. Shouldn't the more strict regexp be used in both places?

I think so. Looking at the regexes though, I didn't know that directives could be capitals, and I thought that digits had to always be the last character (or characters ?) in a directive. It seems to me that both regexes should be tightened.

On the first pass the parser marks the line as pod (presumably matching a directive) but on the second pass the line doesn't match any patterns and it all falls through as a paragraph. This inconsistency allows binary data to be treated as a pod document.

Is there a recommended way to parse the pod out of a document that might have binary data in it?

I don't know about this.
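The mismatch described above can be reproduced outside the parser with a small Perl sketch. The two patterns are copied from the BlackBox.pm lines linked in the message; the sample lines, the variable names, and the helper classify are hypothetical and are not how Pod::Simple itself is structured.

```perl
use strict;
use warnings;

# The two patterns quoted above from Pod::Simple::BlackBox.
my $start_re     = qr/^=([a-zA-Z]+)/s;                          # first pass: "pod starts here"
my $directive_re = qr/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s;  # second pass: which directive

# Hypothetical helper: returns (starts_pod, is_directive) as 0/1.
sub classify {
    my ($line) = @_;
    my $starts    = $line =~ $start_re     ? 1 : 0;
    my $directive = $line =~ $directive_re ? 1 : 0;
    return ($starts, $directive);
}

# Hypothetical sample lines; the second imitates "=F\0" inside a binary blob.
for my $line ("=head1 NAME", "=F\0\x01\x02", "=cut", 'my $x = 1;') {
    my ($starts, $directive) = classify($line);
    # Escape non-printable bytes so the output stays readable.
    (my $shown = $line) =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    printf "%-20s starts_pod=%d directive=%d\n", $shown, $starts, $directive;
}
```

For the binary-looking line, the first pattern matches on =F alone, while the second fails because the byte after the name is neither whitespace nor end of line. That is the combination that starts a pod document and then lets everything fall through as a paragraph.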