This came up while discussing a MetaCPAN bug
(https://github.com/CPAN-API/cpan-api/issues/364#issuecomment-66864855).

A Perl module can technically contain Perl code, pod, and even spans of
binary data (after a __DATA__ token, or maybe even in a here-doc).

To my surprise, the pod parser matched a line like "=F\0" in the binary
blob and began treating the document as pod.

The matching is inconsistent, though. A very liberal regexp matched the
binary and triggered the start of the pod document:

if($line =~ m/^=([a-zA-Z]+)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L158

Later on, the line is re-processed to see what kind of pod it is, and it no
longer matches the stricter regexp used there:

if($line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s) {

https://github.com/theory/pod-simple/blob/b72a3a74bd7ba1a27ba397923f913a12f053e906/lib/Pod/Simple/BlackBox.pm#L243
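To make the mismatch concrete, here is a minimal sketch that runs both regexps (copied from the two lines linked above) against a made-up sample line. The sample string is my own assumption of what such a binary line might look like; it is not taken from the actual file.

```perl
use strict;
use warnings;

# Hypothetical sample: "=F" followed by a NUL byte and other binary data.
my $line = "=F\0\x01\x02";

# Loose check: the pattern that starts the pod document.
my $loose = $line =~ m/^=([a-zA-Z]+)/s ? 1 : 0;

# Strict check: the pattern used later to classify the pod command.
my $strict = $line =~ m/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s ? 1 : 0;

print "loose: $loose, strict: $strict\n";
```

The loose pattern only needs "=" plus one letter, so it matches. The strict one also requires whitespace or end-of-line after the command name, and a NUL byte is neither, so it fails.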

So in a document that had no pod, the pod parser returned a bunch of binary
blobs because it matched a very loose regexp, started the document, and then
found no actual pod (so basically everything afterwards is treated as a pod
paragraph).

I asked David about the inconsistency and he asked that I bring it up here.

Shouldn't the stricter regexp be used in both places?
On the first pass the parser marks the line as pod (presumably because it
looks like a directive), but on the second pass the line doesn't match any
of the patterns and it all falls through as a paragraph.
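As a rough illustration of what I mean by using one regexp in both places, here is a sketch with the strict pattern compiled once and shared. The names $POD_COMMAND and pod_command_name are hypothetical and are not part of Pod::Simple; this is not a patch, just the idea.

```perl
use strict;
use warnings;

# The strict pattern from the second check, compiled once so that both
# the "does pod start here?" test and the later classification agree.
my $POD_COMMAND = qr/^(=[a-zA-Z][a-zA-Z0-9]*)(?:\s+|$)(.*)/s;

# Hypothetical helper: returns the command name (e.g. "=head1") or undef.
sub pod_command_name {
    my ($line) = @_;
    return $line =~ $POD_COMMAND ? $1 : undef;
}

# Made-up sample lines: one binary-looking line, two real commands, one plain line.
my @lines = ("=F\0\x01", "=head1 NAME", "=cut", "plain text");
my @found = grep { defined } map { pod_command_name($_) } @lines;

print join(", ", @found), "\n";
```

This would not catch every case: binary data that happens to contain "=" plus letters followed by whitespace at the start of a line would still look like a command.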

This inconsistency allows binary data to be treated as a pod document.
Is there a recommended way to parse the pod out of a document that might
have binary data in it?
