Hi Philip,

> This is just a quick reply (sorry, out of time to reply in detail). You are
> indeed not doing this quite right. Search the PCRE2 documents for
>
>   PCRE2_MATCH_INVALID_UTF
>
> This option forces PCRE2_UTF and also enables support for matching by 
> pcre2_match() in subject strings that contain invalid UTF sequences.

This doesn't seem to work for me, though:

When I add PCRE2_MATCH_INVALID_UTF to the options for pcre2_compile(), the 
crash goes away, but the pattern is no longer found in my sample binary file; 
it is still found in a plain text file, though.

OTOH, if I remove both PCRE2_MATCH_INVALID_UTF and PCRE2_UTF from the options, 
then I get no crash and successfully find the pattern in my sample binary file 
(not the locate db).

The binary file is a macOS library file (CoreFoundation), and the pattern is a 
plain ASCII name (all letters) that appears as a symbol in it. It's part of the 
symbol table, where lots of plain ASCII names are separated by single 00 bytes. 
None of that is invalid UTF, and even if PCRE2 considered a 00 byte 
invalid, it should resume at the next byte, which eventually is the first byte 
of the pattern to find. So I don't see why this wouldn't work with the 
PCRE2_MATCH_INVALID_UTF option.
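
To illustrate the layout I mean, here is a small sketch without PCRE2 at all. 
The buffer contents and the helper name find_bytes are made up for 
illustration (they are not taken from CoreFoundation). It only shows that a 
length-aware search walks past the 00 separators, whereas anything that 
treats the data as a C string stops at the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: naive length-aware search for `needle` inside
 * `hay`. Returns the byte offset of the first occurrence, or -1. */
static long find_bytes(const unsigned char *hay, size_t hay_len,
                       const char *needle, size_t needle_len)
{
    assert(needle_len > 0);
    if (hay_len < needle_len)
        return -1;

    /* Try every start position; 00 bytes are compared like any other byte. */
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return (long)i;
    }
    return -1;
}
```

With a buffer such as "first", 00, "second", 00, "NSURLVolumeNameKey", 00, 
find_bytes() locates the third name as long as it is given the full buffer 
length, while strstr() on the same buffer only ever sees "first".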

The explanation for PCRE2_MATCH_INVALID_UTF in the docs (pcre2unicode.html) 
makes sense to me in how it deals with invalid UTF sequences in the 
binary data. So, ideally, I'd like to follow your suggestion, but it would mean 
that the search for plain ASCII text in my sample binary file would fail, and 
that's not great.

Also, if I keep doing this without PCRE2_MATCH_INVALID_UTF and PCRE2_UTF, 
do I still run the risk of crashes and related issues? So far, I seem to get 
the crashes only if I use PCRE2_UTF. I understand that I won't be able to 
find non-ASCII UTF text in that case.

When I run this command, it finds the pattern inside the binary file:

  pcre2grep -al 'NSURLVolumeNameKey' CoreFoundation

So, if that works, what am I doing wrong, then? Or is pcre2grep not using the 
PCRE2_UTF option?

Thomas


-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
