On Mon, Sep 14, 2020 at 11:05:21PM +0200, Thomas Tempelmann via Pcre-dev wrote:
> I have uploaded a rather short binary file here:
> http://files.tempel.org/tmp/pcre2_subject_sample.
>
> If you use my sample code to search for "AWAVAUATSH", it won't find it with
> PCRE2_MATCH_INVALID_UTF but will find it with the PCRE2_UTF option (no
> crash in this small file, though).
> So, whenever I use PCRE2_MATCH_INVALID_UTF, text won't be found at all in
> binary files, it seems. That contradicts the docs and Philip's suggestion,
> though.
>
> What am I doing wrong?
>
You need PCRE2_MATCH_INVALID_UTF. It is the way to match in binary data,
which the Mach-O binary is, when you also need Unicode handling.
The pcre2grep invocation closest to your code looks like this:

$ pcre2grep -l --binary-files=binary -i -U AWAVAUATSH pcre2_subject_sample
Binary file pcre2_subject_sample matches

I looked at your code, compared it to pcre2grep, made your code similar to
pcre2grep, and found these two differences:

(1) pcre2grep does not invoke pcre2_match_data_create_from_pattern_8().

(2) pcre2grep searches the file in smaller lumps: 7901, 236, 3, 1917, 139.
    That's 156 bytes less than the size of the pcre2_subject_sample file.
    It looks like a bug in pcre2grep, unless it's some kind of smart
    optimization.

I also corrected your code to search only the file length, not the whole
heap-allocated 32 MB, whose trailing part could be uninitialized binary
garbage.

But I found what triggers the misbehaviour of your code: it's the
PCRE2_CASELESS option. Without it, it works like a charm.

For your information, this is how I changed your code:

//
//  main.c
//  PCRE2_Binary_Search
//
//  Created by Thomas Tempelmann on 14.09.20.
//

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

int main(int argc, const char * argv[]) {
    const char *find = "AWAVAUATSH";
    uint32_t regexOptions = PCRE2_UTF | PCRE2_CASELESS | PCRE2_MATCH_INVALID_UTF;
    uint32_t matchOptions = 0;

    int errNum = 0;
    PCRE2_SIZE errOfs = 0;
    pcre2_code *regEx2 = pcre2_compile_8 ((PCRE2_SPTR)find,
        10/*PCRE2_ZERO_TERMINATED*/, regexOptions, &errNum, &errOfs, NULL);
    if (!regEx2) {
        printf("pcre2_compile_8() failed.\n");
        return 1;
    }

    pcre2_match_data *regEx2Match =
        pcre2_match_data_create_from_pattern_8 (regEx2, NULL);
    if (!regEx2Match) {
        printf("pcre2_match_data_create_from_pattern() failed.\n");
        return 1;
    }

    size_t dataLen = 32 * 1024 * 1024; // 32 MB
    void *dataPtr = malloc (dataLen);
    if (!dataPtr) {
        printf("malloc() failed.\n");
        return 1;
    }

    int fd = open ("/tmp/pcre2_subject_sample", O_RDONLY);
    if (fd == -1) {
        printf("open() failed.\n");
        return 1;
    }
    ssize_t dataRead = read (fd, dataPtr, dataLen);
    if (dataRead != 10352) {
        printf("read() did not return 10352 bytes.\n");
        return 1;
    }

    // Search only the bytes actually read, not the whole 32 MB buffer.
    errNum = pcre2_match_8 (regEx2, (PCRE2_SPTR8)dataPtr, (PCRE2_SIZE)dataRead,
        0, matchOptions, regEx2Match, NULL);
    if (errNum >= 1) {
        printf("A match found.\n");
    } else if (errNum == 0) {
        printf("A matchblock is too small.\n");
    } else {
        printf("No match found.\n");
    }

    return 0;
}

-- Petr
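A side note on the point above about searching only the file length rather than the whole 32 MB buffer: a single read() call may return fewer bytes than requested, so a loop is the more general form. The following is only a minimal sketch; read_file_prefix() is a hypothetical helper name, not part of PCRE2 or of the original program, and it assumes a POSIX system.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (an assumption for illustration, not a PCRE2 API):
 * read at most cap bytes from path into buf and return the number of
 * bytes actually read, or -1 on error. */
ssize_t read_file_prefix(const char *path, void *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");
        return -1;
    }

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n == 0)
            break;              /* end of file */
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            perror("read");
            close(fd);
            return -1;
        }
        total += (size_t)n;
    }

    close(fd);
    return (ssize_t)total;      /* bytes that are valid in buf */
}
```

The returned count, not the buffer capacity, is what would then be passed as the subject length to pcre2_match(), so that uninitialized bytes past the end of the file are never searched.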
-- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev