On Tue, 14 Oct 2025 21:53:41 GMT, Daniel Hu <[email protected]> wrote:

>> These changes should prevent entire binary files from being loaded into 
>> memory for build/AbsPathsInImage.java test. I chose a default buffer size of 
>> 8KB since BufferedInputStream uses that, but open to alternative solutions. 
>> GHA passes and test passes on linux x64.
>
> Daniel Hu has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - fix incorrect use of inputstream
>  - remove extraneous variables/imports

test/jdk/build/AbsPathsInImage.java line 110:

> 108:         }
> 109: 
> 110:         List<byte[]> searchPatterns = new ArrayList<>();

Moved `searchPatterns` to a class variable so it's "global" and doesn't need to 
be passed around (removed it from the various function headers accordingly)

test/jdk/build/AbsPathsInImage.java line 187:

> 185:             }
> 186:         }
> 187:     }

This is an implementation of the [Knuth–Morris–Pratt 
algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm).
 I waffled between choosing various string search algorithms, and ultimately 
settled on this one as it's well suited for this application: the algorithm 
doesn't skip or backtrack the text being searched and only relies looking at 
one char at a time. Essentially, the algorithm just creates state machines 
tailored specifically for string searching; upon each inputted char, an update 
is made to the searched string's DFA. A related algorithm I looked at that's 
technically better suited is 
[Aho–Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm)? 
But (to me) it looked significantly more complex (probably requires custom node 
objects and might have a higher space requirement due to needing to store 
references to other nodes per node). Of course, I'm not intimately familiar 
with these algorithms, so I very might be missing cert
 ain details.

`createPrefixTables` function is the pre-processing done on the string patterns 
that creates the "state machines". Basically, it just creates a table of all 
the available prefixes that the string can jump back to at any current index. 
`getPrefixIndex` is used both in `createPrefixTables` (the preprocessing) and 
the actual string matching process itself (in `scanBytes`). It's the failure 
function or it finds the appropriate index/state/prefix to go back to when a 
mismatch is found (it can be used for the preprocessing since the preprocessing 
is simply finding an index/prefix to jump back to at each index).

test/jdk/build/AbsPathsInImage.java line 281:

> 279:                         break;
> 280:                     }
> 281:                 }

Basically, we go through every search pattern per inputstream (file) byte and 
update the corresponding states accordingly. If we encounter a match, we add it 
to a list of matches.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26030#discussion_r2452852549
PR Review Comment: https://git.openjdk.org/jdk/pull/26030#discussion_r2452848960
PR Review Comment: https://git.openjdk.org/jdk/pull/26030#discussion_r2452860759

Reply via email to