Hi, I've been playing around with src/kwset.c, and I've found a way to reduce the memory footprint when matching by a factor of 3 and to improve locality of reference: collect and sort all of the reversed keywords before starting the trie build, which allows the trie layout to be planned in advance and reduces the number of pointers required.
This gives a speedup on some benchmarks with large numbers of keywords. For example, matching 32-byte random hex keywords against random hex input is just over twice as fast in my tests, which use keyword counts between 100k and 1M. The setup time is also reduced by about a factor of 2.

I've yet to find a benchmark in which the modified code is measurably slower than the original at match time. I have found one case where it takes about twice as long at prep time: 8M short keywords, each adding a single trie node, arranged in an order that gives good locality of reference to the original trie-building code but contains no pre-sorted runs of reversed keywords.

Proof-of-concept patches follow. All feedback is welcome, particularly test/benchmark results on hardware other than i386. If some variant of this idea is judged suitable to go into GNU grep, then I expect the first step would be to add more tests for kwset.

Nick
