Script 'mail_helper' called by obssrc
Hello community,

here is the log from the commit of package ugrep-indexer for openSUSE:Factory 
checked in at 2023-08-13 19:17:53
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/ugrep-indexer (Old)
 and      /work/SRC/openSUSE:Factory/.ugrep-indexer.new.11712 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "ugrep-indexer"

Sun Aug 13 19:17:53 2023 rev:2 rq:1103626 version:0.9.1

Changes:
--------
--- /work/SRC/openSUSE:Factory/ugrep-indexer/ugrep-indexer.changes      2023-08-10 15:34:50.192514057 +0200
+++ /work/SRC/openSUSE:Factory/.ugrep-indexer.new.11712/ugrep-indexer.changes   2023-08-13 19:18:07.268169346 +0200
@@ -1,0 +2,6 @@
+Sat Aug 12 18:53:19 UTC 2023 - Andreas Stieger <[email protected]>
+
+- update to 0.9.1:
+  * Adds an optional path parameter to index
+
+-------------------------------------------------------------------

Old:
----
  ugrep-indexer-0.9.tar.gz

New:
----
  ugrep-indexer-0.9.1.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ ugrep-indexer.spec ++++++
--- /var/tmp/diff_new_pack.IpQk3B/_old  2023-08-13 19:18:07.884173287 +0200
+++ /var/tmp/diff_new_pack.IpQk3B/_new  2023-08-13 19:18:07.888173312 +0200
@@ -17,7 +17,7 @@
 
 
 Name:           ugrep-indexer
-Version:        0.9
+Version:        0.9.1
 Release:        0
 Summary:        File indexer for accelerated search using ugrep
 License:        BSD-3-Clause

++++++ ugrep-indexer-0.9.tar.gz -> ugrep-indexer-0.9.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/ugrep-indexer-0.9/README.md new/ugrep-indexer-0.9.1/README.md
--- old/ugrep-indexer-0.9/README.md     2023-08-07 03:02:49.000000000 +0200
+++ new/ugrep-indexer-0.9.1/README.md   2023-08-12 20:21:41.000000000 +0200
@@ -1,58 +1,168 @@
-A file indexer to accelerate file searching
-===========================================
+A monotonic indexer to speed up grepping
+========================================
 
-The *ugrep-indexer* utility recursively indexes files to accelerate ugrep
-recursive searches with option `--index`.
+The *ugrep-indexer* utility recursively indexes files to speed up recursive
+grepping.
 
-[ugrep](https://github.com/Genivia/ugrep) is an ultra-fast file searcher that
-supports index-based searching as of v3.12.5.
+*Note: this is a 0.9 beta version of a new generation of "monotonic indexers".
+This release is subject to change and improvements based on experiments and
+user feedback.  Regardless, this implementation has been extensively tested for
+correctness.  Additional features and performance improvements are planned.*
+
+[ugrep](https://github.com/Genivia/ugrep) is a grep-compatible ultra fast file
+searcher that supports index-based searching as of v3.12.5.
+
+Index-based search can be significantly faster on slow file systems and when
+file system caching is ineffective: if the file system on a drive searched is
+not cached in RAM, i.e. it is "cold", then indexing will speed up search.
+Using an index, ugrep only searches those files that may match a specified
+regex pattern: the index allows a quick check for a potential match, so we
+avoid searching all files.
+
+Indexing is designed to be safe and not skip updated files that may now match:
+if any files or directories were changed after indexing, then searching always
+covers these additions and changes by comparing file and directory time
+stamps.  If many files were added or changed, it is worth re-indexing to bring
+the indexes up to date.  Re-indexing is incremental, so it will not take as
+much time as the initial indexing process.
 
-Indexing-based search makes sense if you're doing a recursive search on a lot
-of files.  Index-based searching is typically faster, except for pathelogical
-cases when searching a few files with patterns that match a lot (see Q&A
-below).  Index-based search is significantly faster on slow file systems or
-when file system caching is ineffective.
-
-A typical example of an index-based search:
+A typical example of an index-based search, e.g. on the ugrep v3.12.6
+repository placed on a separate drive:
 
     $ cd drive/ugrep
     $ ugrep-indexer -I
 
-    12245871 bytes scanned and indexed with 19% noise on average
-         1317 files indexed in 28 directories
-            0 new directories indexed
-         1317 new files indexed
-            0 modified files indexed
-            0 deleted files removed from indexes
-          128 binary files skipped with --ignore-binary
-            0 symbolic links skipped
-            0 devices skipped
-      5588843 bytes indexing storage increase at 4243 bytes/file
-
-Searching takes 1.07 seconds without indexing after unmounting the `drive` and
-mounting again to clear FS cache for a fair comparison:
+    12247077 bytes scanned and indexed with 19% noise on average
+        1317 files indexed in 28 directories
+          28 new directories indexed
+        1317 new files indexed
+           0 modified files indexed
+           0 deleted files removed from indexes
+         128 binary files skipped with --ignore-binary
+           0 symbolic links skipped
+           0 devices skipped
+     5605227 bytes indexing storage increase at 4256 bytes/file
+
+Normal searching on a cold file system without indexing takes 1.02 seconds
+after unmounting the `drive` and mounting it again to clear the FS cache, so
+that the effect of indexing can be measured:
 
-    $ cd drive/ugrep
     $ ugrep -I -l 'std::chrono' --stats
     src/ugrep.cpp
 
-    Searched 1317 files in 28 directories in 1.07 seconds with 8 threads: 1 matching (0.07593%)
+    Searched 1317 files in 28 directories in 1.02 seconds with 8 threads: 1 matching (0.07593%)
 
-Searching takes only 0.109 seconds with indexing, which is 10 times faster,
-after unmounting `drive` and mounting again to clear FS cache for a fair
-comparison:
+Ripgrep 13.0.0 takes longer with 1.18 seconds for the same cold search (ripgrep
+skips binary files by default, so option `-I` is not specified):
+
+    $ time rg -l 'std::chrono'
+    src/ugrep.cpp
+        1.18 real         0.01 user         0.06 sys
+
+By contrast, with indexing, searching a cold file system takes only 0.0487
+seconds with ugrep, which is 21 times faster, after unmounting `drive` and
+mounting it again to clear the FS cache:
 
-    $ cd drive/ugrep
     $ ugrep --index -I -l 'std::chrono' --stats
     src/ugrep.cpp
 
-    Searched 1317 files in 28 directories in 0.109 seconds with 8 threads: 1 matching (0.07593%)
+    Searched 1317 files in 28 directories in 0.0487 seconds with 8 threads: 1 matching (0.07593%)
     Skipped 1316 of 1317 files with indexes not matching any search patterns
 
-Index-based search is most effective when searching a lot of files and when our
-regex patterns aren't matching too much, i.e. we want to limit the use of
-unlimited repeats `*` and `+` and limit the use of Unicode character classes
-when possible.  This reduces the ugrep start-up time (see Q&A below).
+There is always some variance in the elapsed time; 0.0487 seconds was the best
+of four search runs, which ranged from 0.0487 seconds (21x speedup) to 0.0983
+seconds (10x speedup).
+
+The speed increase may be significantly higher than in this small demo,
+depending on several factors: the size of the files indexed, the read speed of
+the file system, and whether most files are cold.
+
+The indexing algorithm that I designed is *provably monotonic*: a higher
+accuracy guarantees increased search performance by reducing the false
+positive rate, but also increases the index storage overhead.  Likewise, a
+lower accuracy decreases search performance, but also reduces the index
+storage overhead.  Therefore, I named my indexer a *monotonic indexer*.
+
+If file storage space is at a premium, then we can dial down the index storage
+overhead by specifying a lower indexing accuracy.
+
+Indexing the example from above with level 0 (option `-0`) reduces the indexing
+storage overhead by 8.6 times, from 4256 bytes per file to a measly 490 bytes
+per file:
+
+    12247077 bytes scanned and indexed with 42% noise on average
+        1317 files indexed in 28 directories
+           0 new directories indexed
+        1317 new files indexed
+           0 modified files indexed
+           0 deleted files removed from indexes
+         128 binary files skipped with --ignore-binary
+           0 symbolic links skipped
+           0 devices skipped
+      646123 bytes indexing storage increase at 490 bytes/file
+
+Indexed search is still 12x faster than non-indexed search for this example,
+with 16 files actually searched (15 false positives):
+
+    Searched 1317 files in 28 directories in 0.0722 seconds with 8 threads: 1 matching (0.07593%)
+    Skipped 1301 of 1317 files with indexes not matching any search patterns
+
+Regex patterns that are more complex than this example naturally may have a
+higher false positive rate: the rate of files considered possibly matching
+when they are not.  A higher false positive rate may reduce search speed when
+it is large enough to be impactful.
+
+The following table shows how indexing accuracy affects indexing storage
+and the average noise per file indexed.  The rightmost columns show the search
+speed and false positive rate for `ugrep --index -I -l 'std::chrono'`:
+
+| acc. | index storage (KB) | average noise | false positives | search time (s) |
+| ---- | -----------------: | ------------: | --------------: | --------------: |
+| `-0` |                631 |           42% |              15 |          0.0722 |
+| `-1` |               1276 |           39% |               1 |          0.0506 |
+| `-2` |               1576 |           36% |               0 |          0.0487 |
+| `-3` |               2692 |           31% |               0 |            unch |
+| `-4` |               2966 |           28% |               0 |            unch |
+| `-5` |               4953 |           23% |               0 |            unch |
+| `-6` |               5474 |           19% |               0 |            unch |
+| `-7` |               9513 |           15% |               0 |            unch |
+| `-8` |              10889 |           11% |               0 |            unch |
+| `-9` |              13388 |            7% |               0 |            unch |
+
+If the specified regex matches many more possible patterns, for example with
+the search `ugrep --index -I -l '(todo|TODO)[: ]'`, then we may observe a
+higher rate of false positives among the 1317 files searched, resulting in
+slightly longer search times:
+
+| acc. | false positives | search time (s) |
+| ---- | --------------: | --------------: |
+| `-0` |             189 |           0.292 |
+| `-1` |              69 |           0.122 |
+| `-2` |              43 |           0.103 |
+| `-3` |              19 |           0.101 |
+| `-4` |              16 |           0.097 |
+| `-5` |               2 |           0.096 |
+| `-6` |               1 |            unch |
+| `-7` |               0 |            unch |
+| `-8` |               0 |            unch |
+| `-9` |               0 |            unch |
+
+Accuracy `-5` is the default, which tends to work well when searching with
+regex patterns of modest complexity.
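Internally, the accuracy level maps to a noise ceiling: this release's src/ugrep-indexer.cpp stops halving the index hashes table once the measured noise would exceed a per-accuracy threshold.  A minimal sketch of that mapping (the function name `noise_threshold` is mine, not from the source):

```c
#include <assert.h>

/* Noise ceiling (percent) per accuracy level, from the halving stop condition
   in ugrep-indexer 0.9.1: stop when 100*half_noise >= 10 + 70*(9-acc)/9.
   This gives accuracy 0 -> 80%, 5 (the default) -> ~41.1%, 9 -> 10%. */
static double noise_threshold(int accuracy)
{
  return 10.0 + 70.0 * (9 - accuracy) / 9.0;
}
```

A lower accuracy tolerates more noise per file, so the table is halved more often and the index shrinks, at the cost of more false positives.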
+
+One word of caution: there is always a tiny bit of overhead to check the
+indexes.  If all files are already cached in RAM, because they were searched
+or read recently, then indexing will not necessarily speed up search; in that
+case a non-indexed search might be faster.  Furthermore, an index-based search
+has a longer start-up time.  This start-up time increases when Unicode
+character classes and wildcards are used that must be converted to hash
+tables.
+
+To summarize, index-based search is most effective when searching a lot of
+cold files and when regex patterns aren't matching too much, i.e. we want to
+limit the use of unlimited repeats `*` and `+` and limit the use of Unicode
+character classes when possible.  This reduces the ugrep start-up time and
+limits the rate of false positive pattern matches (see Q&A below).
 
 Quick examples
 --------------
@@ -76,6 +186,11 @@
 
     ugrep-indexer -d
 
+Decrease index file storage to a minimum by decreasing indexing accuracy from 5
+(default) to 0:
+
+    ugrep-indexer -If0
+
 Increase search performance by increasing the indexing accuracy from 5
 (default) to 7 at a cost of larger index files:
 
@@ -92,6 +207,22 @@
 
     sudo make install
 
+Future enhancements
+-------------------
+
+- Index the contents of compressed files and archives to search them faster by
+  skipping non-matching archives.
+
+- Add an option to create one index file, e.g. specified explicitly to ugrep.
+  This could further improve indexed search speed if the index file is located
+  on a fast file system.  Otherwise, do not expect much improvement, or even
+  expect a possible slowdown, since a single index file cannot be searched
+  concurrently, and more index entries will be checked that per-directory
+  indexes would have skipped.  Experiments will tell.
+
+- Indexing tiny files might not be effective at speeding up grepping; this
+  needs further investigation.  The indexer could, for example, skip such
+  tiny files.
+
 Q&A
 ---
 
@@ -101,8 +232,10 @@
 Files indexed are scanned (never changed!) by ugrep-indexer to generate index
 files.
 
-If any files or directories were updated, added or deleted after indexing, then
-you can run ugrep-indexer again.  This incrementally updates all indexes.
+The size of the index files depends on the specified accuracy, with `-0` the
+lowest (small index files) and `-9` the highest (large index files).  The
+default accuracy is `-5`.  See the next Q for details on the impact of accuracy
+on indexing size versus search speed.
 
 Indexing *never follows symbolic links to directories*, because symbolically
 linked directories may be located anywhere in a file system, or in another file
@@ -112,18 +245,27 @@
 Option `-v` (`--verbose`) displays the indexing progress and "noise" of each
 file indexed.  Noise is a measure of *entropy* or *randomness* in the input.  A
 higher level of noise means that indexing was less accurate in representing the
-contents of a file.  For example, a file with random data is hard to index
-accurately and will have a high level of noise.
+contents of a file.  For example, a large file with random data is hard to
+index accurately and will have a high level of noise.
 
-Indexing is not a fast process (ugrep-indexer 0.9 is not yet multi-threaded)
-and can take some time to complete.  When indexing completes, ugrep-indexer
-displays the results of indexing.  The total size of the indexes added and
-average indexing noise is also reported.
+The complexity of indexing is linear in the size of a given file to index.
+In practice it is not a fast process though, not as fast as searching, and may
+take some time to complete a full indexing pass over a large directory tree.
+When indexing completes, ugrep-indexer displays the results of indexing.  The
+total size of the indexes added and the average indexing noise are also
+reported.
+
+Scanning a file to index it first produces a 64KB table of index hashes.
+Then, ugrep-indexer repeatedly halves the table with bit compression, using
+bitwise-and, as long as the target accuracy is not exceeded.  Halving is made
+possible by the fact that the table encodes hashes for 8 windows at offsets
+from the start of the pattern, corresponding to the 8 bits per index hashing
+table cell.  Combining the two halves of the table may flip some bits from
+one to zero, which may cause a false positive match but never a false
+negative.  This is what makes the indexer monotonic.
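The halving step described above can be sketched as follows; `halve` is a hypothetical helper name, not the indexer's actual code.  Since a set bit means "no window with this hash was indexed", folding with bitwise-and can only flip bits from one to zero, so halving can introduce false positives but never false negatives:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the upper half of a power-of-two hashes table into the lower half.
   After folding, lookups mask hash values with (half - 1) instead of
   (size - 1); this is consistent because reducing a value modulo size and
   then modulo half equals reducing it modulo half directly. */
static size_t halve(uint8_t *hashes, size_t size)
{
  size_t half = size / 2;
  for (size_t i = 0; i < half; ++i)
    hashes[i] &= hashes[i + half]; /* 1 & 0 -> 0: may add false positives */
  return half; /* the new table size */
}
```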
 
 The ugrep-indexer understands "binary files", which can be skipped and not
 indexed with ugrep-indexer option `-I` (`--ignore-binary`).  This is useful
 when searching with ugrep option `-I` (`--ignore-binary`) to ignore binary
-files.
+files, which is a typical scenario.
 
 The ugrep-indexer also supports .gitignore files (and similar), specified with
 ugrep-indexer option `-X` (`--ignore-files`).  Ignored files and directories
@@ -146,11 +288,67 @@
 Option `-c` (`--count`) with `--index` automatically enables `--min-count=1` to
 skip all files with zero matches.
 
-Regex patterns are converted internally by ugrep with option `--index` to hash
-tables for up to the first 16 bytes of the regex patterns specified, possibly
-shorter in order to reduce construction time.  Therefore, first characters of a
-regex pattern to search are most critical to limit so-called false positive
-matches that will slow down searching.
+If any files or directories were updated, added or deleted after indexing, then
+ugrep `--index` will always search these when they are present on the recursive
+search path.  You can run ugrep-indexer again to incrementally update all
+indexes.
+
+Regex patterns are converted internally by ugrep with option `--index` to a
+form of hash tables for up to the first 16 bytes of the regex patterns
+specified, possibly shorter in order to reduce construction time.  Therefore,
+the first characters of a regex pattern to search are most critical to limit
+so-called false positive matches that may slow down searching.
+
+More specifically, a regex pattern is converted to a DFA.  An indexing hash
+finite automaton (HFA) is constructed on top of the DFA to compactly represent
+hash tables as state transitions with labelled edges.  This HFA consists of up
+to eight layers, each shifted by one byte to represent the next 8-byte window
+over the pattern.  Each HFA layer encodes index hashes for that part of the
+pattern.  The index hash function chosen is "additive", meaning the next byte
+is added when hashed with the previous hash.  This is very important as it
+critically reduces the HFA construction overhead.  We can now encode labelled
+HFA transitions to states as multiple edges with 16-bit hash value ranges
+instead of a set of single edges each with an individual hash value.  To this
+end, I use my open-ended ranges library `reflex::ORanges<T>` derived from
+`std::set<T>`.
+
+A very simple `maybe_match()` function using the prime 61 index hash function
+is given below to demonstrate index-based searching of a single string:
+
+    // prime 61 hashing
+    uint16_t indexhash(uint16_t h, uint8_t b, size_t size)
+    {
+      return ((h << 6) - h - h - h + b) & (size - 1);
+    }
+
+    // return possible match of string given array of hashes of size <= 64K (power of two)
+    bool maybe_match(const char *string, uint8_t *hashes, size_t size)
+    {
+      size_t len = strlen(string); // practically we can and should limit len to e.g. 15 or 16
+      for (const char *window = string; len > 0; ++window, --len)
+      {
+        uint16_t h = window[0] & (size - 1);
+        if (hashes[h] & 0x01)
+          return false;
+        size_t k, n = len < 8 ? len : 8;
+        for (k = 1; k < n; ++k)
+        {
+          h = indexhash(h, window[k], size);
+          if (hashes[h] & (1 << k))
+            return false;
+        }
+      }
+      return true;
+    }
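To make this sketch runnable end to end, a hypothetical index builder mirroring `maybe_match()` can be written as follows: it starts with all bits set ("hash not seen") and clears bit k at the hash of every window of the indexed text.  The builder and its name are my illustration of the scheme, not code taken from ugrep-indexer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// prime 61 hashing, as in the README: h -> 61*h + b, masked to the table size
static uint16_t indexhash(uint16_t h, uint8_t b, size_t size)
{
  return (uint16_t)(((h << 6) - h - h - h + b) & (size - 1));
}

// hypothetical builder: clear bit k at the hash of every (k+1)-byte window
static void build_hashes(const char *text, uint8_t *hashes, size_t size)
{
  memset(hashes, 0xFF, size); // all bits set: "hash not seen"
  size_t len = strlen(text);
  for (const char *window = text; len > 0; ++window, --len)
  {
    uint16_t h = (uint16_t)(window[0] & (size - 1));
    hashes[h] &= (uint8_t)~0x01;
    size_t k, n = len < 8 ? len : 8;
    for (k = 1; k < n; ++k)
    {
      h = indexhash(h, (uint8_t)window[k], size);
      hashes[h] &= (uint8_t)~(1 << k);
    }
  }
}

// maybe_match() from above, repeated so this sketch compiles on its own
static bool maybe_match(const char *string, uint8_t *hashes, size_t size)
{
  size_t len = strlen(string);
  for (const char *window = string; len > 0; ++window, --len)
  {
    uint16_t h = (uint16_t)(window[0] & (size - 1));
    if (hashes[h] & 0x01)
      return false;
    size_t k, n = len < 8 ? len : 8;
    for (k = 1; k < n; ++k)
    {
      h = indexhash(h, (uint8_t)window[k], size);
      if (hashes[h] & (1 << k))
        return false;
    }
  }
  return true;
}
```

With this builder, a substring of the indexed text can never be rejected (no false negatives), while a string whose first byte does not occur anywhere in the text is rejected by the very first window check.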
+
+The prime 61 hash was chosen among many other possible hashing functions using
+a realistic experimental setup.  A candidate hashing function is tested by
+repeatedly searching for a randomly-drawn word from a 100MB Wikipedia file
+with one, two or three mutated characters.  The mutation is made to ensure the
+word does not correspond to an actual valid word in the Wikipedia file.  Then
+the false positive rate is recorded when a mutated word matches the file.  A
+hash function with a minimal false positive rate should be a good candidate
+overall.
 
 ### Q: What is indexing accuracy?
 
@@ -160,7 +358,9 @@
 causes ugrep to sometimes search indexed files that do not match.  We call
 these "false positive matches".  Higher accuracy requires larger index files.
 Normally we expect 4K or less indexing storage per file on average.  The
-maximum is 64KB of index storage per file.
+minimum is 128 bytes of index storage per file, excluding the file name and
+a 4-byte index header.  The maximum is 64K bytes storage per file for very
+large noisy files.
 
 When searching indexed files, ugrep option `--stats` shows the search
 statistics after the indexing-based search completed.  When many files are not
@@ -178,3 +378,12 @@
 `*` and `+` repeats.  To find out how the start-up time increases, use option
 `ugrep --index -r PATTERN /dev/null --stats=vm` to search /dev/null with your
 PATTERN.
+
+### Q: Why are index files not compressed?
+
+Index files should be very dense in information content, and that is the case
+with this new indexing algorithm that I designed and implemented for ugrep.
+The denser an index file is, the more compactly and accurately it represents
+the original file data, which makes index files hard or impossible to
+compress.  Density is also a good indicator of how effective an index file
+will be in practice.
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/ugrep-indexer-0.9/man/ugrep-indexer.1 new/ugrep-indexer-0.9.1/man/ugrep-indexer.1
--- old/ugrep-indexer-0.9/man/ugrep-indexer.1   2023-08-07 03:02:49.000000000 +0200
+++ new/ugrep-indexer-0.9.1/man/ugrep-indexer.1 2023-08-12 20:21:41.000000000 +0200
@@ -1,4 +1,4 @@
-.TH UGREP-INDEXER "1" "August 06, 2023" "ugrep-indexer 0.9" "User Commands"
+.TH UGREP-INDEXER "1" "August 12, 2023" "ugrep-indexer 0.9.1" "User Commands"
 .SH NAME
 \fBugrep-indexer\fR -- file indexer for accelerated ugrep search
 .SH SYNOPSIS
@@ -8,6 +8,10 @@
 recursive searches with \fBugrep\fR option \fB--index\fR.
 .PP
 The following options are available:
+Usage:
+ugrep\-indexer [\fB\-0\fR|...|\fB\-9\fR] [\fB\-.\fR] [\fB\-c\fR|\fB\-d\fR|\fB\-f\fR] [\fB\-I\fR] [\fB\-q\fR] [\fB\-S\fR] [\fB\-s\fR] [\fB\-X\fR] [\fB\-z\fR] [\fIPATH\fR]
+.TP
+PATH    Optional pathname to the root of the directory tree to index.
 .TP
 \fB\-0\fR, \fB\-1\fR, \fB\-2\fR, \fB\-3\fR, ..., \fB\-9\fR, \fB\-\-accuracy\fR=\fIDIGIT\fR
 Specifies indexing accuracy.  A low accuracy reduces the indexing
@@ -59,7 +63,7 @@
 \fB\-z\fR, \fB\-\-decompress\fR
 Index the contents of compressed files and archives.
 This option is not yet available in this version.
-ugrep\-indexer 0.9 beta
+ugrep\-indexer 0.9.1 beta
 License BSD\-3\-Clause: <https://opensource.org/licenses/BSD\-3\-Clause>
 Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
 .SH "EXIT STATUS"
@@ -88,6 +92,11 @@
 .IP
 $ ugrep-indexer -d
 .PP
+Decrease index file storage to a minimum by decreasing indexing accuracy from 5
+(default) to 0:
+.IP
+$ ugrep-indexer -If0
+.PP
 Increase search performance by increasing the indexing accuracy from 5
 (default) to 7 at a cost of larger index files:
 .IP
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/ugrep-indexer-0.9/man.sh new/ugrep-indexer-0.9.1/man.sh
--- old/ugrep-indexer-0.9/man.sh        2023-08-07 03:02:49.000000000 +0200
+++ new/ugrep-indexer-0.9.1/man.sh      2023-08-12 20:21:41.000000000 +0200
@@ -77,6 +77,11 @@
 .IP
 $ ugrep-indexer -d
 .PP
+Decrease index file storage to a minimum by decreasing indexing accuracy from 5
+(default) to 0:
+.IP
+$ ugrep-indexer -If0
+.PP
 Increase search performance by increasing the indexing accuracy from 5
 (default) to 7 at a cost of larger index files:
 .IP
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/ugrep-indexer-0.9/src/ugrep-indexer.cpp new/ugrep-indexer-0.9.1/src/ugrep-indexer.cpp
--- old/ugrep-indexer-0.9/src/ugrep-indexer.cpp 2023-08-07 03:02:49.000000000 +0200
+++ new/ugrep-indexer-0.9.1/src/ugrep-indexer.cpp       2023-08-12 20:21:41.000000000 +0200
@@ -34,7 +34,7 @@
 @copyright (c) BSD-3 License - see LICENSE.txt
 */
 
-#define UGREP_INDEXER_VERSION "0.9 beta"
+#define UGREP_INDEXER_VERSION "0.9.1 beta"
 
 // check if we are compiling for a windows OS, but not Cygwin or MinGW
 #if (defined(__WIN32__) || defined(_WIN32) || defined(WIN32) || defined(__BORLANDC__)) && !defined(__CYGWIN__) && !defined(__MINGW32__) && !defined(__MINGW64__)
@@ -104,14 +104,23 @@
 #include <vector>
 #include <stack>
 
+// number of bytes to gulp into the buffer to index a file
 #define BUF_SIZE 65536
+
+// smallest possible power-of-two size of an index of a file, should be > 61
 #define MIN_SIZE 128
 
+// default --ignore-files=FILE argument
 #define DEFAULT_IGNORE_FILE ".gitignore"
 
+// fixed constant strings
 const char ugrep_index_filename[] = "._UG#_Store";
 const char ugrep_index_file_magic[5] = "UG#\x03";
 
+// command-line optional PATH argument
+const char *arg_pathname = NULL;
+
+// command-line options
 int flag_accuracy = 6;
 bool flag_check = false;
 bool flag_decompress = false;
@@ -131,20 +140,26 @@
   std::vector<std::string> dirs;
 };
 
-// stack of ignore files/dirs
+// stack of ignore file/dir globs per ignore-file found
 std::stack<Ignore> ignore_stack;
 
 // entry data extracted from directory contents, moves pathname to this entry
 struct Entry {
 
+  // indexing is initiated with the pathname to the root of the directory to index
   Entry(const char *pathname = ".")
     :
       pathname(pathname), // the working dir by default
       base(0),
       mtime(~0ULL), // max time to make sure we check the working directory for updates
       size(0)
-  { }
+  {
+    const char *sep = strrchr(pathname, PATHSEPCHR);
+    if (sep != NULL)
+      base = strlen(sep) - 1;
+  }
 
+  // new pathname entry, note this moves the pathname to the entry that owns it now
   Entry(std::string& pathname, size_t base, uint64_t mtime, off_t size)
     :
       pathname(std::move(pathname)),
@@ -175,7 +190,7 @@
   }
 
   std::string pathname; // full pathname
-  size_t      base;     // size of the basename in the pathname
+  size_t      base;     // length of the basename in the pathname
   uint64_t    mtime;    // modification time
   off_t       size;     // file size
 
@@ -193,7 +208,8 @@
 // display a help message and exit
 void help()
 {
-  std::cout << "Usage: ugrep-indexer [-0|...|-9] [-.] [-c|-d|-f] [-I] [-q] [-S] [-s] [-X] [-z]\n\n\
+  std::cout << "\nUsage:\n\nugrep-indexer [-0|...|-9] [-.] [-c|-d|-f] [-I] [-q] [-S] [-s] [-X] [-z] [PATH]\n\n\
+    PATH    Optional pathname to the root of the directory tree to index.\n\n\
     -0, -1, -2, -3, ..., -9, --accuracy=DIGIT\n\
             Specifies indexing accuracy.  A low accuracy reduces the indexing\n\
             storage overhead at the cost of a higher rate of false positive\n\
@@ -392,6 +408,14 @@
         }
       }
     }
+    else if (arg_pathname == NULL)
+    {
+      arg_pathname = arg;
+    }
+    else
+    {
+      usage("argument PATH already specified as ", arg_pathname);
+    }
   }
 
   if (flag_check)
@@ -408,7 +432,6 @@
 #if defined(HAVE_F_RDAHEAD)
   if (strchr(mode, 'a') == NULL && strchr(mode, 'w') == NULL)
   {
-    // removed O_NOATIME which may fail
 #if defined(O_NOCTTY)
     int fd = open(filename, O_RDONLY | O_NOCTTY);
 #else
@@ -594,7 +617,7 @@
 
     half_noise /= 8 * half;
 
-    // stop at accuracy 0 -> 70% and 9 -> 10% default 5 -> 36.7% (4 -> 43.3%, 6 -> 30%)
+    // stop at accuracy 0 -> 80% and 9 -> 10% default 5 -> 41.1% (4 -> 48.9%, 6 -> 33%)
     if (100.0 * half_noise >= 10.0 + 70.0 * (9 - flag_accuracy) / 9.0)
       break;
 
@@ -874,7 +897,7 @@
 }
 
 // recursively delete index files
-void deleter()
+void deleter(const char *pathname)
 {
   flag_no_messages = true;
 
@@ -891,7 +914,11 @@
   uint64_t index_time;
   uint64_t last_time;
 
-  dir_entries.emplace();
+  // pathname to the directory tree to index or .
+  if (pathname == NULL)
+    dir_entries.emplace();
+  else
+    dir_entries.emplace(pathname);
 
   // recurse subdirectories breadth-first to remove index files
   while (!dir_entries.empty())
@@ -901,6 +928,7 @@
 
     cat(visit.pathname, dir_entries, file_entries, num_dirs, num_links, num_other, ign_dirs, ign_files, index_time, last_time, true);
 
+    // if index time is nonzero, there is a valid index file in this directory we should remove
     if (index_time > 0)
    {
      index_filename.assign(visit.pathname).append(PATHSEPSTR).append(ugrep_index_filename);
@@ -910,7 +938,7 @@
 }
 
 // recursively index files
-void indexer()
+void indexer(const char *pathname)
 {
   std::stack<Entry> dir_entries;
   std::vector<Entry> file_entries;
@@ -933,7 +961,11 @@
   float sum_noise = 0;
   uint8_t hashes[65536];
 
-  dir_entries.emplace();
+  // pathname to the directory tree to index or .
+  if (pathname == NULL)
+    dir_entries.emplace();
+  else
+    dir_entries.emplace(pathname);
 
   // recurse subdirectories
   while (!dir_entries.empty())
@@ -1209,9 +1241,9 @@
   options(argc, argv);
 
   if (flag_delete)
-    deleter();
+    deleter(arg_pathname);
   else
-    indexer();
+    indexer(arg_pathname);
 
   return EXIT_SUCCESS;
 }
