Re: +Idx problems maybe?

Alexander Burger Mon, 02 Nov 2009 23:46:15 -0800

Hi Henrik,

> I took a look at the pilog file, I already get what same and range are
> doing but what are part, head and fold doing?


You are on the right track. You used 'tolr', but this actually makes
sense only in combination with the '+Sn' (Soundex) prefix. The whole
matter is rather complicated, because there are so many combinations of
index types and Pilog comparison functions possible.


I would say that we have the following typical use cases for string
searches (I'll leave out numerical searches, which usually combine with
'same' or 'range').

1. "Exact" searches. You have either a unique index

      (rel key (+Key +String))

   or a non-unique index

      (rel key (+Ref +String))

   and you can compare results in Pilog with

      (same @Str @Cls key)

   for exact matches, or with

      (head @Str @Cls key)

   for "dictionary" searches (searching only for the beginning of
   strings). These are case-sensitive searches.


2. "Folded" searches. They make use of the 'fold' function which keeps
   only letters, converted to lower case, and digits.

      (rel key (+Fold +Ref +String))
      ...
      (fold @Str @Cls key)

   This searches only for the beginning of strings. We use it typically
   for telephone numbers.


   If a search for individual words in a key is desired, we can use

      (rel key (+List +Fold +Ref +String))
      ...
      (fold @Str @Cls key)

   This stores only the strings in the list (not the substrings) in
   'fold'ed representation. So each word can be found by "dictionary"
   search. This requires changes to the GUI and import functions,
   though, as 'key' is not a string but a list of strings.


   Finally, we can also index folded substrings:

      (rel key (+Fold +Idx +String))
      ...
      (part @Str @Cls key)

   This is perhaps what you need. If you go for it, I'd recommend
   downloading the latest testing release again, as the 'part'
   function was changed recently.


3. "Tolerant" searches. They first return all exact (case-sensitive)
   matches of partial strings, and then the matches according to the
   soundex algorithm, where the first letter is compared exactly
   (case-sensitive) and the rest is checked for similarity. This
   mainly makes sense for personal names.

      (rel key (+Sn +Idx +String))
      ...
      (tolr @Str @Cls key)
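
To see what a 'fold'ed index from case 2 actually stores, 'fold' can be
called directly at the REPL. A small sketch (the phone numbers are made
up for illustration):

      : (fold " 1A 2-b/3")             # Only letters and digits survive
      -> "1a2b3"
      : (fold "+1 (555) 010-0199")     # Punctuation and spaces are dropped
      -> "15550100199"
      : (= (fold "555 010-0199") (fold "555.0100199"))
      -> T

So differently formatted inputs end up under the same key, as long as
their letters and digits agree.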


Concerning space consumption, the '+Key' and '+Ref' indexes are the most
economical ones. They create only a single entry in the index tree per
key.

Then follow the '+List +Ref +String' indexes, which create an entry per
word.

Most space-hungry are the '+Idx' indexes, as they create an entry for
each substring down to a length of three, and '+Sn' adds one more for
the soundex key.
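
As a rough sketch of how these trade-offs might be spread over a single
entity, picking the cheapest index that still supports the desired
search for each attribute (the class '+Contact' and the relation names
'id', 'nm', 'tel' and 'memo' are hypothetical, just for illustration):

      (class +Contact +Entity)            # Hypothetical example class
      (rel id (+Key +String))             # Exact, unique: one entry per key
      (rel tel (+Fold +Ref +String))      # Folded, non-unique: one entry per key
      (rel memo (+Fold +Idx +String))     # Folded substrings: an entry per substring
      (rel nm (+Sn +Idx +String))         # Tolerant: substrings plus the soundex key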

Cheers,
- Alex