Re: fstrcmp

2009-06-07 Thread Peter Miller
On Tue, 2009-06-02 at 10:28 -0400, Michael Poole wrote:
 I have some relevant experience that makes me skeptical about needing
 sophisticated structures to achieve acceptable performance

Some concrete numbers:

- naive fstrcmp comparisons (using edit distance, lifted from GNU
Gettext)
- list of names obtained from apt-cache dumpavail
- 2.1GHz x86

- looking for dnsutl

0.08008 = 0.215 sec / 26866 pair
that's 8.0us per fstrcmp call

- looking for dns-server-utilities

0.15816 = 0.425 sec / 26866 pair
that's 15us per fstrcmp call

For me, adding ~0.3 second to the error case in order to provide a far
better error message would appear to be worth it, especially as this is
less than the existing 0.532s reported by time sudo apt-get install
dnsutl (not including I/O time) on the same system.

I/O time when none of the package data is in cache is  9 seconds,
confirming what Martin has been saying... and ~30 times longer than the
fstrcmp overhead.


I will have version 0.1 ready shortly.


Regards
Peter Miller mill...@opensource.org.au
/\/\*http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53  2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

A computer is like air conditioning: it becomes useless when you open
windows. -- Linus Torvalds


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-06-02 Thread Jérôme Pouiller
On Sunday 31 May 2009 03:49:37 Peter Miller wrote:
[...]
This goes for packages as well.  Wouldn't it be great if

apt-get install dns-utils

instead of saying

E: Couldn't find package dns-utils

it said something more useful, like

E: Couldn't find package dns-utils, did you mean dnsutils instead?

It is naive to think matching algorithm iterates on all items until it 
find the correct one. At least, algorithm use a sorted index with a 
dichotomy search. 

Nevertheless, your idea is interesting. But you should implement a 
function to match the nearest string in a set of strings. Take a look in 
spell checking libraries to have an idea how to implement it.


-- 
Jérôme Pouiller (jezz AT sysmic DOT org)


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-06-02 Thread Michael Poole
Jérôme Pouiller writes:

 On Sunday 31 May 2009 03:49:37 Peter Miller wrote:
 [...]
This goes for packages as well.  Wouldn't it be great if

apt-get install dns-utils

instead of saying

E: Couldn't find package dns-utils

it said something more useful, like

E: Couldn't find package dns-utils, did you mean dnsutils instead?

 It is naive to think matching algorithm iterates on all items until it 
 find the correct one. At least, algorithm use a sorted index with a 
 dichotomy search. 

 Nevertheless, your idea is interesting. But you should implement a 
 function to match the nearest string in a set of strings. Take a look in 
 spell checking libraries to have an idea how to implement it.

Isn't this optimization premature?  I would say: Package the library,
implement the fuzzy matching, and if it is too slow for people to like
the case where they misspell a package name, *then* optimize for
run-time.  I would rather have the fuzzy matching sooner than have it
shave a few milliseconds off the display time for a correction.

Michael Poole


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-06-02 Thread Jérôme Pouiller
On Tuesday 02 June 2009 12:32:47 Michael Poole wrote:
 Jérôme Pouiller writes:
[...]
  It is naive to think matching algorithm iterates on all items until
  it find the correct one. At least, algorithm use a sorted index
  with a dichotomy search.
 
  Nevertheless, your idea is interesting. But you should implement a
  function to match the nearest string in a set of strings. Take a
  look in spell checking libraries to have an idea how to implement
  it.

 Isn't this optimization premature?  I would say: Package the library,
 implement the fuzzy matching, and if it is too slow for people to
 like the case where they misspell a package name, *then* optimize for
 run-time.  I would rather have the fuzzy matching sooner than have it
 shave a few milliseconds off the display time for a correction.

In another thread, Adeodato Simó wrote:
 I can't see how it'd work here, at least without the help of some
 on-disk structure, since we're talking about a space of 25,000
 packages.

Naive search of matching string under a set of 25,000 strings is 
something like 2000 times slower (maybe far more) than a correct 
algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
idea because of performances and I suggested a better algorithm should 
be used.


-- 
Jérôme Pouiller (jezz AT sysmic DOT org)


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-06-02 Thread Daniel Burrows
On Tue, Jun 02, 2009 at 02:32:06PM +0200, Jérôme Pouiller j...@sysmic.org was 
heard to say:
 In another thread, Adeodato Simó wrote:
  I can't see how it'd work here, at least without the help of some
  on-disk structure, since we're talking about a space of 25,000
  packages.
 
 Naive search of matching string under a set of 25,000 strings is 
 something like 2000 times slower (maybe far more) than a correct 
 algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
 idea because of performances and I suggested a better algorithm should 
 be used.

  aptitude already scans every package regularly.  For instance, when
you ask it to install something that doesn't exist, it checks every
package name using a naive substring algorithm.  According to time,
this takes 1.5 to 1.7 seconds on my computer, most of which appears
to be spent initializing the program.

  Fuzzy matches are more expensive, of course.  It might even be
interesting to just do an edit distance computation up to k=1 and see
how that performed, though; I bet it would be reasonable with some mild
optimization.

  Daniel


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-06-02 Thread Michael Poole
Jérôme Pouiller writes:

 In another thread, Adeodato Simó wrote:
 I can't see how it'd work here, at least without the help of some
 on-disk structure, since we're talking about a space of 25,000
 packages.

 Naive search of matching string under a set of 25,000 strings is 
 something like 2000 times slower (maybe far more) than a correct 
 algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
 idea because of performances and I suggested a better algorithm should 
 be used.

If we are pulling numbers out of the air, reading data from disk is
something like 2000 times slower (maybe far more) than doing string
comparisons.  I/O may dominate the running time even for a naive
comparison.

I have some relevant experience that makes me skeptical about needing
sophisticated structures to achieve acceptable performance: Ten years
ago, I wrote a simple spell checker as part of a job interview.  It
was able to suggest (ranked) close matches for misspelled words from a
~100,000-word dictionary within a few milliseconds.  Given that I had
less than an hour's notice about what I would have to write, and only
about 4 hours to write and test the program, I would be very surprised
if performance for any reasonable algorithm on more modern machines
will be bad enough to make it unusable.

It is not as if the application code should require significant
rewriting based on whether it performs an element-wise comparison or
whether it passes some function an array of strings against which to
match.  The library code is what would need much deeper changes.  So
why not see how well the existing library works before demanding that
it be radically rewritten?

If you want to implement a superior algorithm, no one will stop you.
If the fuzzy match only runs when no exact match is found, it suffers
no slowdown on success and may be fast enough when there is incorrect
input.  If there are people who think it is unacceptably slow, let
them come forward with actual numbers in hand.  Insisting on much more
development work because of an unsubstantiated concern seems like a
poor trade-off.

Michael


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-05-31 Thread Florian Weimer
* Peter Miller:

 I've been considering turning my fuzzy string compare function into a
 library.

I would certainly welcome that.

Would you be willing to relicense it under a more permissive license,
so that we don't have to worry about OpenSSL license compatibility
etc.?

 /**
  * the fstrcmp function compare two strings, to determine how
  * similar two strings appear.
  *
  * @param s1
  * The first of the strings to compare.
  * @param s2
  * The second of the strings to compare.
  * @returns
  * a number between 0.0 and 1.0; 0.0 means the strings are
  * nothing alike, 1.0 means the two strings are identical.
  */
 double fstrcmp(const char *s1, const char *s2);

It could be helpful if it didn't use floating point because we support
some systems where floating point is software-emulated.

I don't think we've got a library of C goodies.  libbsd is something
in that direction, but if your function isn't in the BSDs, it probably
doesn't fit there, either.


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-05-31 Thread Adeodato Simó
+ Peter Miller (Sun, 31 May 2009 11:49:37 +1000):

 Wouldn't it be great if when you typed
 apt-get build-deps gcc

 instead of saying
 E: Invalid operation build-deps

 it said something more useful, like
 E: Invalid operation build-deps, did you mean build-dep instead?

I guess that's would be doable, since it'd involve calculating distances
to all registered commands, which are only a handful. However:

 This goes for packages as well.  Wouldn't it be great if
 apt-get install dns-utils

 instead of saying
 E: Couldn't find package dns-utils

 it said something more useful, like
 E: Couldn't find package dns-utils, did you mean dnsutils instead?

I can't see how it'd work here, at least without the help of some
on-disk structure, since we're talking about a space of 25,000 packages.

-- 
- Are you sure we're good?
- Always.
-- Rory and Lorelai


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: fstrcmp

2009-05-31 Thread William Pitcock
On Sun, 2009-05-31 at 11:04 +0200, Florian Weimer wrote:
 * Peter Miller:
 
  I've been considering turning my fuzzy string compare function into a
  library.
 
 I would certainly welcome that.
 
 Would you be willing to relicense it under a more permissive license,
 so that we don't have to worry about OpenSSL license compatibility
 etc.?
 
  /**
   * the fstrcmp function compare two strings, to determine how
   * similar two strings appear.
   *
   * @param s1
   * The first of the strings to compare.
   * @param s2
   * The second of the strings to compare.
   * @returns
   * a number between 0.0 and 1.0; 0.0 means the strings are
   * nothing alike, 1.0 means the two strings are identical.
   */
  double fstrcmp(const char *s1, const char *s2);
 
 It could be helpful if it didn't use floating point because we support
 some systems where floating point is software-emulated.
 
 I don't think we've got a library of C goodies.  libbsd is something
 in that direction, but if your function isn't in the BSDs, it probably
 doesn't fit there, either.

There is libmowgli (don't ask why it's named this, we just picked
something random), but it requires all core modules to be licensed under
ISC.  There is some Windows portability components that are LGPL.

Unfortunately, we're renovating the atheme.org website right now, so
instructions to submit additional components are not yet available.

William


signature.asc
Description: This is a digitally signed message part