| As I understand it, the tune finder does the following for every tune:
|
| 1. Extracts it from the source file
| 2. puts it into a canonical form:
|      a. strips leading and trailing blank lines
|      b. changes the line endings to a standard form
|      c. (dunno about this one) strips trailing white-space from lines

All correct.
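
For anyone who wants to experiment, here is a minimal Python sketch of the
canonical form described in step 2. The function name `canonicalize` is made
up for illustration, and this is not the tune finder's actual code, just one
way the three sub-steps could be done.

```python
def canonicalize(tune_text: str) -> str:
    """Return a canonical form of one tune's text (illustrative only)."""
    # 2b: change line endings to a standard form (CRLF and lone CR become LF).
    text = tune_text.replace("\r\n", "\n").replace("\r", "\n")
    # 2c: strip trailing white-space from each line.
    lines = [line.rstrip() for line in text.split("\n")]
    # 2a: strip leading and trailing blank lines.
    while lines and lines[0] == "":
        lines.pop(0)
    while lines and lines[-1] == "":
        lines.pop()
    # End with exactly one newline, unless nothing is left.
    return "\n".join(lines) + "\n" if lines else ""
```

Two copies of a tune that differ only in line endings, trailing blanks, or
surrounding blank lines come out byte-for-byte identical.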

| 3. Stores a copy of the canonical form in a database,

No, it doesn't do this.  In fact, some people have already  suggested
that they would treat this as a copyright violation.  Of course, they
must also be considering  lawsuits  against  google.com  and  several
other search sites.

But my main reason is that I'm already one of  the  larger  users  of
disk  space on this MIT EE Dept machine where I have a guest account.
This would at least double the amount of space that I use.   I  don't
think I want to wear out my welcome like that. I mean, the account is
free (as in beer).

If a good reason to store all the tunes this way were to  arise,  I'd
first  want to have a serious discussion with the admins and see what
they thought of the idea. In the past, they have been very supportive
of  people  doing  interesting  things on these machines, and several
other users are real disk hogs, putting up videos on their web sites.
They've been very supportive of my work, to the point of asking me to
be the first tester of assorted upgrades and giving me permission  to
tweak the web server.  Still, ...

I do have it all on my home machine (and on one of Toby's  machines),
and  maybe  I'll  try some ideas there.  At home, there's the problem
that the local ISP (rcn.com) has been blocking port 80  for  about  a
year,  and  probably  won't  open  it  up without either a lawsuit or
effective political pressure. I do have a server on another port, but
that's  known  to  be  an  effective  barrier for a large part of the
online user population.  And my home machine's web  stuff  is  flakey
because I'm always trying new things out there.

(A  couple  years  ago,  I  had  access  through  mediaone.com,   now
attbi.com, but I ran my search bot a couple of times, and they kicked
me off for "hacking".  I guess  developing  web  search  software  is
something that's permitted to companies but not to mere humans.  ;-)

| 4. Stores index info (title, key, meter, author, etc) in the database as well
| 5. When asked, retrieves the tune and converts it to other formats.

These are correct.

| Would it be possible, or make sense, to also store a hash (like a CRC-32 or
| an MD5 signature) of the canonical tune in the database, and only show
| unique hashes on a query?

Possibly, but the chances are that  this  would  mostly  exclude  the
primary source of a tune and show only a mirror.  And when the mirror
is down, there would be no way to find the others.
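
To make the trade-off concrete, here is a rough Python sketch of the hash
idea. It groups results by MD5 of the canonical text instead of discarding
duplicates, so every mirror's URL stays reachable. The grouping approach, the
function name `group_by_hash`, and the input layout (pairs of URL and
canonical text) are all assumptions for illustration; nothing like this is in
the tune finder.

```python
import hashlib
from collections import defaultdict


def group_by_hash(results):
    """Group (url, canonical_text) pairs by the MD5 of the text.

    Returns a dict mapping hex digest -> list of URLs, in input order.
    """
    groups = defaultdict(list)
    for url, text in results:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return dict(groups)
```

Showing one line per digest with the whole URL list behind it would cut the
clutter without hiding the primary source or the other mirrors. Showing only
the first URL per digest is the version that loses them.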

| I think that would remove a large number of spurious duplicate results from
| the query results.   It wouldn't remove the work of different
| transcriptionists, or variants in a tune, but it would remove duplications
| caused by multiple copies of the same transcription floating around the web.

Well, I personally like seeing the spurious duplicate  results.   The
web  can  be sufficiently flakey at times that it's useful to be able
to kill the download that's hung and go on to the next link.

Maybe some day, when all the ISPs (and backbones) in  the  Net  start
providing good, reliable IP connectivity, this won't be important.  I
don't expect to live to see that day.   At  present,  the  motion  is
mostly in the other direction.

A current computer industry news story is that Yahoo has  entered  an
agreement  with  the  Chinese government to supply filtering software
and support for the  "Public  Pledge  on  Self-discipline  for  China
Internet Industry".  This isn't a good sign for Net reliability.

If you want to see an especially funny story about such things, try:
  http://yro.slashdot.org/article.pl?sid=02/07/15/2039227&mode=thread&tid=153

When the commercial world is involved in such  perversity,  the  most
sensible  thing  is  to  encourage mirroring and redundancy in online
information.

To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html
