| As I understand it, the tune finder does the following for every tune:
|
| 1. Extracts it from the source file
| 2. puts it into a canonical form:
|    a. strips leading and trailing blank lines
|    b. changes the line endings to a standard form
|    c. (dunno about this one) strips trailing white-space from lines
All correct.

| 3. Stores a copy of the canonical form in a database,

No, it doesn't do this. In fact, some people have already suggested that they would treat this as a copyright violation. Of course, they must also be considering lawsuits against google.com and several other search sites.

But my main reason is that I'm already one of the larger users of disk space on this MIT EE Dept machine where I have a guest account. Storing the tunes would at least double the amount of space that I use. I don't think I want to wear out my welcome like that. I mean, the account is free (as in beer).

If a good reason to store all the tunes this way were to arise, I'd first want to have a serious discussion with the admins and see what they thought of the idea. In the past, they have been very supportive of people doing interesting things on these machines, and several other users are real disk hogs, putting up videos on their web sites. They've been very supportive of my work, to the point of asking me to be the first tester of assorted upgrades and giving me permission to tweak the web server. Still, ...

I do have it all on my home machine (and on one of Toby's machines), and maybe I'll try some ideas there. At home, there's the problem that the local ISP (rcn.com) has been blocking port 80 for about a year, and probably won't open it up without either a lawsuit or effective political pressure. I do have a server on another port, but that's known to be an effective barrier for a large part of the online user population. And my home machine's web stuff is flaky because I'm always trying new things out there.

(A couple of years ago, I had access through mediaone.com, now attbi.com, but I ran my search bot a couple of times, and they kicked me off for "hacking". I guess developing web search software is something that's permitted to companies but not to mere humans. ;-)

| 4. Stores index info (title, key, meter, author, etc) in the database as well
| 5. When asked, retrieves the tune and converts it to other formats.

These are correct.

| Would it be possible, or make sense, to also store a hash (like a CRC-32 or
| an MD5 signature) of the canonical tune in the database, and only show
| unique hashes on a query?

Possibly, but the chances are that this would mostly exclude the primary source of a tune and show only a mirror. And when the mirror is down, there would be no way to find the others.

| I think that would remove a large number of spurious duplicate results from
| the query results. It wouldn't remove the work of different
| transcriptionists, or variants in a tune, but it would remove duplications
| caused by multiple copies of the same transcription floating around the web.

Well, I personally like seeing the spurious duplicate results. The web can be sufficiently flaky at times that it's useful to be able to kill the download that's hung and go on to the next link. Maybe some day, when all the ISPs (and backbones) in the Net start providing good, reliable IP connectivity, this won't be important. I don't expect to live to see that day.

At present, the motion is mostly in the other direction. A current computer industry news story is that Yahoo has entered an agreement with the Chinese government to supply filtering software and support for the "Public Pledge on Self-discipline for China Internet Industry". This isn't a good sign for Net reliability. If you want to see an especially funny story about such things, try:

http://yro.slashdot.org/article.pl?sid=02/07/15/2039227&mode=thread&tid=153

When the commercial world is involved in such perversity, the most sensible thing is to encourage mirroring and redundancy in online information.
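For concreteness, here is a rough sketch of how the canonical form and the hash idea discussed above might fit together. It is not the tune finder's actual code: the function names (canonicalize, tune_hash, group_by_hash), the choice of MD5, and the (url, text) result shape are all assumptions made up for illustration. It groups identical transcriptions under one hash instead of hiding them, so every mirror stays reachable, and it needs only the hash per tune rather than a stored copy.

```python
import hashlib
from collections import defaultdict


def canonicalize(tune_text):
    """Put a tune into the canonical form described in the quoted steps."""
    # Step b: change line endings to a standard form (plain "\n").
    text = tune_text.replace("\r\n", "\n").replace("\r", "\n")
    # Step c: strip trailing white-space from every line.
    lines = [line.rstrip() for line in text.split("\n")]
    # Step a: strip leading and trailing blank lines.
    while lines and lines[0] == "":
        lines.pop(0)
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n"


def tune_hash(tune_text):
    """MD5 signature of the canonical form (hypothetical choice of hash)."""
    return hashlib.md5(canonicalize(tune_text).encode("utf-8")).hexdigest()


def group_by_hash(results):
    """Group (url, tune_text) pairs by hash, keeping every URL.

    Identical transcriptions collapse into one group, but all of
    their URLs are kept, so a dead mirror does not hide the others.
    """
    groups = defaultdict(list)
    for url, text in results:
        groups[tune_hash(text)].append(url)
    return dict(groups)
```

A query page could then show one entry per hash with all of its URLs listed under it, which cuts the clutter without discarding the primary source or the fallback links.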
