Re: [abcusers] abc repository similiar to olga.net?

2003-03-05 Thread Jack Campin
| 2. Make sure you aren't replicating something that's already been
|replicated, perhaps with mistakes or computer garblement en route.
|We don't need 105 identical versions of The Irish Washerwoman
|hiding one original take on the tune.  (An easy way for file
|providers to do this is to add an S: line giving the original
|URL if the tune is a literal copy of one from another site).
 Good idea. But figuring out how to do this right isn't easy. It's all
 too easy for a chunk of software to decide on the worst one. Having a
 human do it for 100,000 tunes would be a bit of an undertaking.

I was mainly concerned with the case where they're literally identical -
the many occasions where a tune has been copied unchanged from one site
to another.  That's a well and truly solved problem (it's what DNA-
matching software does).  If one is worse than another they can't be
the same; I wasn't proposing any sort of control on *musical* quality,
or even syntactical correctness: simply trying to make things easier
for the user who has many copies to choose among.
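
A minimal sketch of the exact-copy check, not anything the Tune Finder is known to do. It assumes the tunes are already available as a mapping of URL to ABC text (the names `tune_digest` and `group_identical` are made up for illustration), and it treats copies that differ only in line terminators as identical, which is an assumption on my part.

```python
import hashlib
from collections import defaultdict


def tune_digest(abc_text: str) -> str:
    """Hash a tune after making line terminators uniform, so copies that
    differ only in LF vs CRLF vs CR still count as identical."""
    canonical = "\n".join(abc_text.splitlines()).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_identical(tunes: dict[str, str]) -> list[list[str]]:
    """Take a mapping of URL -> ABC text; return lists of URLs whose
    text is identical (groups of one are left out)."""
    by_digest = defaultdict(list)
    for url, text in tunes.items():
        by_digest[tune_digest(text)].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]
```

No judgment of musical quality is involved: two files either hash the same or they don't.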


| 3. Provide a human contact for every file (you'll have this anyway
|if you've asked permission) - lots of ABC files raise questions,
|and the TuneFinder interface provides no way of getting answers
|to them, as what you get doesn't even have a URL included.
 My tune finder in fact does insert the URL and date if you ask for
 a tune  in TXT or ABC form.  It uses the F:  header line.

But not when you download an entire file (which is what I always do).


 Doing this turned out to be  trickier  than  one  might  expect.   The
 problem  was  the variety of line terminators.  Just inserting the F:
 line with an ANSI standard line terminator doesn't  work,  because  a
 lot  of  software  can't  handle  files  with  mixed  styles  of line
 terminators. I eventually found by experiment and a bit of email with
 people  who  had  problems  that  the  solution  was to strip out the
 terminators and make them all the same. It doesn't matter whether you
 use  \n or \r\n as long as they're all the same.

The reason for this, at least with the Mac stuff I know about, is that
most conversion utilities look at the first line in the file and try
to guess what convention it's using from that.  If the first line is
different from all the others this is maximally confusing for the poor
wee thing.
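
The "make them all the same" fix can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, and putting the F: line straight after the first T: line is my assumption, since the thread doesn't say where the tune finder inserts it.

```python
import re


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Rewrite every CRLF, lone CR, or lone LF as the one chosen terminator."""
    return re.sub(r"\r\n|\r|\n", lambda m: newline, text)


def add_f_line(tune: str, url: str, newline: str = "\n") -> str:
    """Insert an F: line after the first T: line (or at the top if there
    is none), then make all terminators the same."""
    lines = tune.splitlines()
    pos = next((i + 1 for i, ln in enumerate(lines) if ln.startswith("T:")), 0)
    lines.insert(pos, "F:" + url)
    return normalize_newlines("\n".join(lines) + "\n", newline)
```

Because the whole file is rewritten with one terminator, a utility that guesses the convention from the first line will guess right for every line.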

What do you do with a file in EBCDIC? - that makes these variations
look rather trivial.  Somebody must have ported abc2ps to IBM MVS or
ICL VME, surely?

Somebody sometime ought to figure out what has usually gone wrong
with all those sites where (at least as viewed from a Mac) all the
ABC is double-spaced.  I suspect somebody simply used the wrong flag
on an email or ftp client that does conversion on the fly, and that
the problem is quite easy to avoid no matter what OS and software
you've got.


 This URL doesn't directly give you an email address. [...] If a site's
 owner wants to remain incommunicado, it's fairly easy to do.

The Tune Finder can't do much about this, but a planned mirroring site
can.  There is no obligation to mirror stuff from people who want to
make life difficult for their readers and who believe they're such
important celebrities that nobody should be informed who they really
are.  In any case, if that's the way somebody thinks, they can make
their lawyer's office the contact.  (An email address might not be the
right sort of contact, and whatever is provided - ICQ number, mobile
phone number, EBay seller id - its validity  needs to be checked every
so often).

I wonder if we could do something like EBay ratings for tune providers?
A feedback message board, even?  (Henrik's Irish Washerwoman really
does wash whiter...)

The one concern I would have about having my own stuff mirrored is that
I'd want the mirror to encourage people to look at my own site too; to
a certain extent the ABC files I have available are advertising and I'd
like them to function that way.  Other people might want the opposite -
Demon's server can handle all the hits anyone's likely to throw my way
and I've set things up for maximum simplicity, but somebody whose primary
server has a wet-piece-of-string connection or a mega-inconvenient user
interface might want to offload the work onto Toby's machines.

=== http://www.purr.demon.co.uk/jack/ ===


To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html


Re: [abcusers] abc repository similiar to olga.net?

2003-03-04 Thread Jack Campin
 Has anyone thought of compiling a centralized database of abc tunes
 similar to olga.net.. I find that resource incredibly useful.
 Basically something like John's tune finder, except that it saves
 everything to a local database.
 I would be willing to donate computing power & storage space to such
 a project.

This could be a good idea, as your site is considerably more reliable
than MIT.  But a few things need to be done to ensure data quality:

1. Make sure the copy is up-to-date.  Most ABC files on the web don't
   change but some of the most interesting ones do.

2. Make sure you aren't replicating something that's already been
   replicated, perhaps with mistakes or computer garblement en route.
   We don't need 105 identical versions of The Irish Washerwoman
   hiding one original take on the tune.  (An easy way for file
   providers to do this is to add an S: line giving the original
   URL if the tune is a literal copy of one from another site).
   
3. Provide a human contact for every file (you'll have this anyway
   if you've asked permission) - lots of ABC files raise questions,
   and the TuneFinder interface provides no way of getting answers
   to them, as what you get doesn't even have a URL included.
   Also, as you're going to be writing a hell of a lot of "I am not
   responsible for that content" messages to people who fire queries
   at you, it would seem to be simple self-preservation to be able
   to name somebody who *is* responsible for it.

None of this should be difficult to arrange for files that are being
actively maintained by a live human.

=== http://www.purr.demon.co.uk/jack/ ===




Re: [abcusers] abc repository similiar to olga.net?

2003-03-04 Thread Toby Rider

 Has anyone thought of compiling a centralized database of abc tunes
 similar to olga.net.. I find that resource incredibly useful.
 Basically something like John's tune finder, except that it saves
 everything to a local database.
 I would be willing to donate computing power & storage space to such a
 project.

 This could be a good idea, as your site is considerably more reliable
 than MIT.  But a few things need to be done to ensure data quality:


 Wow, thanks Jack.. That's a huge compliment, considering my entire
professional career is based on trying to make computers reliable.. That
means a lot to me. Even though those machines are my personal ones, I try
to apply the same level of care to them as the machines that I earn my pay
with.






Re: [abcusers] abc repository similiar to olga.net?

2003-03-04 Thread Steve Mansfield
Has anyone thought of compiling a centralized database of abc tunes
similar to olga.net..
/discussion

But I've still not heard anything that makes me think that this sort of 
centralised abc database has any advantage or real purpose.

JC's tune finder is a (wonderful wonderful wonderful) tool that gives 
quick search and retrieval access to any abc tune that it knows about, 
whether it be in Richard's tune book or Henrik's files or my files or 
wherever. So long as JC's tune finder knows about the file, there is 
immediate access.

If (naming two of the major collection maintainers as prime examples) 
Richard or Henrik suddenly threw a major hissy fit and completely 
removed their tune resources from the web, or suffered a catastrophic 
life or data event that wiped out the collections, we'd all be the 
poorer, and in that situation we'd be grateful that someone had taken a 
mirror before said event - but that isn't what we're talking about - is 
it?

--
Steve Mansfield
[EMAIL PROTECTED]
http://www.lesession.co.uk - abc music notation tutorial,
  the uk.music.folk newsgroup FAQ, and other goodies




Re: [abcusers] abc repository similiar to olga.net?

2003-03-04 Thread Laura Conrad
 Steve == Steve Mansfield [EMAIL PROTECTED] writes:

 Has anyone thought of compiling a centralized database of abc tunes
 similar to olga.net..

Steve /discussion

Steve But I've still not heard anything that makes me think that
Steve this sort of centralised abc database has any advantage or
Steve real purpose.

I think if all it does is mirror (or worse, copy) stuff that's already
on the net, it doesn't.

What I think might be of more value is something that combines a
mirror, or even just indexing and pointer like John's tune finder,
with something that allows submission of tunes that aren't on the net
yet.  There must be lots of people writing ABC who don't have a
website but would like to share their work.

http://www.cpdl.org would be a good model for this.  When Rafael Ornes
first approached me about having my stuff included, he was thinking in
terms of keeping copies on his site, and I told him I wasn't
comfortable with that, for the reasons other posters in this thread
have cited -- sometimes I make corrections or improvements, and I
don't want the uncorrected version lying around after that.

I imagine other people made the same objections, and now, the central
feature of the site is the database and searching facility, and
providing a link to be put in the database is the preferred way of
contributing, but I believe that you can also send your work to Rafael
and have it both in the database and on the cpdl website.

-- 
Laura (mailto:[EMAIL PROTECTED] , http://www.laymusic.org/ )
(617) 661-8097  fax: (801) 365-6574 
233 Broadway, Cambridge, MA 02139




Re: [abcusers] abc repository similiar to olga.net?

2003-03-04 Thread Christian M. Cepel




Have you any experience with counter-attacking methods called 'tarpitting'
or 'quicksanding'... I don't recall which. I read a blurb about it a while
ago. Specifically, intentionally timing out requests from snakes/spiders/etc
to bog their machine down to the point that they sit up and take notice and
possibly act more responsibly? Increasing their costs for such actions to
a point so high as to make them non-feasible in future.

//Christian

Btw.. I appreciate the below. There's a lot of info and jumping-off points
that I will be incorporating in future.


John Chambers wrote:

  Christian M. Cepel asks:
| Does John's script obey robot exclusions?  I'm ready to kill Altavista
| for spidering my _javascript_ validated forms, submitting them empty, and
| completely ignoring robot exclusions.

Yes, it does.  The first thing it does for each site is ask  for  the
robots.txt  file, and stays away from directories that have a general
exclusion.  The only exceptions are when someone specifically asks to
have  their  music  scanned,  and  then  their  directory  becomes an
"exception to the exclusion". I think this has only happened once.

I also have a significant tune  collection  (partly  from  extracting
tunes  from  lists  like  this one).  I was given write access to the
robots.txt file on the machine a couple of years ago, and it excludes
most  of  my  music stuff.  I've found that the big search sites just
aren't very good for finding music.  And then I have to list  my  own
directories  as  exceptions  to the robots.txt rules, as mentioned in
the previous paragraph.

OTOH, if I had a collection of abc songs with  lyrics,  I'd  probably
want  that  searched  by  the  big  guys.  They're all pretty good at
finding lyrics.

I know what you mean about the forms.  And there's a similar  problem
with  cgi  scripts.   Maybe  two  years  ago, I started reading about
research into searching for "hidden pages" on the web that  can  only
be  found via forms and scripts.  My reaction to this was "Uh-oh; I'd
better watch for this."  About a year ago they  hit.   Several  search
sites   started   invoking   my  lookup  script  systematically  with
random-looking arguments, and when they got  a  reply  with  a  form,
started exploring the links.  They were, in effect, attempting to get
every abc tune on the web in every format that my scripts know how to
return.   One  of  them  hit  our server simultaneously from about 30
different addresses, and had over 100 tune conversions outstanding. It
brought the server to a screeching halt.

I got enough cpu time  to  add  a  "blacklist"  to  my  scripts,  and
whenever  I  see symptoms of this, I add their address (or subnet) to
the blacklist.  And I added a small (5 sec) minimum between  requests
from  the  same  address.   Both  of  these can be a hassle to people
working from behind a firewall, since what  my  scripts  see  is  the
firewall's  address, and all users behind it look like a single user.
But such things are  necessary  when  there  are  misbehaving  search
monsters out there.

One of the side effects of this is that I no longer tell  the  mailer
here  to  forward my email to my home machine.  I log in and read the
email here.  This means that I'm logged in several times during  most
days.  This is so that I can keep a constant watch for attacks on the
web server.  Most of these are probably not malicious; they are  more
likely from novice searchers.  But it's a good idea to spot them fast
and install defenses against the new ones.

My search program also has a sort of "reverse blacklist". In its list
of starting URLs, I can include URLs or hosts that are to be avoided.
I've mentioned this on lists that I subscribe to, with the idea  that
someone might not want their tunes indexed. So far I haven't actually
had anyone say they want to be avoided, but it's  a  possibility.   I
mostly  use  this  as a way to keep the search program away from some
sites that are known sinkholes of time with no abc tunes.  There  are
some  sites  that  have pages with millions of links, and such things
are best ignored.

Another thing I have my searcher do is ignore any URL with "cgi" as a
token, i.e., with non-letters on both sides. This is fairly effective
at preventing the invocation of scripts without arguments, and that's
almost  always a pure waste of time.  I've also been thinking of
excluding things like "php", but so far that hasn't been necessary.

You can learn a lot of weird stuff when you try writing a web  search
program ...

  


-- 

Christian Marcus Cepel  ("`-''-/").___..--''"`-._
[EMAIL PROTECTED] icq:12384980 `6_ 6  )   `-.  ( ).`-.__.`)
371 CrownPoint Columbia 65203-2202   (_Y_.)'  ._   )  `._ `. ``-..-'
w573.882.8309 h443.8676 m268.7533  _..`--'_..-_/  /--'_.' ,'
Computer Support Specialist, Sr.  (il),-''  (li),'  ((!.-'
School of Information 

[abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Toby Rider
 Has anyone thought of compiling a centralized database of abc tunes
similar to olga.net.. I find that resource incredibly useful.
 Basically something like John's tune finder, except that it saves
everything to a local database.
 I would be willing to donate computing power & storage space to such a
project.

Toby




-- 
Toby Rider ([EMAIL PROTECTED])

- Some of those parts were totally rubbish, because when you think you're
playing well and you're drunk, you're actually playing like an idiot. -
Robert Smith


Toby Rider's Understated Homepage: http://www.blackmill.net/toby_rider/




Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Christian M. Cepel




I know there are many out there.  I'm fond of http://www.thesession.org/


Toby Rider wrote:

   Has anyone thought of compiling a centralized database of abc tunes
similar to olga.net.. I find that resource incredibly useful.
 Basically something like John's tune finder, except that it saves
everything to a local database.
 I would be willing to donate computing power & storage space to such a
project.

Toby




  


-- 

Christian Marcus Cepel  ("`-''-/").___..--''"`-._
[EMAIL PROTECTED] icq:12384980 `6_ 6  )   `-.  ( ).`-.__.`)
371 CrownPoint Columbia 65203-2202   (_Y_.)'  ._   )  `._ `. ``-..-'
w573.882.8309 h443.8676 m268.7533  _..`--'_..-_/  /--'_.' ,'
Computer Support Specialist, Sr.  (il),-''  (li),'  ((!.-'
School of Information Science & Learning Technologies, College of Ed,
University of Missouri - Columbia * And the wrens have returned & are
nesting *In the hollow of that oak where his heart once had been *And
he lifts his arms in a blessing *For being born again. --Rich Mullins







Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread John Chambers
Toby writes:
|  Has anyone thought of compiling a centralized database of abc tunes
| similar to olga.net.. I find that resource incredibly useful.
|  Basically something like John's tune finder, except that it saves
| everything to a local database.
|  I would be willing to donate computing power & storage space to such a
| project.

Well, I've considered it. ;-)

Storage is one reason for not doing it.

Another is the question of whether (or which) sites' owners
would  agree  to being mirrored this way.  I'd guess that a
lot would, but a few would object to the idea.  I  wouldn't
want  to  collect  everyone  else's  tunes this way without
their permission.

Of course, google does this sort of thing, and  the  google
cache  is often very useful.  But I'd think we'd want a bit
of a public discussion before caching other people's  tunes
like this.




Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Toby Rider
Ah, but www.thesession.org requires people to submit their tunes to it.
Something that combined John's indexing approach, along with a
comprehensive database for the abc's of the tunes, would be incredibly
sweet..




 I know there are many out there.   I'm fond of
 http://www.thesession.org/




-- 
Toby Rider ([EMAIL PROTECTED])

- Some of those parts were totally rubbish, because when you think you're
playing well when you're drunk, you're actually playing like an idiot. -
Robert Smith


Toby Rider's Understated Homepage: http://www.blackmill.net/toby_rider/




Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Christian M. Cepel




I'm confused. How is that different from Olga? Olga is made up entirely
of submitted ascii files, and ones that were pasted in the newsgroup for
examination/distribution. I understand that you're looking for something
different entirely... but it does seem that thesession.org matches olga in
this respect at least. Or am I missing something?

Toby Rider wrote:

  Ah, but www.thesession.org requires people to submit their tunes to it.
something that combined John's indexing approach, along with a
comprehensive database for the abc's of the tunes, would be incredibly
sweet..




  
  

  
  

  


-- 

Christian Marcus Cepel  ("`-''-/").___..--''"`-._
[EMAIL PROTECTED] icq:12384980 `6_ 6  )   `-.  ( ).`-.__.`)
371 CrownPoint Columbia 65203-2202   (_Y_.)'  ._   )  `._ `. ``-..-'
w573.882.8309 h443.8676 m268.7533  _..`--'_..-_/  /--'_.' ,'
Computer Support Specialist, Sr.  (il),-''  (li),'  ((!.-'
School of Information Science & Learning Technologies, College of Ed,
University of Missouri - Columbia * And the wrens have returned & are
nesting *In the hollow of that oak where his heart once had been *And
he lifts his arms in a blessing *For being born again. --Rich Mullins







Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Tom Keays
There is Richard Moon's TuneDB http://tunedb.woodenflute.com/ which has
several thousand tunes in it.  It allows searching by name or abc fragment.
Very cool.

on 3/3/03 3:33 PM, Toby Rider wrote:

 Has anyone thought of compiling a centralized database of abc tunes
 similar to olga.net.. I find that resource incredibly useful.



Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread Christian M. Cepel
Does John's script obey robot exclusions?  I'm ready to kill Altavista 
for spidering my javascript validated forms, submitting them empty, and 
completely ignoring robot exclusions.

I see the difference now.. Thanks for explaining.

//Christian

Toby Rider wrote:

Yes, thesession.org does exactly what Olga does.. However combining the
indexing approach of the abc tune finder, along with a centralized
database like Olga, or thesession.org, would be even better.. The only
issue is permission.. Someone would have to contact every site with abc
tunes that we would possibly want to query for tunes and get permission.
 John is running a copy of the tune finder on one of my machines and I
periodically get emails asking why one of my IP addresses is spidering
their site.. I tell them what it's up to, and they are usually cool about
it.
Toby



 

   



 

--

Christian Marcus Cepel  (`-''-/).___..--''`-._
[EMAIL PROTECTED] icq:12384980 `6_ 6  )   `-.  ( ).`-.__.`)
371 CrownPoint Columbia 65203-2202   (_Y_.)'  ._   )  `._ `. ``-..-'
w573.882.8309 h443.8676 m268.7533  _..`--'_..-_/  /--'_.' ,'
Computer Support Specialist, Sr.  (il),-''  (li),'  ((!.-'
School of Information Science & Learning Technologies, College of Ed,
University of Missouri - Columbia * And the wrens have returned & are
nesting *In the hollow of that oak where his heart once had been *And
he lifts his arms in a blessing *For being born again. --Rich Mullins




Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread John Chambers
Christian M. Cepel asks:
| Does John's script obey robot exclusions?  I'm ready to kill Altavista
| for spidering my javascript validated forms, submitting them empty, and
| completely ignoring robot exclusions.

Yes, it does.  The first thing it does for each site is ask  for  the
robots.txt  file, and stays away from directories that have a general
exclusion.  The only exceptions are when someone specifically asks to
have  their  music  scanned,  and  then  their  directory  becomes an
"exception to the exclusion". I think this has only happened once.
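
For readers who want to see the shape of such a check, here is a rough sketch using Python's standard `urllib.robotparser`. It is not the tune finder's Perl code. The agent name `abc-tune-bot`, the function `may_fetch`, and the idea of storing the "exceptions" as URL prefixes are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, robots_lines, agent="abc-tune-bot", exceptions=()):
    """True if robots.txt allows the URL, or if the URL falls under a
    directory whose owner explicitly asked to have it scanned."""
    # Owner-requested exceptions override a general exclusion.
    if any(url.startswith(prefix) for prefix in exceptions):
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # lines of an already-fetched robots.txt
    return parser.can_fetch(agent, url)
```

In real use the lines would come from fetching /robots.txt once per site before anything else is requested.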

I also have a significant tune  collection  (partly  from  extracting
tunes  from  lists  like  this one).  I was given write access to the
robots.txt file on the machine a couple of years ago, and it excludes
most  of  my  music stuff.  I've found that the big search sites just
aren't very good for finding music.  And then I have to list  my  own
directories  as  exceptions  to the robots.txt rules, as mentioned in
the previous paragraph.

OTOH, if I had a collection of abc songs with  lyrics,  I'd  probably
want  that  searched  by  the  big  guys.  They're all pretty good at
finding lyrics.

I know what you mean about the forms.  And there's a similar  problem
with  cgi  scripts.   Maybe  two  years  ago, I started reading about
research into searching for "hidden pages" on the web that  can  only
be  found via forms and scripts.  My reaction to this was "Uh-oh; I'd
better watch for this."  About a year ago they  hit.   Several  search
sites   started   invoking   my  lookup  script  systematically  with
random-looking arguments, and when they got  a  reply  with  a  form,
started exploring the links.  They were, in effect, attempting to get
every abc tune on the web in every format that my scripts know how to
return.   One  of  them  hit  our server simultaneously from about 30
different addresses, and had over 100 tune conversions outstanding. It
brought the server to a screeching halt.

I got enough cpu time  to  add  a  "blacklist"  to  my  scripts,  and
whenever  I  see symptoms of this, I add their address (or subnet) to
the blacklist.  And I added a small (5 sec) minimum between  requests
from  the  same  address.   Both  of  these can be a hassle to people
working from behind a firewall, since what  my  scripts  see  is  the
firewall's  address, and all users behind it look like a single user.
But such things are  necessary  when  there  are  misbehaving  search
monsters out there.
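
A sketch of the two defenses just described: a blacklist of addresses or subnets, plus a minimum gap between requests from one address. The class name `RequestGate` and its interface are hypothetical, and the real scripts are Perl CGI, not Python. The caller passes in the clock value so the logic is easy to check.

```python
import ipaddress


class RequestGate:
    """Refuse blacklisted addresses and requests that arrive too soon
    after the previous accepted request from the same address."""

    def __init__(self, blacklist=(), min_interval=5.0):
        # Each entry may be a single address or a subnet such as 192.0.2.0/24.
        self.blocked = [ipaddress.ip_network(entry) for entry in blacklist]
        self.min_interval = min_interval
        self.last_seen = {}  # address -> time of last accepted request

    def allow(self, address, now):
        ip = ipaddress.ip_address(address)
        if any(ip in net for net in self.blocked):
            return False
        previous = self.last_seen.get(address)
        if previous is not None and now - previous < self.min_interval:
            return False
        self.last_seen[address] = now
        return True
```

The firewall problem is visible here: every user behind one firewall shares a single key in `last_seen`, so they throttle each other.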

One of the side effects of this is that I no longer tell  the  mailer
here  to  forward my email to my home machine.  I log in and read the
email here.  This means that I'm logged in several times during  most
days.  This is so that I can keep a constant watch for attacks on the
web server.  Most of these are probably not malicious; they are  more
likely from novice searchers.  But it's a good idea to spot them fast
and install defenses against the new ones.

My search program also has a sort of "reverse blacklist". In its list
of starting URLs, I can include URLs or hosts that are to be avoided.
I've mentioned this on lists that I subscribe to, with the idea  that
someone might not want their tunes indexed. So far I haven't actually
had anyone say they want to be avoided, but it's  a  possibility.   I
mostly  use  this  as a way to keep the search program away from some
sites that are known sinkholes of time with no abc tunes.  There  are
some  sites  that  have pages with millions of links, and such things
are best ignored.
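
One way such an avoid list could sit alongside the starting URLs, as an illustration only. The leading "-" marker for entries to avoid is a made-up syntax, as are the function names; the message doesn't say how the real list is written.

```python
from urllib.parse import urlparse


def load_start_list(lines):
    """Split a start list into URLs to crawl and entries to avoid.
    A leading '-' marks an entry to avoid (hypothetical syntax)."""
    start, avoid = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("-"):
            avoid.append(line.lstrip("-").strip())
        else:
            start.append(line)
    return start, avoid


def is_avoided(url, avoid):
    """True if the URL starts with an avoided URL or lives on an avoided host."""
    host = urlparse(url).hostname or ""
    return any(url.startswith(entry) or host == entry for entry in avoid)
```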

Another thing I have my searcher do is ignore any URL with "cgi" as a
token, i.e., with non-letters on both sides. This is fairly effective
at preventing the invocation of scripts without arguments, and that's
almost  always a pure waste of time.  I've also been thinking of
excluding things like "php", but so far that hasn't been necessary.
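
The "cgi as a token" test can be approximated with a regular expression. This sketch assumes that the start or end of the URL counts as a non-letter and that case is ignored; neither detail is stated in the message, and the name `looks_like_script` is invented here.

```python
import re

# "cgi" with no letter immediately before or after it.
CGI_TOKEN = re.compile(r"(?<![A-Za-z])cgi(?![A-Za-z])", re.IGNORECASE)


def looks_like_script(url: str) -> bool:
    """True if the URL contains 'cgi' as a stand-alone token."""
    return CGI_TOKEN.search(url) is not None
```

So /cgi-bin/find and lookup.cgi are skipped, while a URL that merely contains the letters c-g-i inside a longer word is still followed.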

You can learn a lot of weird stuff when you try writing a web  search
program ...



Re: [abcusers] abc repository similiar to olga.net?

2003-03-03 Thread John Chambers
Toby asks:
| Good question.. John, do you have an answer?

I wrote about that before seeing this message.

| On a similar note (no pun intended), I'm actually quite impressed at how
| efficient John's program is.. He's really quite a hand at Perl.. Perl
| programs are notoriously CPU hungry.. John's program runs really tight..
| That machine also serves up about 10 moderate traffic websites, runs lpd
| for a couple printers, and has the Thunderstone search engine periodically
| cranking away.. I never even notice John's program running away in the
| background..

An interesting aspect to the perl story is that its  performance  in
many  cases  is competitive with even fairly good C code.  There have
been a number of reports of people who decide to rewrite an important
perl  program  in C, and find that the C version is slower.  The perl
gang has learned some good tricks, and unless you know  a  lot  about
what  you're doing, you'll have trouble matching what they've learned
over the years.

The main reason that a perl program  can  gobble  cpu  is  that  some
things  are  very  easy  in  perl  that  are  difficult in most other
languages.   The  language  includes  symbol-table   lookups   in   a
deceptively  simple  form,  as  a  kind of array that takes character
strings as a subscript.  It's so easy to use  that  perl  programmers
learn  to use it for everything.  Anyone who has ever written a table
lookup routine knows how much cpu  time  it  takes.   In  most  other
languages,  a symbol table is a big hairy deal that you use only as a
last resort.  In perl, you use them because it's easy.   And  if  you
don't  understand the implications, you can end up with a very greedy
little program. If you understand, it's just another very handy tool.
I  use  tables  a  lot,  but  I'm  always aware that that very simple
indexing operation is expensive. But the perl interpreter has some of
the most sophisticated table-handling routines known.  Unless you're a
real expert, you aren't going to improve on them.

Perl can also gobble memory.  One of the features of the language  is
the  ability  to "slurp up" (a technical term) an entire file into an
array of strings.  It only takes a few characters of punctuation:
   @data = <FILE>;
This reads the entire contents of FILE into the data array. It's fast
and  easy,  and  there  are  a lot of things that will operate on the
entire array.  Then the command
   @data = ();
frees the space.  This is a powerful part of perl.  But if you aren't
aware  of  what it does, it can produce a monster program.  My search
bot doesn't do this.  In fact, it uses fixed-length reads,  to  avoid
the  problems  of web sites like Mac sites that don't have line feeds
within their pages.
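
To illustrate the fixed-length-read idea (in Python here, not the bot's actual Perl), the sketch below pulls a stream in fixed-size chunks and splits lines on CR, LF, or CRLF, so a page that uses only carriage returns is not swallowed as one giant line. The function name and chunk size are assumptions for the example.

```python
import re

TERMINATOR = re.compile(rb"\r\n|\r|\n")


def iter_lines_fixed(stream, chunk_size=4096):
    """Yield lines from a binary stream using fixed-length reads,
    accepting CR, LF, or CRLF as the line terminator."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # A trailing CR may be the first half of a CRLF pair split
        # across two reads, so hold it back until the next chunk.
        held = b"\r" if pending.endswith(b"\r") else b""
        body = pending[: len(pending) - len(held)]
        parts = TERMINATOR.split(body)
        pending = parts.pop() + held
        yield from parts
    if pending:
        parts = TERMINATOR.split(pending)
        if parts[-1] == b"":
            parts.pop()  # the data ended with a terminator
        yield from parts
```

Memory use stays near the chunk size plus the longest single line, rather than the size of the whole file.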

| Of course having dual CPUs on there and a lot of RAM helps :-)

Yes, and my code is single-threaded, so it shouldn't  ever  use  more
than one cpu. It spends most of its time waiting for a TCP connection
to go through.  This typically takes longer than reading the data.  A
web  search  program  that makes only one connection at a time really
can't use much cpu time.  Most of its time will be spent  waiting  on
network events.

OTOH, I've been contemplating stuffing some info into a database ...
