Hi, what would be the FASTEST way to ACCESS, over the network, the PDF documents that we harvested?
My idea is to package my indexing code into one big Python egg and distribute this 'creature', together with a start-up script, onto Hadoop or whatever Jan or Jukka make available. The individual eggs will turn into living pythons (somewhere in the grid, I don't care where), and then, using special HTTP telepathy, they will ask for new food, e.g. from http://some-machine.cern.ch/hungry, and get a JSON reply:

    {"wait": 1, "doc": "new", "url": "http://some-other-machine/pdf/010546.pdf", "id": "arxiv:010546"}

So the python will wait one second before s/he fetches http://some-other-machine/pdf/010546.pdf. This should continue until:

1. the webserver serving the PDFs crashes, times out, explodes etc., or
2. it sends {"stop": 1}

Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME MAKING IT ROBUST OR ANYTHING... The pythons should produce Lucene indexes, which will be sent back and gathered somewhere (I hope that is possible), and then we will merge them manually.

To test it, I just need one machine (a webserver) where I put my coordination script, and some machine(s) serving the PDF files over the network (Apache).

What do you reckon?

Roman
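PS: in case it helps, here is a very rough sketch of the loop each python would run. The coordinator URL and the JSON field names ('wait', 'doc', 'url', 'id', 'stop') are just the placeholders from above, nothing is fixed yet, and the actual Lucene indexing call is left out (the sketch only drops the PDF on disk):

    import json
    import time
    import urllib2

    COORDINATOR = 'http://some-machine.cern.ch/hungry'  # placeholder coordinator URL

    def run_worker():
        while True:
            try:
                reply = json.loads(urllib2.urlopen(COORDINATOR).read())
            except IOError:
                break  # webserver crashed, timed out, exploded etc. -> stop

            if reply.get('stop'):
                break  # coordinator sent {"stop": 1}

            time.sleep(reply.get('wait', 1))  # wait as instructed before fetching

            if reply.get('doc') == 'new':
                pdf = urllib2.urlopen(reply['url']).read()
                # the real worker would feed this into the Lucene indexer;
                # for the sketch we just save it under its id
                with open(reply['id'].replace(':', '_') + '.pdf', 'wb') as out:
                    out.write(pdf)

    if __name__ == '__main__':
        run_worker()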
