Hi, what would be the FASTEST way to ACCESS, over the network, the PDF documents that we harvested?
My idea is to package my indexing code into one big Python egg and distribute this 'creature', together with a start-up script, onto Hadoop or whatever Jan or Jukka make available. The individual eggs will turn into living pythons (somewhere in the grid, I don't care where), and then, using special HTTP telepathy, they will ask for new food, e.g. from http://some-machine.cern.ch/hungry, and get a JSON reply:

    {"wait": 1, "doc": "new", "url": "http://some-other-machine/pdf/010546.pdf", "id": "arxiv:010546"}

So the python will wait one second before s/he fetches http://some-other-machine/pdf/010546.pdf. This should continue until:

1. the webserver serving the PDFs crashes, times out, explodes etc., or
2. it sends {"stop": 1}

Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME MAKING IT ROBUST OR ANYTHING... The pythons should produce Lucene indexes, which will be sent back and gathered somewhere (I hope that is possible), and then we will merge them manually.

To test it, I just need one machine (a webserver) where I put my coordination script, and some machine(s) serving the PDF files over the network (Apache).

What do you reckon?

Roman
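PS: in case it helps, here is a very rough sketch of the loop each python would run. The coordinator URL and the JSON field names ('wait', 'doc', 'url', 'id', 'stop') are just the placeholders from above, nothing is fixed yet, and the actual Lucene indexing call is left out (the sketch only drops the PDF on disk):

    import json
    import time
    import urllib2

    COORDINATOR = 'http://some-machine.cern.ch/hungry'  # placeholder coordinator URL

    def run_worker():
        while True:
            try:
                reply = json.loads(urllib2.urlopen(COORDINATOR).read())
            except IOError:
                break  # webserver crashed, timed out, exploded etc. -> stop

            if reply.get('stop'):
                break  # coordinator sent {"stop": 1}

            time.sleep(reply.get('wait', 1))  # wait as instructed before fetching

            if reply.get('doc') == 'new':
                pdf = urllib2.urlopen(reply['url']).read()
                # the real worker would feed this into the Lucene indexer;
                # for the sketch we just save it under its id
                with open(reply['id'].replace(':', '_') + '.pdf', 'wb') as out:
                    out.write(pdf)

    if __name__ == '__main__':
        run_worker()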
