Roman Chyla wrote:
Hi,
Good to see that you are ready for resource-hungry tests!
What would be the FASTEST way to ACCESS pdf documents that we
harvested over the network?
I guess the fastest way is to take a list of files from Valkyrie and
access/process them on machines with AFS access. We can set up a small
testing facility at CERN. The alternative is to send jobs to the GRID or
clouds, so that the files are downloaded directly from arXiv.
My idea is to package my indexing into one big python egg and
distribute this 'creature' with a starting script onto Hadoop or
whatever Jan or Jukka make available.
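Concretely, that could be a minimal setup.py like this (just a sketch;
the 'indexer' package and its main() are placeholder names):

    # setup.py -- minimal egg packaging with a starting script
    from setuptools import setup, find_packages

    setup(
        name='indexer',
        version='0.1',
        packages=find_packages(),
        # 'run-indexer' becomes the starting script on installation
        entry_points={
            'console_scripts': ['run-indexer = indexer.main:main'],
        },
    )

Building the egg is then just 'python setup.py bdist_egg'.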
OK. You can also think of doing the task in two parts:
1. semantic analysis, which produces a new file stored in some FS (AFS,
HDFS, ...)
2. indexing the documents produced in the previous step
The advantage is that for the indexing step you can use a standard
tool, Katta (http://katta.sourceforge.net/), which takes care of the
distributed indexing with Hadoop and merges the index shards
automatically (including fault tolerance and load balancing).
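For step 1, even something this simple would do (a sketch; analyse() is
a placeholder for the semantic analysis, and out_dir would be a mounted
AFS/HDFS directory):

    import os

    def analyse(pdf_path):
        """Placeholder: run the semantic analysis on one PDF."""
        return ''

    def run(pdf_paths, out_dir):
        # write one analysis file per input PDF into the shared FS
        for path in pdf_paths:
            name = os.path.splitext(os.path.basename(path))[0]
            with open(os.path.join(out_dir, name + '.txt'), 'w') as out:
                out.write(analyse(path))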
The individual eggs will turn into living pythons (somewhere on the
grid, I don't care where); then, using special HTTP telepathy, they
will ask for new food,
e.g. http://some-machine.cern.ch/hungry
JSON reply:
    {"wait": 1,
     "doc": "new",
     "url": "http://some-other-machine/pdf/010546.pdf",
     "id": "arxiv:010546"}
So the python will wait one second before s/he fetches
http://some-other-machine/pdf/010546.pdf
This should continue until:
1. the webserver serving the PDFs crashes, times out, explodes, etc.,
2. or it sends {"stop": 1}
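Roughly, the worker loop would look like this (just a sketch, nothing
robust; process_pdf() stands in for the real analysis/indexing):

    import json
    import time
    import urllib.request

    MASTER = 'http://some-machine.cern.ch/hungry'

    def process_pdf(doc_id, pdf_bytes):
        """Placeholder: analyse/index one document."""
        pass

    while True:
        try:
            with urllib.request.urlopen(MASTER) as resp:
                reply = json.loads(resp.read())
        except OSError:
            break                             # webserver crashed/timed out
        if reply.get('stop'):
            break                             # master sent {"stop": 1}
        time.sleep(reply.get('wait', 0))      # honour the requested delay
        with urllib.request.urlopen(reply['url']) as resp:
            process_pdf(reply['id'], resp.read())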
This Master-Worker task dispatcher will be useful in the GRID case. For
MapReduce, however, the idea is that you pass the entire set of input
files as an argument, and all the parallelization, load balancing and
fault tolerance is done for you.
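For example, with Hadoop Streaming the mapper can be a plain Python
script that reads the input file list from stdin (a sketch;
extract_text() is a placeholder):

    # mapper.py -- Hadoop splits the input list among mappers for you
    import sys

    def extract_text(path):
        return ''   # placeholder for the per-document processing

    for line in sys.stdin:
        path = line.strip()
        if path:
            # emit <path> TAB <result>; Hadoop collects the output
            print('%s\t%s' % (path, extract_text(path)))

You would submit it with the streaming jar's -input/-output/-mapper
options and let Hadoop do the rest.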
Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME
MAKING IT ROBUST OR SOMETHING... The pythons should produce Lucene
indexes, which will be sent back and gathered somewhere (I hope that
is possible), and then we will manually merge them.
To test it, I just need one machine (a webserver) to coordinate things,
where I put my script, and some machine(s) serving the PDF files over
the network (Apache).
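The coordinating webserver itself can be a throwaway script speaking
the protocol above (a sketch; files.txt and the PDF host are
placeholders):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PDF_BASE = 'http://some-other-machine/pdf/'
    queue = [l.strip() for l in open('files.txt')]  # e.g. list from Valkyrie

    class Hungry(BaseHTTPRequestHandler):
        def do_GET(self):
            if queue:
                name = queue.pop()
                reply = {'wait': 1, 'doc': 'new',
                         'url': PDF_BASE + name + '.pdf',
                         'id': 'arxiv:' + name}
            else:
                reply = {'stop': 1}           # no more food for the pythons
            body = json.dumps(reply).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 8080), Hungry).serve_forever()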
What do you reckon?
You can start by doing basic tests at CERN and later send jobs to the
GRID or the recently established D4Science Hadoop cloud. Why don't we
meet tomorrow, before the INSPIRE Evo meeting, and discuss?
Roman
Cheers,
Jan