Roman Chyla wrote:
Hi,
Good to see that you are ready for resource-hungry tests!
What would be the FASTEST way to ACCESS pdf documents that we
harvested over the network?
I guess the fastest way is to take a list of files from Valkyrie and
access/process them on machines with AFS access. We can set up a small
testing facility at CERN. The alternative is to send jobs to the GRID or
clouds, so that the files are downloaded directly from arXiv.
My idea is to package my indexing into one big python egg and
distribute this 'creature' with a starting script onto Hadoop or
whatever Jan or Jukka make available.
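Concretely, that could be a minimal setup.py like this (just a sketch;
the 'indexer' package and its main() are placeholder names):

    # setup.py -- minimal egg packaging with a starting script
    from setuptools import setup, find_packages

    setup(
        name='indexer',
        version='0.1',
        packages=find_packages(),
        # 'run-indexer' becomes the starting script on installation
        entry_points={
            'console_scripts': ['run-indexer = indexer.main:main'],
        },
    )

Building the egg is then just 'python setup.py bdist_egg'.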
OK. You can also think of doing the task in two parts:
1. semantic analysis, which produces a new file stored in some FS (AFS,
HDFS, ...)
2. indexing the documents produced in the previous step
The advantage is that for the indexing step you can use a standard
tool, Katta (http://katta.sourceforge.net/), which takes care of the
distributed indexing with Hadoop and merges the index shards
automatically (including fault tolerance and load balancing).
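For step 1, even something this simple would do (a sketch; analyse() is
a placeholder for the semantic analysis, and out_dir would be a mounted
AFS/HDFS directory):

    import os

    def analyse(pdf_path):
        """Placeholder: run the semantic analysis on one PDF."""
        return ''

    def run(pdf_paths, out_dir):
        # write one analysis file per input PDF into the shared FS
        for path in pdf_paths:
            name = os.path.splitext(os.path.basename(path))[0]
            with open(os.path.join(out_dir, name + '.txt'), 'w') as out:
                out.write(analyse(path))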
The individual eggs will turn into living pythons (somewhere on the
grid, I don't care where); then, using special HTTP telepathy, they
will ask for new food,
e.g. http://some-machine.cern.ch/hungry
JSON reply:
    {"wait": 1,
     "doc": "new",
     "url": "http://some-other-machine/pdf/010546.pdf",
     "id": "arxiv:010546"}
So the python will wait one second before s/he fetches
http://some-other-machine/pdf/010546.pdf
This should continue until:
1. the webserver serving the PDFs crashes, times out, explodes, etc.,
2. or it sends {"stop": 1}
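Roughly, the worker loop would look like this (just a sketch, nothing
robust; process_pdf() stands in for the real analysis/indexing):

    import json
    import time
    import urllib.request

    MASTER = 'http://some-machine.cern.ch/hungry'

    def process_pdf(doc_id, pdf_bytes):
        """Placeholder: analyse/index one document."""
        pass

    while True:
        try:
            with urllib.request.urlopen(MASTER) as resp:
                reply = json.loads(resp.read())
        except OSError:
            break                             # webserver crashed/timed out
        if reply.get('stop'):
            break                             # master sent {"stop": 1}
        time.sleep(reply.get('wait', 0))      # honour the requested delay
        with urllib.request.urlopen(reply['url']) as resp:
            process_pdf(reply['id'], resp.read())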
This Master-Worker task dispatcher will be useful in the GRID case. For
MapReduce, however, the idea is that you pass the entire set of input
files as an argument, and all the parallelization, load balancing and
fault tolerance is done for you.
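For example, with Hadoop Streaming the mapper can be a plain Python
script that reads the input file list from stdin (a sketch;
extract_text() is a placeholder):

    # mapper.py -- Hadoop splits the input list among mappers for you
    import sys

    def extract_text(path):
        return ''   # placeholder for the per-document processing

    for line in sys.stdin:
        path = line.strip()
        if path:
            # emit <path> TAB <result>; Hadoop collects the output
            print('%s\t%s' % (path, extract_text(path)))

You would submit it with the streaming jar's -input/-output/-mapper
options and let Hadoop do the rest.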
Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME
MAKING IT ROBUST OR SOMETHING... The pythons should produce Lucene
indexes, which will be sent back and gathered somewhere (I hope that
is possible), and then we will manually merge them.
To test it, I just need one machine (a webserver) to coordinate things,
where I put my script, and some machine(s) serving the PDF files over
the network (Apache).
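The coordinating webserver itself can be a throwaway script speaking
the protocol above (a sketch; files.txt and the PDF host are
placeholders):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PDF_BASE = 'http://some-other-machine/pdf/'
    queue = [l.strip() for l in open('files.txt')]  # e.g. list from Valkyrie

    class Hungry(BaseHTTPRequestHandler):
        def do_GET(self):
            if queue:
                name = queue.pop()
                reply = {'wait': 1, 'doc': 'new',
                         'url': PDF_BASE + name + '.pdf',
                         'id': 'arxiv:' + name}
            else:
                reply = {'stop': 1}           # no more food for the pythons
            body = json.dumps(reply).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 8080), Hungry).serve_forever()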
What do you reckon?
You can start by doing basic tests at CERN and later send jobs to the
GRID or the recently established D4Science Hadoop cloud. Why don't we
meet tomorrow, before the INSPIRE Evo meeting, and discuss?
Roman
Cheers,
Jan