Hi Roman,
See my comments below:
Roman Chyla wrote:
What would be the FASTEST way to ACCESS pdf documents that we
harvested over the network?
I guess the fastest way is to take a list of files from Valkyrie and
access/process them on machines with AFS access. We can set up a small
testing facility at CERN. The alternative is to send jobs to the GRID or
clouds, so that the files are downloaded directly from arXiv.
I think arXiv harvesting is not an option for the moment.
My idea is to package my indexing into one big Python egg and
distribute this 'creature' with a starting script onto Hadoop or
whatever Jan or Jukka make available.
OK. You can also think of doing the task in two parts:
1. semantic analysis which produces a new file stored in some FS (afs,
hdfs...)
2. indexing the documents produced in the previous step
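Roughly, I mean something like this - just a sketch to make the split
concrete; the shared directory and the two helper functions are made-up
placeholders, not anything from our setup:

# Phase 1 runs the semantic analysis and writes one plain-text file per
# PDF into a shared filesystem (an AFS/HDFS mount); phase 2 is a separate
# job that walks that directory and feeds the files to whatever indexer
# we end up with.  SHARED_DIR and the helpers are placeholders.
import os

SHARED_DIR = "/shared/extracted"          # placeholder path on AFS/HDFS

def analyse(pdf_path):
    """Placeholder for the semantic analysis of one PDF."""
    return "extracted text of " + pdf_path

def phase1(pdf_paths):
    for pdf in pdf_paths:
        out = os.path.join(SHARED_DIR, os.path.basename(pdf) + ".txt")
        with open(out, "w") as f:
            f.write(analyse(pdf))

def phase2(index_document):
    # index_document(name, text) is the Lucene-backed indexing call
    for name in os.listdir(SHARED_DIR):
        with open(os.path.join(SHARED_DIR, name)) as f:
            index_document(name, f.read())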
This dirty test should just show whether we are able to process the
files in some reasonable time (and it will roughly show the indexing
speed as well - for the moment, splitting into two phases seems like a
complication to me).
The advantage is that for the indexing step you can use a standard
tool - Katta (http://katta.sourceforge.net/) - which will take care of
the distributed indexing with Hadoop and merge the index shards
automatically (including fault tolerance and load-balancing issues).
But I actually need Python to be involved in the indexing, so Katta
is not an option for me - unless it can merge the indexes for us (I
haven't studied the details) - the indexing must be controlled by Python.
The individual eggs will turn into living pythons (somewhere in the
grid, I don't care where); then, using special HTTP telepathy, they ask
for new food,
e.g.: http://some-machine.cern.ch/hungry
JSON reply:
{ "wait": 1,
  "doc": "new",
  "url": "http://some-other-machine/pdf/010546.pdf",
  "id": "arxiv:010546" }
So the python will wait one second before s/he fetches
http://some-other-machine/pdf/010546.pdf
This should continue until:
1. the webserver serving the PDFs crashes, times out, explodes, etc.
2. or it sends {"stop": 1}
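The loop each python would run is roughly this - only a sketch assuming
the /hungry reply above; the coordinator URL, the download directory and
the process_pdf() call are placeholders:

import json
import os
import time
import urllib.request

COORDINATOR = "http://some-machine.cern.ch/hungry"   # the /hungry URL
DOWNLOAD_DIR = "/tmp/pdf-queue"                      # placeholder

def run_worker():
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    while True:
        # ask the coordinator what to chew on next
        reply = json.loads(urllib.request.urlopen(COORDINATOR).read())
        if reply.get("stop"):
            break                                    # we are told to stop
        time.sleep(reply.get("wait", 0))             # wait as instructed
        name = reply["id"].replace(":", "_").replace("/", "_") + ".pdf"
        local_path = os.path.join(DOWNLOAD_DIR, name)
        urllib.request.urlretrieve(reply["url"], local_path)
        # process_pdf(reply["id"], local_path)       # hypothetical indexing call

if __name__ == "__main__":
    run_worker()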
This Master-Worker task dispatcher will be useful in the GRID case. For
MapReduce, however, the idea is that you pass the entire set of input
files as an argument, and all the parallelization, load balancing and
fault tolerance are done for you.
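For example, with Hadoop Streaming the 'argument' is simply a text file
with one PDF path or URL per line; Hadoop splits those lines over the
workers and each mapper reads its share from stdin. A rough sketch of such
a mapper (process_pdf() is just a stand-in for your analysis/indexing code):

#!/usr/bin/env python
import sys

def process_pdf(path):
    # stand-in for the real semantic analysis / indexing of one document
    return "ok"

for line in sys.stdin:
    path = line.strip()                  # one PDF path or URL per input line
    if not path:
        continue
    # emit a key/value pair so Hadoop can collect the per-document status
    print("%s\t%s" % (path, process_pdf(path)))

The job is then submitted with the streaming jar, pointing -input at the
file list and -mapper at this script; Hadoop takes care of splitting the
list and restarting failed tasks.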
So, if I get it right, we are going to use Hadoop for now?
1. I prepare Linux executables
2. somehow partition the files to be indexed into groups (I need to
know how many machines/groups to prepare?)
3. send the jobs (I don't know how to send them?)
I think that for your jobs almost any parallelization can be used.
They certainly fit both the GRID and the Hadoop model well. However,
the GRID won't have good access to the mirrored arXiv files. Also, in
the MapReduce case, step 2 is done automatically and, as I still hope,
later the index merging can be as well.
For the moment I still don't know if I will be able to freeze
everything into a Linux binary (that would be the simplest option for
running) - if not, it will be Python packages.
I am not going to send the input files, right? I don't want to deal
with packaging and shuffling gigabytes over the network. I can send a
list of files, correct?
Sure.
...the last time I tried accessing the arXiv docs, I had to log in to
some machine, so the Hadoop machines must have the rights to access them.
The files should be publicly visible from any machine with AFS access.
At least I don't see any reason why not.
Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME
MAKING IT ROBUST OR SOMETHING... the pythons should produce Lucene
indexes, which will be sent back and gathered somewhere (I hope that
is possible), and then we will manually merge them.
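For the manual merge I imagine something like this with PyLucene - only a
sketch, the paths are placeholders and the exact class and method names
(SimpleFSDirectory, addIndexesNoOptimize vs. addIndexes, the IndexWriter
constructor) depend on the Lucene/PyLucene version we end up with:

import lucene
lucene.initVM()
from lucene import (SimpleFSDirectory, File, IndexWriter,
                    StandardAnalyzer, Version)

# shards gathered back from the workers (placeholder paths)
shard_dirs = [SimpleFSDirectory(File(p)) for p in
              ["/data/shards/worker-01", "/data/shards/worker-02"]]

target = SimpleFSDirectory(File("/data/merged-index"))   # placeholder
writer = IndexWriter(target, StandardAnalyzer(Version.LUCENE_30),
                     True, IndexWriter.MaxFieldLength.UNLIMITED)
writer.addIndexesNoOptimize(shard_dirs)   # addIndexes(...) on newer Lucene
writer.optimize()
writer.close()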
To test it, I just need one machine (a webserver) where I put my
script to coordinate it, and some machine(s) serving the PDF files
over the network (Apache).
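Something as small as this would do for the coordinator - a sketch only;
the queue and the port are made up and there is no locking or persistence:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# placeholder queue of documents to dispatch: (id, url) pairs
QUEUE = [("arxiv:010546", "http://some-other-machine/pdf/010546.pdf")]

class HungryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if QUEUE:
            doc_id, url = QUEUE.pop(0)
            reply = {"wait": 1, "doc": "new", "url": url, "id": doc_id}
        else:
            reply = {"stop": 1}             # nothing left, stop the pythons
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), HungryHandler).serve_forever()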
What do you reckon?
You can start by doing basic tests at CERN and later send jobs to the GRID
or the recently established D4Science Hadoop cloud. Why don't we meet
tomorrow, before the INSPIRE Evo meeting, and discuss?
I won't be able to come tomorrow, I am sorry; we will have to discuss
it some other day -- but if you say we will run it on Hadoop now, no
problem. We will make it work one way or the other.
Once you have a stable binary or Python package ready, you can send some
test jobs to the GRID as well as to the Hadoop facility and make a
comparison. I will now set up a test machine with a Hadoop interface and
let you know, to avoid spamming the list.
Thanks for the input!
roman
Cheers,
Jan