>> What would be the FASTEST way to ACCESS pdf documents that we
>> harvested over the network?
>>
>
> I guess the fastest way is to take a list of files from Valkyrie and
> access/process them on machines with AFS access. We can set up a small
> testing facility at CERN. The alternative is to send jobs to the GRID or
> clouds, so that the files are downloaded directly from arXiv.

I think arXiv harvesting is not an option for the moment

>>
>> My idea is to package my indexing into one big Python egg and
>> distribute this 'creature' with a starting script onto Hadoop or
>> whatever Jan or Jukka make available.
>>
>
> OK. You can also think of doing the task in two parts:
> 1. semantic analysis which produces a new file stored in some FS (afs,
> hdfs...)
> 2. indexing the documents produced in the previous step

This dirty test should just show whether we are able to process the files
in some reasonable time (and it will also show roughly the indexing
speed - for the moment, splitting it into two phases seems like a
complication to me)

>
> The advantage is that for the indexing step you can use a standard tool -
> Katta (http://katta.sourceforge.net/), which will take care of the
> distributed indexing with Hadoop and merge the index shards automatically
> (including fault tolerance and load-balancing issues).

But I actually need Python to be involved in the indexing, so Katta
is not an option for me - unless it can merge the indexes for us (I
haven't studied the details). The indexing must be controlled by Python.
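
If Katta cannot do that under Python's control, I guess we could still
do the merge step ourselves from Python with PyLucene - a very rough,
untested sketch (class names are from PyLucene ~3.x and may differ by
version):

import lucene
lucene.initVM()
from lucene import (SimpleFSDirectory, File, IndexWriter,
                    StandardAnalyzer, Version)

def merge_shards(shard_paths, merged_path):
    # open the target index and pull in every shard directory
    merged = SimpleFSDirectory(File(merged_path))
    writer = IndexWriter(merged, StandardAnalyzer(Version.LUCENE_CURRENT),
                         True, IndexWriter.MaxFieldLength.UNLIMITED)
    shards = [SimpleFSDirectory(File(p)) for p in shard_paths]
    writer.addIndexesNoOptimize(shards)  # may need an explicit JArray,
                                         # depending on the PyLucene version
    writer.optimize()
    writer.close()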

>>
>> The individual eggs will turn into living pythons (somewhere in the
>> grid, I don't care where), then using special HTTP telepathy they ask
>> for new food
>>
>> eg: http://some-machine.cern.ch/hungry
>>
>> json reply:
>>
>> { "wait": 1,
>>   "doc": "new",
>>   "url": "http://some-other-machine/pdf/010546.pdf",
>>   "id": "arxiv:010546" }
>>
>>
>> So the python will wait one second before s/he fetches
>> http://some-other-machine/pdf/010546.pdf
>>
>> This should continue until:
>>
>> 1. the webserver serving the PDFs crashes, times out, explodes, etc.
>> 2. or it sends {"stop": 1}
>>
>>
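
Just to make the telepathy concrete, the loop I have in mind is roughly
this (untested sketch; index_one_pdf stands in for the real indexing
code that lives inside the egg):

import json
import time
import urllib2

COORDINATOR = 'http://some-machine.cern.ch/hungry'

def index_one_pdf(doc_id, pdf_bytes):
    pass  # placeholder: here the egg would feed the PDF to the Lucene indexer

def run_worker():
    while True:
        try:
            reply = json.loads(urllib2.urlopen(COORDINATOR, timeout=30).read())
        except Exception:
            break                        # condition 1: webserver crashed/timed out
        if reply.get('stop'):
            break                        # condition 2: {"stop": 1}
        time.sleep(reply.get('wait', 1))
        pdf = urllib2.urlopen(reply['url']).read()
        index_one_pdf(reply['id'], pdf)

if __name__ == '__main__':
    run_worker()
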
>
> This Master-Worker task dispatcher will be useful in the GRID case. For
> MapReduce, however, the idea is that you pass the entire set of input files
> as an argument, and all the parallelization, load balancing and fault
> tolerance is done for you.

So, if I get it right, we are going to use Hadoop for now?

1. I prepare Linux executables
2. I somehow partition the files to be indexed into groups (do I need to
know how many machines/groups to prepare for?)
3. I send the jobs (I don't know yet how to send them)

For the moment, I still don't know whether I will be able to freeze
everything into a Linux binary (that would be the simplest option for
running) - if not, it will be Python packages.

I am not going to send the input files themselves, right? I don't want
to deal with packaging and shuffling gigabytes over the network. I can
send a list of files, correct?

...the last time I tried accessing the arXiv docs, I had to log in to
some machine, so the Hadoop machines must have the rights to access them
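
If it is Hadoop Streaming we end up with, I imagine the job input would
literally be the list of PDF URLs (one per line) and the mapper fetches
and indexes each one - something along these lines (rough sketch, the
indexing call is a placeholder):

#!/usr/bin/env python
# streaming mapper: Hadoop feeds it lines of the input (one PDF URL per
# line), so only the *list* of files travels with the job
import sys
import urllib2

def index_pdf(url, pdf_bytes):
    pass  # placeholder: the real Lucene indexing from the egg goes here

def main():
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        pdf = urllib2.urlopen(url).read()
        index_pdf(url, pdf)
        print '%s\t%d' % (url, len(pdf))  # emit something so the job has output

if __name__ == '__main__':
    main()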

>>
>> Please note, THIS IS MADE ONLY TO TEST, I DON'T WANT TO SPEND TIME
>> MAKING IT ROBUST OR SOMETHING... the pythons should produce Lucene
>> indexes, which will be sent back and gathered somewhere (I hope that
>> is possible), and then we will manually merge them.
>>
>> To test it, I just need one machine (a webserver) to coordinate
>> it, where I put my script, and some machine(s) serving the PDF files
>> over the network (Apache).
>>
>> What do you reckon?
>>
>>
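
For the coordinating webserver, the standard library would probably be
enough for the test - a sketch of the /hungry handler (deliberately not
robust; the queue would really come from the harvested file list):

import json
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

QUEUE = ['http://some-other-machine/pdf/010546.pdf']  # from the harvested list

class Hungry(BaseHTTPRequestHandler):
    def do_GET(self):
        if QUEUE:
            url = QUEUE.pop(0)
            reply = {'wait': 1, 'doc': 'new', 'url': url,
                     'id': 'arxiv:' + url.rsplit('/', 1)[-1].replace('.pdf', '')}
        else:
            reply = {'stop': 1}
        body = json.dumps(reply)
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    HTTPServer(('', 8080), Hungry).serve_forever()
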
>
> You can start by doing basic tests at CERN and later send jobs to the GRID
> or the recently established D4Science Hadoop cloud. Why don't we meet
> tomorrow, before the INSPIRE Evo meeting, and discuss this?

I won't be able to come tomorrow, I am sorry; we will have to discuss it
some other day. But if you say we will run it on Hadoop now, no
problem. We will make it work one way or another.

Thanks for input!

roman

>
>> Roman
>>
>
> Cheers,
> Jan
>
>
