I would like to see something like three states: active, in-process,
and inbound. Active data is live and on the query servers (both the
indexes and the corresponding segments), in-process covers tasks
currently being mapped out, and inbound is processes/data pending
processing.

Active nodes report as being in the search pool. In-process nodes are
really "data" nodes doing all of the number
crunching/import/merging/indexing, and inbound is everything in
fetch/pre-processing.
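
To make the states concrete, something like the sketch below is what I
have in mind. The names are mine and purely illustrative; nothing here
exists in Nutch/Hadoop today.

    // Illustrative only: the three node roles described above.
    public class NodeRoleSketch {

      enum Role {
        ACTIVE,      // in the search pool: serves indexes + corresponding segments
        IN_PROCESS,  // "data" node: number crunching, import, merging, indexing
        INBOUND      // fetch / pre-processing; data pending processing
      }

      static class Node {
        final String host;
        Role role;
        Node(String host, Role role) { this.host = host; this.role = role; }
      }

      // Only active nodes report themselves as part of the search pool.
      static boolean inSearchPool(Node n) {
        return n.role == Role.ACTIVE;
      }

      public static void main(String[] args) {
        Node n = new Node("search01", Role.ACTIVE);
        System.out.println(n.host + " in search pool: " + inSearchPool(n));
      }
    }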

The cycle would be a pull cycle: active nodes pull from the
corresponding data nodes, which in turn pull from the corresponding
inbound nodes. Events/batches could trigger the pull so that it is a
complete or usable data set. Some lightweight workflow engine could let
you process/manage the cycle of data.
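
In rough Java terms the cycle might look like the sketch below; the
class and method names are invented for illustration and are not
existing Nutch/Hadoop APIs. A lightweight workflow engine would simply
drive dataPull()/activePull() whenever a batch or segment-ready event
fires.

    import java.util.LinkedList;
    import java.util.Queue;

    // Illustration only: one turn of the pull cycle, with in-memory queues
    // standing in for the inbound, data, and active tiers.
    public class PullCycleSketch {
      private final Queue<String> inbound = new LinkedList<String>();   // fetched/pre-processed batches
      private final Queue<String> processed = new LinkedList<String>(); // segments ready to go live

      void receiveBatch(String batch) { inbound.add(batch); }

      // Data tier pulls a complete batch from inbound and crunches it.
      void dataPull() {
        String batch = inbound.poll();
        if (batch != null) {
          processed.add(batch + "-indexed");   // stand-in for merge/index work
        }
      }

      // Active tier pulls a finished segment from the data tier and serves it.
      void activePull() {
        String segment = processed.poll();
        if (segment != null) {
          System.out.println("now serving " + segment);
        }
      }

      public static void main(String[] args) {
        PullCycleSketch cycle = new PullCycleSketch();
        cycle.receiveBatch("segment-001");
        cycle.dataPull();    // a workflow engine would trigger these
        cycle.activePull();  // on batch/segment-ready events
      }
    }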

I would also like to see it be DFS block-aware: able to process the
data on the active data server where that data resides (as much as
possible). Such a "file table" could be used to associate the data
through the entire process stream and allow for fairly linear growth.
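
By "file table" I mean something as simple as the sketch below, which
just remembers which node holds which dfs block so work can be
scheduled where the data already lives. Again, the names are mine, not
real Nutch/Hadoop classes.

    import java.util.HashMap;
    import java.util.Map;

    // Invented sketch of a "file table": blockId -> host holding that block.
    public class FileTableSketch {
      private final Map<String, String> blockLocations = new HashMap<String, String>();

      void register(String blockId, String host) {
        blockLocations.put(blockId, host);
      }

      // Prefer the node that already has the block; otherwise fall back.
      String preferredHost(String blockId, String fallback) {
        String host = blockLocations.get(blockId);
        return host != null ? host : fallback;
      }

      public static void main(String[] args) {
        FileTableSketch table = new FileTableSketch();
        table.register("segment-001/part-0", "data03");
        System.out.println(table.preferredHost("segment-001/part-0", "data01"));
      }
    }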

Such a system could also be aware of its own capacity, in that the
inbound processes would fail/halt if the disk space on the dfs system
isn't capable of handling new tasks, and vice versa: if the active
nodes are at capacity, tasks could be told to stop/hold. You could use
this logic to add more nodes where necessary, resume processing, and
chart your growth.
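
The capacity check itself could be as dumb as the sketch below; the
thresholds and names are made up purely for illustration.

    // Invented sketch: halt inbound work when dfs space is short, and hold
    // new tasks when the active tier is already at capacity.
    public class CapacityGuardSketch {
      private final long minFreeDfsBytes;
      private final int maxActiveTasks;

      CapacityGuardSketch(long minFreeDfsBytes, int maxActiveTasks) {
        this.minFreeDfsBytes = minFreeDfsBytes;
        this.maxActiveTasks = maxActiveTasks;
      }

      boolean acceptInbound(long freeDfsBytes) {
        return freeDfsBytes > minFreeDfsBytes;      // else fail/halt new inbound tasks
      }

      boolean acceptActiveTask(int runningActiveTasks) {
        return runningActiveTasks < maxActiveTasks; // else stop/hold until nodes are added
      }

      public static void main(String[] args) {
        CapacityGuardSketch guard = new CapacityGuardSketch(10L * 1024 * 1024 * 1024, 100);
        System.out.println("accept inbound: " + guard.acceptInbound(2L * 1024 * 1024 * 1024));
        System.out.println("accept active:  " + guard.acceptActiveTask(42));
      }
    }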

I come from the ERP/Oracle world, so I have very much learned to
appreciate distributed architecture and the concept of an aware system,
such as concurrent processing that is distributed across many nodes,
aware of the status of each task, able to hold/wait or act on the
condition of the system, and able to grow fairly linearly as needed.

-byron

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Hi Andrzej,
> 
> >
> > * merge 80 segments into 1. A lot of IO involved... and you have to
> > repeat it from time to time. Ugly.
> I agree.
> >
> > * implement a search server as a map task. Several challenges: it
> > needs to partition the Lucene index, and it has to copy all parts
> > of segments and indexes from DFS to the local storage, otherwise
> > performance will suffer. However, the number of open files per
> > machine would be reduced, because (ideally) each machine would deal
> > with few or a single part of segment and a single part of index...
> 
> Well, I played around and already had a kind of prototype.
> I saw the following problems:
>
> + having a kind of repository of active search servers
> possibility A: find all tasktrackers running a specific task (already
> discussed in the hadoop mailing list)
> possibility B: having an rpc server running in the jvm that runs the
> search server client, add the hostname to the jobconf and, similar to
> task - jobtracker, have the search server announce itself via
> heartbeat to the search server 'repository'.
>
> + having the index locally and the segment in the dfs.
> ++ adding to NutchBean init one dfs for the index and one for the
> segments could fix this, or, more generally, add support for stream
> handlers like dfs:// vs file://. (very long term)
>
> + downloading an index from dfs when the mapper starts, or just index
> the segment data to local hdd and let the mapper run for the next 30
> days?
>
> Stefan
> 


