JNotify is only local. A simple mapping of paths to HTTP locations could be provided in a config file to get around that. Also, I figure that in an intranet situation, the admin setting up Nutch owns all of the other servers that will need to be fetched from, so (s)he could install Nutch on all those machines to run this tool.
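To make the path-to-HTTP mapping concrete, here is a minimal sketch of what that config could drive. The class name, method names, and the example prefixes (e.g. `http://docs.example.intranet`) are all made up for illustration; in practice the entries would be loaded from a properties file the admin maintains on each node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Maps local filesystem prefixes to the HTTP locations Nutch should fetch.
// Entries would normally come from an admin-maintained config file.
public class PathToUrlMapper {
    private final Map<String, String> prefixes = new LinkedHashMap<>();

    // e.g. register("/var/www/docs", "http://docs.example.intranet/docs")
    public void register(String localPrefix, String httpPrefix) {
        prefixes.put(localPrefix, httpPrefix);
    }

    // Translate a changed local file into a fetchable URL,
    // or null if no mapping applies to that path.
    public String toUrl(String localPath) {
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            if (localPath.startsWith(e.getKey())) {
                return e.getValue() + localPath.substring(e.getKey().length());
            }
        }
        return null;
    }
}
```

A first-match-wins lookup over registered prefixes keeps the config trivial; anything fancier (regex rewrites, per-host overrides) could be layered on later if needed.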
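The watcher tool could also avoid kicking off a generate/fetch/update cycle on every single write by waiting for a quiet period after the last change, as described below. This is a minimal sketch of that debounce idea in plain Java; the names `ChangeDebouncer` and `onFileChanged` are hypothetical, and the callback would in practice be wired to a JNotify listener rather than called directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Each file change (re)schedules a timer for that path; the action only
// fires once the file has been quiet for QUIET_MS, so a burst of writes
// results in a single fetch-list entry instead of many.
public class ChangeDebouncer {
    private static final long QUIET_MS = 200; // quiet period; tune for real use

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
    private final Consumer<String> onQuiet;

    public ChangeDebouncer(Consumer<String> onQuiet) {
        this.onQuiet = onQuiet; // e.g. add the path to a pending fetch list
    }

    // Called by the file watcher (e.g. a JNotify callback) on every change.
    public void onFileChanged(String path) {
        ScheduledFuture<?> old = pending.put(path,
                scheduler.schedule(() -> {
                    pending.remove(path);
                    onQuiet.accept(path);
                }, QUIET_MS, TimeUnit.MILLISECONDS));
        if (old != null) {
            old.cancel(false); // earlier timer superseded by this change
        }
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```

The same pattern would work whether the quiet-period action generates a fetch list locally or notifies another node of the change.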
So the tool could be set up in a distributed intranet situation:
- admin sets up Nutch similar to this: http://wiki.apache.org/nutch/NutchHadoopTutorial
- admin crawls and starts this file watcher tool on each machine that has searchable content

If I use a simple solution such as generating a fetch list when a file is changed (or some amount of time after it's changed, to catch other changes), then fetching and updating the db, my thought is that the tool would work as follows:
- file changes on a slave node
- slave node notifies the tool
- tool starts a map/reduce job to generate the fetch list, fetch, update, etc.
- name node (master node?) would be notified of the change to the file system and the index is updated

I don't really know how well that would work, though. Can slave nodes start map/reduce jobs? Should they? Would the task be distributed among the other nodes? Ideally, I suppose, the slave node should react in the following manner:
- file changes on a slave node
- slave node notifies the tool
- tool notifies the master node of the update
- master node starts a map/reduce job to do the update
- this would properly distribute the task of doing the update, right?

With this scenario, I am not sure how (or if it's possible) to notify the master node. So maybe it doesn't scale well, but for an intranet such as ours with one machine doing it all (which is probably similar to a good majority of intranets) it would provide a nice solution.

I hope there is more commentary on this topic, especially in a distributed environment. I would like to come up with something that works in a good range of intranet configurations.

Ben


Michael Wechner wrote:
>
> Ben Ogle wrote:
>
> I don't think there is any standardized way to do this yet. So every
> step in this direction would be a great improvement.
>
>> I mean, is there a nice
>> solution using the tools already provided?
>
> not that I am aware of, but I guess other people have tackled this as
> well.
>
> I think it would be nice to generate an RSS feed or something similar as
> a fetchlist which could also be accessed by other crawlers
>
>> I know each page is time stamped
>> in the database when it is fetched, but does this correspond to the last
>> modified date?
>
> I am still not sure if Nutch is actually comparing the last modifieds. I
> know there exists something called "adddays", but this is more to
> postpone re-crawling for e.g. 30 days
>
>> - Could this be done by using the existing generate/fetch/update cycle with
>> an index update? Is there a way to just fetch and index the pages necessary?
>> I suppose my tool could generate the fetch list(s) (I need to look into this
>> more closely).
>>
>> - Are there any other libraries like JNotify to implement this functionality
>> that anyone knows about? I haven't found any others.
>
> does JNotify also implement protocols, e.g. HTTP? In order to notify
> across networks, or does it only work locally?
>
> Thanks
>
> Michi

--
View this message in context: http://www.nabble.com/File-system-watching-for-intranets-tf2260463.html#a6294406
Sent from the Nutch - Dev forum at Nabble.com.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
