Armel T. Nene wrote:
> * Nutch topN - we have to set the amount of the pages that we want to
> fetch from a root url. When setting topN values Nutch will crawl and fetch
> the number of files given. Therefore, when updating the index (re-crawl)
> Nutch will go and take the same topN files in the directory. The problem
> arise when you just want to fetch a number of files at a time, therefore at
> the next crawl, the crawler only fetches new files from the directory or
> updates changes in the index. I think the problem with real-adaptive

You could find your new or modified files more easily outside of Nutch,
with a command like

    find <root> -mtime <nDays>|sed "s/^/file:\/\//" > fetchlist.txt

which would generate a file of file: URLs for everything found under
<root> that was last changed <nDays> days ago. You would then inject the
list and run generate, fetch, updatedb, index and search as usual.

The initial URL list could be bootstrapped the same way, using find
without the -mtime parameter.

> fetching feature in Nutch is not possible unless Nutch changes the way it
> indexes. By that I mean, when Nutch indexes after a crawl or a re-crawl,
> Nutch doesn't update the current index but creates a new index after each
> indexing. If Nutch has the ability to update its existing index, it will be

It would be interesting to experiment with real-time indexing hooks in
the fetcher, so that it could feed the content into Solr, for example,
while it's hot.

> * I am not entirely sure if this is a bug but here the issue: I have
> set Nutch on MS Windows Server 2003. I have several logical drive such as;
> C, D, E and etc. Nutch is set and running on drive D but when I try to crawl
> a directory from another drive it fails with FileProtcol error 404. I know
> error 404 is for file not found error code. I can crawl any directories from
> the drive where Nutch is installed. I tested it on different Windows server
> and drive but had the same error code. Can you let me know if that's a known
> bug or just a configuration issue from my part. Nutch works fine when it
> crawls any directory in its installed drive.

Can't comment on that because I have no Windows machine available.

--
Sami Siren
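
Where find and sed are not at hand (for example on the Windows setup
described above), the same fetch list could be built with a short Python
script. This is only a rough sketch and not part of Nutch: the function
names build_fetchlist and write_fetchlist are made up for illustration,
and the filter keeps files modified within the last n_days days, which
corresponds to find's "-mtime -N" form rather than an exact-age match.

```python
import os
import time
from pathlib import Path


def build_fetchlist(root, n_days=None, now=None):
    """Return file: URLs for the non-directory entries found under root.

    If n_days is given, keep only entries modified within the last
    n_days days. With n_days=None every file is listed, which matches
    bootstrapping the initial URL list.
    """
    now = time.time() if now is None else now
    cutoff = None if n_days is None else now - n_days * 86400
    urls = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath, name).resolve()
            if cutoff is not None and path.stat().st_mtime < cutoff:
                continue  # not modified since the cutoff, skip it
            # as_uri() needs an absolute path and yields file:///... URLs
            urls.append(path.as_uri())
    return urls


def write_fetchlist(urls, out_path="fetchlist.txt"):
    """Write one URL per line, like the redirect in the find command."""
    Path(out_path).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
```

Calling write_fetchlist(build_fetchlist(root, n_days)) with a root
directory and a day count of your choosing would produce a fetchlist.txt
that can go through the same inject, generate, fetch, updatedb and index
cycle.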
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers