Hi Saravanaraj,

For each URL, Nutch reads your filter file from top to bottom until it
finds a line (+ or -) whose pattern matches the URL. Then it stops reading.

Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
because they match the line that says

+^file:/E:/Index Samples/

I suggest you swap the two lines in the filter file. Put

-^file:/E:/Index Samples/Index/

BEFORE

+^file:/E:/Index Samples/

so that Nutch encounters it first when deciding whether to include files
in that directory.

Regards,
David.

On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:
> Hi i am using nutch to index files in local FS and FTP.
>
> my filter file is
>
> -^(http|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> [EMAIL PROTECTED]
> -.*(/.+?)/.*?\1/.*?\1/
> +^file:/E:/Index Samples/
> -^file:/E:/Index Samples/Index/
>
> but nutch crawls the forbidden folders also. is there a web db kind of thing
> for files also. is it possible to make nutch to index files based on the
> last modified date.
>
> can anybody suggest the datastructure for webdb (filedb??) for files. it
> will be good to group files and create separate segments for each group. so
> if some files are changed, only those segments can be replaced.
>
> Rgds,
> D.Saravanaraj
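
To illustrate the first-match rule described in the reply, here is a minimal Python sketch. It is not Nutch's code. It only mimics the behaviour explained above, and it assumes that a URL matching no line at all is rejected. The function name first_match_filter and the sample file name report.txt are made up for the example.

```python
import re

def first_match_filter(rules, url):
    """Mimic top-to-bottom, first-match-wins filtering.

    Each rule is a string whose first character is '+' (include) or
    '-' (exclude); the rest is a regular expression searched in the URL.
    """
    for line in rules:
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            # The first matching line decides; later lines are never read.
            return sign == "+"
    # Assumption: a URL that matches no line is rejected.
    return False

# The two relevant lines in the order from the original filter file.
original = [
    "+^file:/E:/Index Samples/",
    "-^file:/E:/Index Samples/Index/",
]

# The same two lines with the exclusion placed first, as suggested.
swapped = [
    "-^file:/E:/Index Samples/Index/",
    "+^file:/E:/Index Samples/",
]

# Hypothetical file inside the folder that should be skipped.
url = "file:/E:/Index Samples/Index/report.txt"

print(first_match_filter(original, url))  # True: the '+' line matches first
print(first_match_filter(swapped, url))   # False: the '-' line matches first
```

With the swapped order, files elsewhere under E:/Index Samples/ are still accepted, because the exclusion line does not match them and the inclusion line is reached next.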
