Hi David,

Thanks. Is there a way in Nutch to reindex files based on their last-modified date? I have a large number of PDF and DOC files in a folder. Do I need to reindex all of the files every time I want to update my index?

On 2/8/06, David Wallace <[EMAIL PROTECTED]> wrote:
>
> Hi Saravanaraj,
> For each URL, Nutch reads your filter file from top to bottom, until it
> finds a line (+ or -) that matches the URL. Then it stops reading.
> Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
> because they match the line that says +^file:/E:/Index Samples/.
>
> I suggest you swap over the two lines in the filter file: put
> -^file:/E:/Index Samples/Index/ BEFORE +^file:/E:/Index Samples/; so
> that Nutch encounters it first, when deciding whether to include files
> in that directory.
>
> Regards,
> David.
>
>
> On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:
> > Hi i am using nutch to index files in local FS and FTP.
> >
> > my filter file is
> >
> > -^(http|ftp|mailto):
> >
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> > [EMAIL PROTECTED]
> > -.*(/.+?)/.*?\1/.*?\1/
> > +^file:/E:/Index Samples/
> > -^file:/E:/Index Samples/Index/
> >
> > but nutch crawls the forbidden folders also. is there a web db kind of thing
> > for files also. is it possible to make nutch to index files based on the
> > last modified date.
> >
> > can anybody suggest the datastructure for webdb (filedb??) for files. it
> > will be good to group files and create seperate segements for each group. so
> > if some files are changed, only those segments can be replaced.
> >
> > Rgds,
> > D.Saravanaraj
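
For anyone reading this thread later, the first-match-wins behaviour David describes can be sketched in a few lines of Python. This is only an illustration of the rule ordering, not Nutch's actual code: the function names `load_rules` and `accepts` are made up for this example, and rejecting a URL that matches no rule is my assumption about the default.

```python
import re


def load_rules(text):
    """Parse filter lines into (include, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the FIRST rule whose regex matches the URL."""
    for include, regex in rules:
        if regex.search(url):
            return include  # stop reading at the first match
    return False  # assumption: a URL that matches no rule is rejected


# The two relevant lines in the order from the original post ...
ORIGINAL = load_rules(
    "+^file:/E:/Index Samples/\n"
    "-^file:/E:/Index Samples/Index/\n"
)
# ... and in the swapped order David suggests.
SWAPPED = load_rules(
    "-^file:/E:/Index Samples/Index/\n"
    "+^file:/E:/Index Samples/\n"
)

forbidden = "file:/E:/Index Samples/Index/a.pdf"
wanted = "file:/E:/Index Samples/docs/b.doc"

print(accepts(ORIGINAL, forbidden))  # the + line matches first
print(accepts(SWAPPED, forbidden))   # the - line matches first
print(accepts(SWAPPED, wanted))      # only the + line matches
```

With the original order the forbidden file is accepted, because the broader + line is reached before the more specific - line. After the swap it is rejected, while files elsewhere under E:/Index Samples/ are still included.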
