If you think this is a useful feature for others, you can create a custom IndexingPlugin (instead of adding it to BasicIndexingFilter), create a patch file and attach the patch to a JIRA issue. Others can then vote for it.
It *is* an open source project, you know :)

Rgrds, Thomas

On 6/19/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> Ok, that's what I figured, just thought I'd check to see if there were any
> URL normalizer tricks I could do. I'll just add a new field to the
> BasicIndexingFilter where I replace the local file:/ URL with my publicUrl
> and display that on the search.jsp page. I think this would be a good
> general feature. Many intranets have localized file repositories which
> would be more effective to search locally but make available via an
> http service. My logs repository has a new folder for each day, so I just
> have a cron job feed the file URLs directly into Nutch (no need to crawl).
>
> Thanks again,
>
> roberto
>
> On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> > Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
> > URL in the database, thus affecting (re)fetching of your log files.
> > Thus this might not be the way to go.
> >
> > Instead you might want to change the BasicIndexingFilter where the URL
> > is indexed so that the change only affects the value stored in the
> > index - which is what you want.
> >
> > You have to figure it out, basically :)
> >
> > Rgrds, Thomas
> >
> > On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > Yes, correct, the summary is retrieved from parse_text and not from
> > > the content, so it is not affected by this property. Should have
> > > checked that, but I am happy that it works so far.
> > >
> > > Regarding the URL: I am not 100% sure it will serve your needs, but
> > > you can investigate usage of the
> > > org.apache.nutch.net.RegexUrlNormalizer.
> > >
> > > Rgrds, Thomas
> > >
> > > On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > Thanks, I didn't see the fetcher.store.content attribute in 0.7 so
> > > > I updated to 0.8-dev. It worked as advertised; it seems like summary
> > > > and search context still work. The only thing affected was the
> > > > cached view of the file. I didn't limit http.content.limit because
> > > > I do want all of the log files indexed.
> > > >
> > > > My log files are on one server's filesystem. I want to index them
> > > > via a local search (file:///logs) but then present the URL link as
> > > > coming from an http root so that other users can fetch the files.
> > > > Currently I fetch them from the local webserver, but that's a little
> > > > inefficient since I know where the files are locally on the FS. Has
> > > > anyone done a local search but used http URLs for the search
> > > > results?
> > > >
> > > > I could modify search.jsp to replace my file:// root with an http
> > > > root, but that seems a little hacky. Does anyone know if there is a
> > > > regex-url filter for post processing of the link URLs? I tried
> > > > using the regex-url filter but it modified the URL before the
> > > > fetcher used it. I want to modify it via regex when entered into
> > > > the url index or when displayed.
> > > >
> > > > Thanks,
> > > >
> > > > -roberto
> > > >
> > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > > I mean disable the cache link in the search.jsp.
> > > > >
> > > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > > > As far as I know, content in the segments is used to generate
> > > > > > the summary in the search results and of course for the cache
> > > > > > feature.
> > > > > >
> > > > > > If you don't need these you can adjust the fetcher.store.content
> > > > > > and http.content.limit config properties. Also you might have
> > > > > > to change search.jsp.
> > > > > >
> > > > > > Rgrds, Thomas
> > > > > >
> > > > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > > > I've been using Nutch to index production log files from a
> > > > > > > client application. It's been a great tool because we do get
> > > > > > > a large volume of logs from the field and often have to go
> > > > > > > through complicated pattern searches. Lately we've had some
> > > > > > > issues managing our disk space. I noticed that Nutch keeps
> > > > > > > all of the content in the segments content folder. Is there
> > > > > > > a reason all of the content is stored? I didn't see any
> > > > > > > obvious setting for just indexing and not keeping the
> > > > > > > content.
> > > > > > >
> > > > > > > I do use the more search plugins to do filtering by date and
> > > > > > > url. Maybe these require the content in the content folders?
> > > > > > > Any help would be muchly appreciated.
> > > > > > >
> > > > > > > Roberto

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
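
For anyone picking this thread up later: the rewrite Roberto describes (swap the local file:/ root for a public http root, and only in the value that gets stored in the index or displayed) comes down to a prefix replacement. Below is a minimal sketch of just that step in plain Java. It deliberately uses no Nutch classes; the class name PublicUrlRewriter, the method toPublicUrl and the example roots are all hypothetical, and how you wire it into an indexing filter or search.jsp depends on your Nutch version.

```java
/**
 * Hypothetical helper (not part of Nutch): maps a local file:/ URL to the
 * public http URL under which the same file is served.
 */
final class PublicUrlRewriter {
    private final String localRoot;   // assumed example: "file:/logs/"
    private final String publicRoot;  // assumed example: "http://logs.example.com/logs/"

    PublicUrlRewriter(String localRoot, String publicRoot) {
        this.localRoot = localRoot;
        this.publicRoot = publicRoot;
    }

    /**
     * Returns the public URL for a URL under localRoot. URLs that do not
     * start with localRoot (and null) are returned unchanged, so the
     * fetcher-facing URL is never touched by accident.
     */
    String toPublicUrl(String url) {
        if (url == null || !url.startsWith(localRoot)) {
            return url;
        }
        return publicRoot + url.substring(localRoot.length());
    }
}
```

The idea would be to call toPublicUrl on the document URL at indexing time and store the result in a separate field, leaving the original URL field alone so that (re)fetching is not affected, which is the concern Thomas raised about RegexUrlNormalizer.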