If you think this is a useful feature for others, you can create a custom IndexingPlugin (instead of adding it to BasicIndexingFilter), create a patch file, and attach the patch to a JIRA issue. Others can then vote for it.

It *is* an open source project, you know :)

Rgrds, Thomas

On 6/19/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
Ok, that's what I figured, just thought I'd check to see if there were any url Normalizer tricks I could do. I'll just add a new field to the BasicIndexingFilter where I replace the local file:/ url with my publicUrl and display that on the search.jsp page.

I think this would be a good general feature. Many intranets have localized file repositories which would be more effective to search locally but make them available via an http service. My logs repository has a new folder for each day so I just have a cron feed the file urls directly into nutch (no need to crawl).

Thanks again,
roberto

On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
> URL in the database, thus affecting (re)fetching of your log files.
> Thus this might not be the way to go.
>
> Instead you might want to change the BasicIndexingFilter where the URL
> is indexed so that the change only affects the value stored in the
> index - which is what you want.
>
> You have to figure it out, basically :)
>
> Rgrds, Thomas
>
> On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> > Yes, correct, summary is retrieved from parse_text and not from the
> > content so it is not affected by this property. Should have checked
> > that, but I am happy that it works so far.
> >
> > Regarding the URL: I am not 100% sure it will serve your needs, but
> > you can investigate usage of the
> > org.apache.nutch.net.RegexUrlNormalizer.
> >
> > Rgrds, Thomas
> >
> > On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > Thanks, I didn't see the fetcher.store.content attribute in 0.7 so I
> > > updated to 0.8-dev. It worked as advertised, it seems like summary and
> > > search context still work. The only thing affected was the cached
> > > view of the file. I didn't limit http.content.limit because I do
> > > want all of the log files indexed.
> > >
> > > My log files are on one server's filesystem. I want to index them
> > > via a local search (file:///logs) but then present the url link as
> > > coming from an http root so that other users can fetch the files.
> > > Currently I fetch them from the local webserver but that's a little
> > > inefficient since I know where the files are locally on the FS. Has
> > > anyone done a local search but used http urls for the search
> > > results?
> > >
> > > I could modify search.jsp to replace my file:// root with an http
> > > root, but that seems a little hacky. Does anyone know if there is a
> > > regex-url filter for post processing of the link urls? I tried
> > > using the regex-url filter but it modified the url before the
> > > fetcher used it. I want to modify via regex when entered into the
> > > url index or when displayed.
> > >
> > > Thanks,
> > >
> > > -roberto
> > >
> > >
> > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I mean disable the cache link in the search.jsp.
> > > >
> > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > > As far as I know, content in the segments is used to generate
> > > > > the summary in the search results and of course for the cache
> > > > > feature.
> > > > >
> > > > > If you don't need these you can adjust the fetcher.store.content
> > > > > and http.content.limit config properties. Also you might have
> > > > > to change search.jsp.
> > > > >
> > > > > Rgrds, Thomas
> > > > >
> > > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > > I've been using nutch to index production log files from a
> > > > > > client application. It's been a great tool because we do get
> > > > > > a large volume of logs from the field and often have to go
> > > > > > through complicated pattern searches. Lately we've had some
> > > > > > issues managing our disk space. I noticed that nutch keeps
> > > > > > all of the content in the segments content folder.
> > > > > > Is there a reason all of the content is stored? I didn't see
> > > > > > any obvious setting for just indexing and not keeping the
> > > > > > content.
> > > > > >
> > > > > > I do use the more search plugins to do filtering by date and
> > > > > > url. Maybe these require the content in the content folders?
> > > > > > Any help would be muchly appreciated.
> > > > > >
> > > > > > Roberto
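
For anyone picking this thread up later: the URL rewrite Roberto describes (swap the local file:/ root for a public http root, and store only the rewritten value in the index) comes down to a prefix replacement. Below is a minimal, standalone Java sketch of just that step. It is not Nutch code and does not use the Nutch plugin API. The class name PublicUrlMapper, the roots "file:/logs/" and "http://logs.example.com/logs/", and the sample log path are all made up for illustration. Wiring it into an indexing filter is left out, since that depends on the Nutch version in use.

```java
import java.util.Objects;

/**
 * Hypothetical helper (not part of Nutch): maps a local file: URL to the
 * public http URL that would be stored in the index and shown in results.
 */
final class PublicUrlMapper {
    private final String localRoot;   // assumed local prefix, e.g. "file:/logs/"
    private final String publicRoot;  // assumed public prefix, e.g. "http://logs.example.com/logs/"

    PublicUrlMapper(String localRoot, String publicRoot) {
        this.localRoot = Objects.requireNonNull(localRoot);
        this.publicRoot = Objects.requireNonNull(publicRoot);
    }

    /** Returns the public URL, or the input unchanged if it is not under localRoot. */
    String toPublicUrl(String url) {
        if (url != null && url.startsWith(localRoot)) {
            // Keep everything after the local root and prepend the public root.
            return publicRoot + url.substring(localRoot.length());
        }
        return url;
    }

    public static void main(String[] args) {
        PublicUrlMapper mapper =
            new PublicUrlMapper("file:/logs/", "http://logs.example.com/logs/");
        // Prints: http://logs.example.com/logs/2006-06-17/app.log
        System.out.println(mapper.toPublicUrl("file:/logs/2006-06-17/app.log"));
    }
}
```

Because the mapping is applied only to the value written to the index, the URL the fetcher sees stays untouched, which is the concern Thomas raised about doing this in the URL normalizer.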
