Re: managing content size in segments folder

TDLN Sat, 17 Jun 2006 00:45:51 -0700

Yes, correct, the summary is retrieved from parse_text and not from the
content, so it is not affected by this property. I should have checked
that, but I am happy that it works so far.


Regarding the URL: I am not 100% sure it will serve your needs, but
you can investigate usage of the
org.apache.nutch.net.RegexUrlNormalizer.

Rgrds, Thomas

On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:

Thanks, I didn't see the fetcher.store.content attribute in 0.7 so I
updated to 0.8-dev.  It worked as advertised, and it seems like summary and
search context still work.  The only thing affected was the cached view of
the file.  I didn't limit http.content.limit because I do want all of the
log files indexed.

My log files are on one server's filesystem. I want to index them via a local
search (file:///logs) but then present the URL link as coming from an http root
so that other users can fetch the files.  Currently I fetch them from the
local webserver, but that's a little inefficient since I know where the files
are locally on the FS.  Has anyone done a local search but used http URLs
for the search results?

I could modify search.jsp to replace my file:// root with an http root, but
that seems a little hacky.  Does anyone know if there is a regex-url filter
for post-processing of the link URLs?  I tried using the regex-url filter
but it modified the URL before the fetcher used it.  I want to modify it via
regex when it is entered into the URL index or when it is displayed.
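
For illustration only, the display-time rewrite described above could look
roughly like the sketch below. LinkRewriter, the /logs root and the
logserver.example.com host are all made-up names, not anything that ships
with Nutch; the assumption is that search.jsp would call something like
toHttp() on each hit URL just before printing the link, so the index and
the fetcher keep seeing the original file: URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: rewrites an indexed file: URL
// into an http: URL at display time, leaving the index untouched.
class LinkRewriter {
    // Assumed local root the logs were crawled from.
    // Matches both file:/logs/ and file:///logs/ forms.
    private static final Pattern FILE_ROOT =
            Pattern.compile("^file:(?://)?/logs/");

    // Assumed web root that serves the same directory over http.
    private static final String HTTP_ROOT =
            "http://logserver.example.com/logs/";

    // Returns the http form of a file: URL under the log root;
    // any other URL is returned unchanged.
    static String toHttp(String url) {
        Matcher m = FILE_ROOT.matcher(url);
        return m.find()
                ? m.replaceFirst(Matcher.quoteReplacement(HTTP_ROOT))
                : url;
    }

    public static void main(String[] args) {
        System.out.println(toHttp("file:///logs/app/2006-06-17.log"));
    }
}
```

Doing the rewrite at display time keeps the mapping in one place and avoids
the problem of the URL being changed before the fetcher uses it.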

Thanks,

-roberto

On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> I mean disable the cache link in the search.jsp.
>
> On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > As far as I know, content in the segments is used to generate the
> > summary in the search results and of course for the cache feature.
> >
> > If you don't need these you can adjust the fetcher.store.content and
> > http.content.limit config properties. Also you might have to change
> > search.jsp.
> >
> > Rgrds, Thomas
> >
> > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > I've been using nutch to index production log files from a client
> > > application.  It's been a great tool because we do get a large volume
> > > of logs from the field and often have to go through complicated
> > > pattern searches.  Lately we've had some issues managing our disk
> > > space.  I noticed that nutch keeps all of the content in the segments
> > > content folder.  Is there a reason all of the content is stored?  I
> > > didn't see any obvious setting for just indexing and not keeping the
> > > content.
> > >
> > > I do use the more search plugins to do filtering by date and url.
> > > Maybe these require the content in the content folders?  Any help
> > > would be much appreciated.
> > >
> > > Roberto
> > >
> > >
> >
>
