Re: [Nutch-general] managing content size in segments folder

TDLN Sat, 17 Jun 2006 01:12:27 -0700

Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
URL in the database, which would affect (re)fetching of your log files,
so this might not be the way to go.


Instead you might want to change the BasicIndexingFilter where the URL
is indexed so that the change only affects the value stored in the
Index - which is what you want.
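
A minimal sketch of the prefix swap itself, in plain Java. The class name
DisplayUrlRewriter and the two roots used below are hypothetical and not
part of Nutch; how to call it from BasicIndexingFilter depends on your
Nutch version and is not shown here.

```java
/**
 * Hypothetical helper (not a Nutch class): maps a URL under a local
 * file:// root to the matching URL under a public http:// root.
 */
final class DisplayUrlRewriter {
    private final String localRoot;   // e.g. "file:///logs/" (assumed)
    private final String publicRoot;  // e.g. "http://logs.example.com/" (assumed)

    DisplayUrlRewriter(String localRoot, String publicRoot) {
        this.localRoot = localRoot;
        this.publicRoot = publicRoot;
    }

    /** Returns the rewritten URL, or the input unchanged if it is not under localRoot. */
    String rewrite(String url) {
        if (url != null && url.startsWith(localRoot)) {
            // Keep the path below the root, swap only the prefix.
            return publicRoot + url.substring(localRoot.length());
        }
        return url;
    }
}
```

With the roots above, rewrite("file:///logs/app/2006-06-15.log") would give
"http://logs.example.com/app/2006-06-15.log". The fetcher keeps seeing the
file:// URL, because only the value handed to the index is changed.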

You have to figure it out, basically :)

Rgrds, Thomas

On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> Yes, correct, summary is retrieved from parse_text and not from the
> content so it is not affected by this property. Should have checked
> that, but I am happy that it works so far.
>
> Regarding the URL: I am not 100% sure it will serve your needs, but
> you can investigate usage of the
> org.apache.nutch.net.RegexUrlNormalizer.
>
> Rgrds, Thomas
>
> On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > Thanks, I didn't see the fetcher.store.content attribute in 0.7 so I
> > updated to 0.8-dev.  It worked as advertised; summary and search context
> > still seem to work.  The only thing affected was the cached view of
> > the file.  I didn't limit http.content.limit because I do want all of the
> > log files indexed.
> >
> > My log files are on one server's filesystem.  I want to index them via a local
> > search (file:///logs) but then present the URL link as coming from an http root
> > so that other users can fetch the files.  Currently I fetch them from the
> > local webserver, but that's a little inefficient since I know where the files
> > are locally on the FS.  Has anyone done a local search but used http URLs
> > for the search results?
> >
> > I could modify search.jsp to replace my file:// root with an http root, but
> > that seems a little hacky.  Does anyone know if there is a regex-url filter
> > for post-processing of the link URLs?  I tried using the regex-url filter
> > but it modified the URL before the fetcher used it.  I want to apply the
> > regex when the URL is entered into the index or when it is displayed.
> >
> > Thanks,
> >
> > -roberto
> >
> >
> > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > >
> > > I mean disable the cache link in the search.jsp.
> > >
> > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > As far as I know, content in the segments is used to generate the
> > > > summary in the search results and of course for the cache feature.
> > > >
> > > > If you don't need these you can adjust the fetcher.store.content and
> > > > http.content.limit config properties. Also you might have to change
> > > > search.jsp.
> > > >
> > > > Rgrds, Thomas
> > > >
> > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > I've been using nutch to index production log files from a client
> > > > > application.  It's been a great tool because we get a large volume of
> > > > > logs from the field and often have to go through complicated pattern
> > > > > searches.  Lately we've had some issues managing our disk space.  I
> > > > > noticed that nutch keeps all of the content in the segments content
> > > > > folder.  Is there a reason all of the content is stored?  I didn't see
> > > > > any obvious setting for just indexing and not keeping the content.
> > > > >
> > > > > I do use the more search plugins to do filtering by date and URL.  Maybe
> > > > > these require the content in the content folders?  Any help would be
> > > > > muchly appreciated.
> > > > >
> > > > > Roberto
> > > > >
> > > > >
> > > >
> > >
> >
> >
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general