Re: managing content size in segments folder

TDLN Fri, 23 Jun 2006 11:06:26 -0700

If you think this is a useful feature for others you can create a
custom IndexingPlugin (instead of adding it to BasicIndexingFilter),
create a patch file and attach the patch to a JIRA issue. Others can
then vote for it.


It *is* an open source project, you know :)

Rgrds, Thomas

On 6/19/06, Roberto Monge <[EMAIL PROTECTED]> wrote:

Ok that's what I figured, just thought I'd check to see if there were any
URL normalizer tricks I could do.  I'll just add a new field to the
BasicIndexingFilter where I replace the local file:/ url with my publicUrl
and display that on the search.jsp page.  I think this would be a good
general feature. Many intranets have local file repositories that would be
more effective to search locally while still being made available via an
http service.  My logs repository has a new folder for each day so I just
have a cron job feed the file urls directly into nutch (no need to crawl).
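
A minimal sketch of the prefix replacement described above, written as plain Java with no Nutch dependencies. The class name PublicUrlMapper, the method toPublicUrl, and both example roots (file:///logs/ and http://logs.example.com/) are assumptions made up for illustration; they are not part of Nutch. The actual hook into an indexing filter is left out, since its signature depends on the Nutch version.

```java
import java.util.Objects;

/** Hypothetical helper: maps a local file: URL to the public http URL shown in results. */
final class PublicUrlMapper {
    private final String localRoot;   // assumed local prefix, e.g. "file:///logs/"
    private final String publicRoot;  // assumed public prefix, e.g. "http://logs.example.com/"

    PublicUrlMapper(String localRoot, String publicRoot) {
        this.localRoot = Objects.requireNonNull(localRoot);
        this.publicRoot = Objects.requireNonNull(publicRoot);
    }

    /** Swaps the local root for the public root; URLs outside the local root are returned unchanged. */
    String toPublicUrl(String url) {
        if (url.startsWith(localRoot)) {
            return publicRoot + url.substring(localRoot.length());
        }
        return url;
    }

    public static void main(String[] args) {
        PublicUrlMapper mapper =
            new PublicUrlMapper("file:///logs/", "http://logs.example.com/");
        System.out.println(mapper.toPublicUrl("file:///logs/2006-06-19/app.log"));
    }
}
```

The value returned by toPublicUrl would be stored in the extra index field, so the url used for fetching stays untouched and only the displayed link changes.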

Thanks again,

roberto


On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
> URL in the database, thus affecting (re)fetching of your log files.
> Thus this might not be the way to go.
>
> Instead you might want to change the BasicIndexingFilter where the URL
> is indexed so that the change only affects the value stored in the
> index, which is what you want.
>
> You have to figure it out, basically :)
>
> Rgrds, Thomas
>
> On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> > Yes, correct, summary is retrieved from parse_text and not from the
> > content so it is not affected by this property. Should have checked
> > that, but I am happy that it works so far.
> >
> > Regarding the URL: I am not 100% sure it will serve your needs, but
> > you can investigate usage of the
> > org.apache.nutch.net.RegexUrlNormalizer.
> >
> > Rgrds, Thomas
> >
> > On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > Thanks, I didn't see the fetcher.store.content attribute in 0.7 so I
> > > > updated to 0.8-dev.  It worked as advertised, and it seems like
> > > > summary and search context still work.  The only thing affected was
> > > > the cached view of the file.  I didn't limit http.content.limit
> > > > because I do want all of the log files indexed.
> > >
> > > > My log files are on one server's filesystem. I want to index them via
> > > > a local search of file:///logs but then present the url link as coming
> > > > from an http root so that other users can fetch the files.  Currently
> > > > I fetch them from the local webserver, but that's a little inefficient
> > > > since I know where the files are locally on the FS.  Has anyone done a
> > > > local search but used http urls for the search results?
> > >
> > > > I could modify search.jsp to replace my file:// root with an http
> > > > root, but that seems a little hacky.  Does anyone know if there is a
> > > > regex-url filter for post-processing of the link urls?  I tried using
> > > > the regex-url filter but it modified the url before the fetcher used
> > > > it.  I want to modify it via regex when it is entered into the url
> > > > index or when it is displayed.
> > >
> > > Thanks,
> > >
> > > -roberto
> > >
> > >
> > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I mean disable the cache link in the search.jsp.
> > > >
> > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > > > As far as I know, content in the segments is used to generate the
> > > > > > summary in the search results and of course for the cache feature.
> > > > >
> > > > > > If you don't need these you can adjust the fetcher.store.content
> > > > > > and http.content.limit config properties. Also you might have to
> > > > > > change search.jsp.
> > > > >
> > > > > Rgrds, Thomas
> > > > >
> > > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > > > I've been using nutch to index production log files from a
> > > > > > > client application.  It's been a great tool because we do get a
> > > > > > > large volume of logs from the field and often have to go through
> > > > > > > complicated pattern searches.  Lately we've been having some
> > > > > > > issues managing our disk space.  I noticed that nutch keeps all
> > > > > > > of the content in the segments content folder.  Is there a
> > > > > > > reason all of the content is stored?  I didn't see any obvious
> > > > > > > setting for just indexing and not keeping the content.
> > > > > >
> > > > > > > I do use the more search plugins to do filtering by date and
> > > > > > > url.  Maybe these require the content in the content folders?
> > > > > > > Any help would be much appreciated.
> > > > > >
> > > > > > Roberto
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
