Re: [Nutch-general] managing content size in segments folder

TDLN Fri, 23 Jun 2006 11:06:30 -0700

If you think this is a useful feature for others you can create a
custom indexing plugin (instead of adding it to BasicIndexingFilter),
create a patch file and attach the patch to a JIRA issue. Others can
then vote for it.
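
For illustration only, the URL rewrite discussed in this thread could be
sketched as a small helper like the one below. This is a minimal sketch,
not Nutch code: the class name PublicUrlRewriter, the method toPublicUrl
and the example roots are all hypothetical, and the wiring into an actual
indexing filter (adding the result as an extra index field) is not shown.

```java
// Hypothetical helper, not part of Nutch: maps a local file: URL onto a
// public http root by plain prefix replacement.
final class PublicUrlRewriter {
    private final String localRoot;   // e.g. "file:/logs/" (assumed layout)
    private final String publicRoot;  // e.g. "http://logs.example.com/" (made-up host)

    PublicUrlRewriter(String localRoot, String publicRoot) {
        this.localRoot = localRoot;
        this.publicRoot = publicRoot;
    }

    // Returns the public URL when the input starts with the local root;
    // any other URL (or null) is returned unchanged.
    String toPublicUrl(String url) {
        if (url != null && url.startsWith(localRoot)) {
            return publicRoot + url.substring(localRoot.length());
        }
        return url;
    }
}
```

Because only the value written to the extra field would change, the
file: URL used for fetching stays untouched, which is the point made
earlier in the thread about not using the normalizer for this.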


It *is* an open source project, you know :)

Rgrds, Thomas

On 6/19/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> Ok that's what I figured, just thought I'd check to see if there were any
> URL normalizer tricks I could do.  I'll just add a new field to the
> BasicIndexingFilter where I replace the local file:/ url with my publicUrl
> and display that on the search.jsp page.  I think this would be a good
> general feature. Many intranets have local file repositories that are
> more effective to index locally but should be served to users via an http
> service.  My logs repository has a new folder for each day, so I just have
> a cron job feed the file urls directly into nutch (no need to crawl).
>
> Thanks again,
>
> roberto
>
>
> On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> >
> > Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
> > URL in the database, thus affecting (re)fetching of your log files.
> > Thus this might not be the way to go.
> >
> > Instead you might want to change the BasicIndexingFilter where the URL
> > is indexed so that the change only affects the value stored in the
> > Index - which is what you want.
> >
> > You have to figure it out, basically :)
> >
> > Rgrds, Thomas
> >
> > On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > Yes, correct, summary is retrieved from parse_text and not from the
> > > content so it is not affected by this property. Should have checked
> > > that, but I am happy that it works so far.
> > >
> > > Regarding the URL: I am not 100% sure it will serve your needs, but
> > > you can investigate usage of the
> > > org.apache.nutch.net.RegexUrlNormalizer.
> > >
> > > Rgrds, Thomas
> > >
> > > On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > Thanks, I didn't see the fetcher.store.content attribute in 0.7 so I
> > > > updated to 0.8-dev.  It worked as advertised, and it seems like summary
> > > > and search context still work.  The only thing affected was the cached
> > > > view of the file.  I didn't limit http.content.limit because I do want
> > > > all of the log files indexed.
> > > >
> > > > My log files are on one server's filesystem. I want to index them via a
> > > > local search (file:///logs) but then present the url link as coming
> > > > from an http root so that other users can fetch the files.  Currently I
> > > > fetch them from the local webserver, but that's a little inefficient
> > > > since I know where the files are locally on the FS.  Has anyone done a
> > > > local search but used http urls for the search results?
> > > >
> > > > I could modify search.jsp to replace my file:// root with an http root,
> > > > but that seems a little hacky.  Does anyone know if there is a regex-url
> > > > filter for post-processing of the link urls?  I tried using the
> > > > regex-url filter but it modified the url before the fetcher used it.  I
> > > > want to modify the url via regex when it is entered into the index or
> > > > when it is displayed.
> > > >
> > > > Thanks,
> > > >
> > > > -roberto
> > > >
> > > >
> > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > I mean disable the cache link in the search.jsp.
> > > > >
> > > > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > > > > As far as I know, content in the segments is used to generate the
> > > > > > > summary in the search results and of course for the cache feature.
> > > > > > >
> > > > > > > If you don't need these you can adjust the fetcher.store.content and
> > > > > > > http.content.limit config properties. Also you might have to change
> > > > > > > search.jsp.
> > > > > >
> > > > > > Rgrds, Thomas
> > > > > >
> > > > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > > > > I've been using nutch to index production log files from a client
> > > > > > > > application.  It's been a great tool because we do get a large
> > > > > > > > volume of logs from the field and often have to go through
> > > > > > > > complicated pattern searches.  Lately we've had some issues
> > > > > > > > managing our disk space.  I noticed that nutch keeps all of the
> > > > > > > > content in the segments content folder.  Is there a reason all of
> > > > > > > > the content is stored?  I didn't see any obvious setting for just
> > > > > > > > indexing and not keeping the content.
> > > > > > >
> > > > > > > > I do use the "more" search plugins to do filtering by date and
> > > > > > > > url.  Maybe these require the content in the content folders?  Any
> > > > > > > > help would be much appreciated.
> > > > > > >
> > > > > > > Roberto
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>
>

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
