OK, that's what I figured. I just thought I'd check whether there were any
URL normalizer tricks I could use. I'll just add a new field in the
BasicIndexingFilter, where I replace the local file:/ URL with my publicUrl,
and display that on the search.jsp page. I think this would be a good
general feature: many intranets have local file repositories that are
more efficient to index locally but should be made available via an HTTP
service. My logs repository has a new folder for each day, so I just have a
cron job feed the file URLs directly into Nutch (no need to crawl).
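
For what it's worth, the rewrite itself could look roughly like the sketch
below. It is plain Java, not the actual BasicIndexingFilter code: the class
name, the two root constants, and the example.com host are made up for
illustration, and a Map stands in for the index document. The idea is only
that the original url field stays untouched and publicUrl is added next to it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive a public HTTP URL from a local file: URL at indexing time. */
final class PublicUrlRewriter {
    // Hypothetical roots; substitute the real repository path and web server.
    static final String LOCAL_ROOT = "file:/logs/";
    static final String PUBLIC_ROOT = "http://logs.example.com/logs/";

    /** Returns the public URL, or the input unchanged if it is not under LOCAL_ROOT. */
    static String toPublicUrl(String url) {
        if (url != null && url.startsWith(LOCAL_ROOT)) {
            return PUBLIC_ROOT + url.substring(LOCAL_ROOT.length());
        }
        return url;
    }

    /** Stand-in for the index document: the fetched url is kept, publicUrl is added. */
    static Map<String, String> indexFields(String url) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);                    // still the local file: URL
        fields.put("publicUrl", toPublicUrl(url)); // what search.jsp would link to
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(indexFields("file:/logs/2006-06-17/app.log"));
    }
}
```

search.jsp would then read publicUrl for the result link instead of url.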

Thanks again,

roberto


On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:

Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
URL in the database, thus affecting (re)fetching of your log files.
Thus this might not be the way to go.

Instead you might want to change the BasicIndexingFilter where the URL
is indexed so that the change only affects the value stored in the
Index - which is what you want.
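
To illustrate the difference, here is a rough sketch in plain Java
(java.util.regex only, not the Nutch normalizer API). The rule, the host
name, and the class name are hypothetical. It shows that a normalizer-style
substitution replaces the URL itself, so whatever is stored and later
fetched is the rewritten http URL rather than the local file.

```java
import java.util.regex.Pattern;

/** Sketch: what a regex substitution applied at normalization time does to a URL. */
final class NormalizerEffectSketch {
    // Hypothetical rule: map the local log root onto an assumed web server root.
    static final Pattern RULE = Pattern.compile("^file:/+logs/");
    static final String SUBSTITUTION = "http://logs.example.com/logs/";

    /** Applies the rule once; URLs that do not match are returned unchanged. */
    static String normalize(String url) {
        return RULE.matcher(url).replaceFirst(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String seed = "file:///logs/2006-06-17/app.log";
        // After normalization only the rewritten form is left, so this is
        // the URL a later (re)fetch would use.
        String fetchKey = normalize(seed);
        System.out.println("seed:      " + seed);
        System.out.println("fetch key: " + fetchKey);
    }
}
```

Doing the same substitution only when the index field is written leaves the
file: URL in place for fetching.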

You have to figure it out, basically :)

Rgrds, Thomas

On 6/17/06, TDLN <[EMAIL PROTECTED]> wrote:
> Yes, correct, the summary is retrieved from parse_text and not from the
> content, so it is not affected by this property. I should have checked
> that, but I am happy that it works so far.
>
> Regarding the URL: I am not 100% sure it will serve your needs, but
> you can investigate usage of the
> org.apache.nutch.net.RegexUrlNormalizer.
>
> Rgrds, Thomas
>
> On 6/17/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > Thanks, I didn't see the fetcher.store.content attribute in 0.7, so I
> > updated to 0.8-dev. It worked as advertised, and it seems like summary
> > and search context still work. The only thing affected was the cached
> > view of the file. I didn't restrict http.content.limit because I do
> > want all of the log files indexed.
> >
> > My log files are on one server's filesystem. I want to index them via a
> > local search of file:///logs but then present the URL link as coming
> > from an http root so that other users can fetch the files. Currently I
> > fetch them from the local webserver, but that's a little inefficient
> > since I know where the files are locally on the FS. Has anyone done a
> > local search but used http URLs for the search results?
> >
> > I could modify search.jsp to replace my file:// root with an http root,
> > but that seems a little hacky. Does anyone know if there is a regex-url
> > filter for post-processing of the link URLs? I tried using the regex-url
> > filter, but it modified the URL before the fetcher used it. I want to
> > apply the regex when the URL is entered into the index or when it is
> > displayed.
> >
> > Thanks,
> >
> > -roberto
> >
> >
> > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > >
> > > I mean disable the cache link in the search.jsp.
> > >
> > > On 6/15/06, TDLN <[EMAIL PROTECTED]> wrote:
> > > > As far as I know, content in the segments is used to generate the
> > > > summary in the search results and of course for the cache feature.
> > > >
> > > > If you don't need these you can adjust the fetcher.store.content and
> > > > http.content.limit config properties. Also you might have to change
> > > > search.jsp.
> > > >
> > > > Rgrds, Thomas
> > > >
> > > > On 6/15/06, Roberto Monge <[EMAIL PROTECTED]> wrote:
> > > > > I've been using Nutch to index production log files from a client
> > > > > application. It's been a great tool because we get a large volume
> > > > > of logs from the field and often have to do complicated pattern
> > > > > searches. Lately we've been having some issues managing our disk
> > > > > space. I noticed that Nutch keeps all of the content in the
> > > > > segments content folder. Is there a reason all of the content is
> > > > > stored? I didn't see any obvious setting for just indexing and not
> > > > > keeping the content.
> > > > >
> > > > > I do use the more search plugins to do filtering by date and URL.
> > > > > Maybe these require the content in the content folders? Any help
> > > > > would be much appreciated.
> > > > >
> > > > > Roberto
> > > > >
> > > > >
> > > >
> > >
> >
> >
>

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
