Yes, the document creation is out of my hands.

In addition, the HTML documents may not come from a single web site. The
number of web sites is dynamic. Even in the case of a single web site,
there are different apps, each with its own layout, etc.

So I am not sure whether longest common prefix/suffix would work. Any
further thoughts on this?
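
To make the concern concrete, here is a minimal sketch of the longest
common prefix/suffix idea as I understand it. It is written in Python
only for brevity; the function name strip_common and the sample
documents are hypothetical, and the comparison is character by
character, which is an assumption on my part rather than anything PA
specified.

```python
import os


def strip_common(docs):
    """Remove the longest common prefix and suffix from each document.

    Hypothetical helper: compares raw characters, not HTML structure.
    """
    if len(docs) < 2:
        return list(docs)
    # os.path.commonprefix compares strings character by character.
    pre = len(os.path.commonprefix(docs))
    # Strip the prefix first so prefix and suffix can never overlap.
    rest = [d[pre:] for d in docs]
    # The common suffix is the common prefix of the reversed strings.
    suf = len(os.path.commonprefix([r[::-1] for r in rest]))
    return [r[:len(r) - suf] for r in rest]


# Two pages that share one layout: the navbar and footer drop out.
same_layout = [
    "<nav>Home</nav><p>apples</p><footer>(c)</footer>",
    "<nav>Home</nav><p>cherry</p><footer>(c)</footer>",
]
print(strip_common(same_layout))

# Add one page from a different layout: almost nothing is shared any
# more, so the navbar text stays in what would be indexed.
mixed_layout = same_layout + ["<div id='menu'>Shop</div><p>plums</p>"]
print(strip_common(mixed_layout))
```

With pages from one layout this leaves only the body text, but as soon
as a page from another site or app is in the set, the shared prefix
and suffix shrink to almost nothing. That is why I doubt it would work
here unless the documents were somehow grouped by layout first.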

thanks & regards
Ramesh

----- Original Message -----
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, December 30, 2002 7:36 PM
Subject: Re: Heuristics on searching HTML Documents ?


>
> On Monday, Dec 30, 2002, at 15:01 Europe/Zurich, Erik Hatcher wrote:
>
> > If you have control over the HTML, how about marking the navbar pieces
> > with a certain CSS class and then filtering that out from what you
> > index?  It seems like that would be a reasonable way to filter it -
> > but this is of course provided it's your HTML and not someone else's.
>
> Alternatively, if the document creation is out of your hands, you
> could try to compute the longest common prefix/suffix of a set of
> documents and discount that from your indexing.
>
> PA.
>
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
>

