The txt field keeps only the beginning of BODY.
The original page is kept in the field urlwordsNN.words, packed by bzip.
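
Since a LIKE on txt only sees the start of the page, one option is to pull the packed pages out and scan them outside SQL. Below is a minimal Python sketch, assuming the words column holds a plain bzip2 stream that the standard bz2 module can read; the real on-disk format may differ. The names page_contains and flag_urls, and the latin-1 default, are hypothetical choices for illustration, not part of aspseek.

```python
import bz2


def page_contains(blob: bytes, needle: str, encoding: str = "latin-1") -> bool:
    """Return True if the unpacked page contains needle (case-insensitive).

    Assumption: blob is a plain bzip2 stream. Anything that fails to
    decompress is treated as "no match" rather than raising.
    """
    try:
        raw = bz2.decompress(blob)
    except (OSError, ValueError):
        # Not a bzip2 stream, or a truncated one.
        return False
    return needle.lower() in raw.decode(encoding, errors="replace").lower()


def flag_urls(rows, needle):
    """Return the url_ids whose packed page contains needle.

    rows is any iterable of (url_id, words_blob) pairs, for example the
    result of: SELECT url_id, words FROM urlwordsNN
    """
    return [url_id for url_id, blob in rows if blob and page_contains(blob, needle)]
```

The resulting list of url_id values could then be used to move those URLs into the "bad" space, in the same way as with the title/keywords/description queries.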

Alexander.

John Pinochet wrote:

> Note that I forgot to include the field txt in my post.  So, altogether I
> use the following SQL statement:
>
> SELECT url_id
> FROM urlwordsNN
> WHERE txt like '%porn%';
>
> In my WHERE clause I alternated between txt, keywords, description, and
> title.  Where are the words for "index body of HTML documents" stored?  What
> is the field name so I can use it in my WHERE clause?
>
> jp
> Santiago
>
> ----- Original Message -----
> From: "Kir Kolyshkin" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, January 25, 2002 2:45 AM
> Subject: Re: [aseek-users] Problems with Spam Search Engine listings
>
> > Note that aspseek is a full-text search engine, so it indexes the body of
> > HTML documents. The results you still found contain %porn% in the body.
> >
> > > Any ideas?  Basically I'd like to know where aspseek derives results from
> > > besides title, keyword and description, as I've eliminated all sites with
> > > %porn% by collecting them into a space, and creating a second space with
> > > non porn sites.  Trouble is, the space with non porn sites still has porn
> > > sites...but none of those remaining porn sites has %porn% in the title,
> > > keyword or description.
> > >
> > > jp
> > > Santiago
> > >
> > >
> > >      ----- Original Message -----
> > >      From: John Pinochet
> > >      To: [EMAIL PROTECTED]
> > >      Sent: Monday, January 21, 2002 7:03 PM
> > >      Subject: [aseek-users] Problems with Spam Search Engine listings
> > >
> > >      I'm having problems getting rid of spam listings.
> > >
> > >      In particular porn.
> > >
> > >      I've come up with a list of words and a series of SQL statements to
> > >      check for their occurrences in urlwordsXX, etc., but there must be a
> > >      better way.  "-" in the query won't do it either, as these people are
> > >      very crafty.  Besides, you can't have a query with hundreds of
> > >      'minused' words.
> > >
> > >      Why isn't there a very simple way to eliminate sites via a "bad word"
> > >      list?  Note I'm not talking about prior to indexing.  I'm talking
> > >      about post index.  Adult word filter.
> > >
> > >      Also, even after I've eliminated all traces of %porn%, %Porn%, and
> > >      %PORN% from the database via a comparison query to urlwords00 -
> > >      urlwords15 (title, description, keywords), I still have thousands of
> > >      websites with %porn%, %PORN%, and %Porn%, albeit none of the remaining
> > >      websites have that in their title, description, or keywords, so at
> > >      least my 'cleaning' is almost working.
> > >
> > >      Where is this string occurring then, if not in title, description, or
> > >      keywords?
> > >
> > >      Note that for testing purposes all I did was create two webspaces:
> > >      one porn free (%porn% not found in keywords, description, or title)
> > >      and the other only porn.  When I search the porn free space, I STILL
> > >      have occurrences of the above string.
> > >
> > >      jp
> > >      Santiago
> >
> > --
> > [EMAIL PROTECTED]  ICQ 7551596  Phone +7 903 6722750
> > Hard work may not kill you,  but why take chances?
> > --
> >
