The txt field keeps only the beginning of the BODY. The original page is kept in the field urlwordsNN.words, compressed with bzip.
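Because the full page is compressed, a LIKE clause cannot match against urlwordsNN.words directly; the blob has to be decompressed first. A minimal sketch of that idea in Python follows. It assumes each words value is a plain bzip2 stream with no extra header, which may not match the real on-disk layout, and that the rows have already been fetched as (url_id, words) pairs. The helper names body_contains and flag_urls are hypothetical, not part of aspseek.

```python
import bz2


def body_contains(words_blob: bytes, needle: str, encoding: str = "latin-1") -> bool:
    """Return True if the decompressed page contains needle (case-insensitive).

    Assumption: words_blob is a plain bzip2 stream. If it is not, the row is
    reported as a non-match instead of raising.
    """
    try:
        page = bz2.decompress(words_blob)
    except (OSError, ValueError):
        # Not a valid or complete bzip2 stream: the assumption does not hold here.
        return False
    text = page.decode(encoding, errors="replace")
    return needle.lower() in text.lower()


def flag_urls(rows, needle):
    """rows: iterable of (url_id, words_blob) pairs, e.g. fetched from urlwordsNN.

    Returns the url_id values whose decompressed body contains needle.
    Rows with an empty or NULL blob are skipped.
    """
    return [url_id for url_id, blob in rows if blob and body_contains(blob, needle)]
```

The url_id values returned by flag_urls could then be used the same way as the ones from the SELECT on txt, title, keywords, and description.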
Alexander.

John Pinochet wrote:
> Note that I forgot to include the field txt in my post. So, altogether I
> use the following SQL statement:
>
> SELECT url_id
> FROM urlwordsNN
> WHERE txt like '%porn%';
>
> In my WHERE clause I alternated between txt, keywords, description, and
> title. Where are the words for "index body of HTML documents" stored? What
> is the field name so I can use it in my WHERE clause?
>
> jp
> Santiago
>
> ----- Original Message -----
> From: "Kir Kolyshkin" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, January 25, 2002 2:45 AM
> Subject: Re: [aseek-users] Problems with Spam Search Engine listings
>
> > Note that aspseek is a full-text search engine, so it indexes the body
> > of HTML documents. The results you still found contain %porn% in the
> > body.
> >
> > > Any ideas? Basically I'd like to know where aspseek derives results
> > > from besides title, keyword and description, as I've eliminated all
> > > sites with %porn% by collecting them into a space, and creating a
> > > second space with non porn sites. Trouble is, the space with non
> > > porn sites still has porn sites...but none of those remaining porn
> > > sites has %porn% in the title, keyword or description.
> > >
> > > jp
> > > Santiago
> > >
> > > ----- Original Message -----
> > > From: John Pinochet
> > > To: [EMAIL PROTECTED]
> > > Sent: Monday, January 21, 2002 7:03 PM
> > > Subject: [aseek-users] Problems with Spam Search Engine listings
> > >
> > > I'm having problems getting rid of spam listings.
> > >
> > > In particular porn.
> > >
> > > I've come up with a list of words and a series of SQL statements to
> > > check for their occurrences in urlwordsXX, etc etc, but there must be
> > > a better way. "-" in the query won't do it either as these people are
> > > very crafty. Besides, you can't have a query with hundreds of
> > > 'minused' words.
> > >
> > > Why isn't there a very simple way to eliminate sites via a "bad word"
> > > list? Note I'm not talking about prior to indexing. I'm talking about
> > > post index. Adult word filter.
> > >
> > > Also, even after I've eliminated all traces of %porn%, %Porn%, and
> > > %PORN% from the database via a comparison query to urlwords00 -
> > > urlwords15 (title, description, keywords), I still have thousands of
> > > websites with %porn%, %PORN%, and %Porn%, albeit none of the
> > > remaining websites have that in their title, description, or
> > > keywords, so at least my 'cleaning' is almost working.
> > >
> > > Where is this string occurring then if not in title, description, or
> > > keywords?
> > >
> > > Note that for testing purposes all I did was create two webspaces:
> > > one porn free (%porn% not found in keywords, description, or title)
> > > and the other only porn. When I search the porn free space, I STILL
> > > have occurrences of the above string.
> > >
> > > jp
> > > Santiago
> >
> > --
> > [EMAIL PROTECTED] ICQ 7551596 Phone +7 903 6722750
> > Hard work may not kill you, but why take chances?
> > --
