On Monday 23 February 2004 03:02 pm, Axel IS Main wrote:
> That's not bad, but I found a way to do it simply using chr() and
> passing it a value. It turns out that if I go 0-31, almost nothing will
> get through. Even the simplest HTML has something in there from that
> list. However, by just looking between 14 and 26, one more than carriage
> return and one less than escape, it worked really well. I crawled a
> site with a large number of jpg, gif, mp3, wav, and pdf files. Of the
> hundreds of binaries there, only one pdf got through. Not a bad record. I

It should be noted that a PDF isn't necessarily binary. It's just that most
people like to use compression and embed images, sounds, etc. But if you want
to, you can fire up emacs and create a PDF from scratch. So really the record
is better than you think ;)

> also found that in order for this to work I have to process the URLs.
> This makes things really slow so I'm going to have to use both this and
> the "check for extension" function together. Still, I can worry a lot
> less about getting my index weighed down by binary files. The code is
> pretty basic at this point, but here it is:
>
>     // Check for binaries: look for any byte from 14 to 26
>     // (just above carriage return, just below escape).
>     $ckbin = 14;
>     while ($ckbin <= 26) {
>         $ck = chr($ckbin);
>         $cbin = substr_count($read, $ck);
>         if ($cbin > 0) {
>             echo "Killing off binary file URL: $url\n";
>             $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
>             // Skip to the next URL in the enclosing crawl loop.
>             continue 2;
>         }
>         ++$ckbin;
>     }
> I know it looks kind of funky out of context, but it works really great.
>
> Nick
>
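
For what it's worth, the 13 separate substr_count() passes could probably be
collapsed into one scan. This is only a rough sketch, not your crawler code:
looks_binary() is a name I made up, and it assumes the page body is already
in a string, like your $read. It checks the same 14-26 range using a
character class.

```php
<?php
// Hypothetical helper (not from the original post): returns true when
// $data contains any byte in the range 14..26 (0x0E..0x1A).
function looks_binary($data)
{
    // One regex scan replaces the 13 separate substr_count() passes.
    return preg_match('/[\x0E-\x1A]/', $data) === 1;
}

var_dump(looks_binary("<html>\r\n\t<p>hello</p></html>")); // bool(false)
var_dump(looks_binary("GIF89a\x10\x00\x10\x00"));           // bool(true)
```

It stops at the first hit, so it should be cheaper on large files, though I
haven't timed it against the loop.
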
> Richard Davey wrote:
> >Hello Evan,
> >
> >Monday, February 23, 2004, 8:57:43 PM, you wrote:
> >>>It would be wise to check for characters from 0 to 31, if they appear
> >>>then it's almost certainly (but not guaranteed) binary.
> >
> >EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
> >EN> respectively, tab, line feed, and carriage return. That's off the top of my
> >EN> head, which means two things: (1) I may be forgetting something, and (2) I
> >EN> need a life ;)
> >
> >Let me rephrase - check for the existence of characters 0 through 31
> >and count how many there are. Set a percentage weight yourself and
> >figure out in your script if you deem the quantity too many or too
> >few.
> >
> >The count_chars() function will be absolutely ideal for this.
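
A minimal sketch of that weighting idea, in case it helps. The function name
control_char_ratio() and the 5% threshold are my own assumptions, not
anything tested against a real crawl. It skips tab, line feed, and carriage
return, since those show up in ordinary text.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control
// characters (0..31), ignoring tab, line feed, and carriage return.
function control_char_ratio($data)
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    // Mode 1 returns only the byte values that occur, keyed by byte value.
    $counts = count_chars($data, 1);
    $control = 0;
    foreach ($counts as $byte => $count) {
        if ($byte < 32 && $byte !== 9 && $byte !== 10 && $byte !== 13) {
            $control += $count;
        }
    }
    return $control / $length;
}

// The 5% threshold is an arbitrary starting point, not a tested value.
$threshold = 0.05;
$sample = "plain text\r\nwith a tab\there";
var_dump(control_char_ratio($sample) > $threshold); // bool(false)
```

You would have to tune the threshold against your own mix of files.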

-- 
Evan Nemerson
[EMAIL PROTECTED]
http://coeusgroup.com/en

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
