On Monday 23 February 2004 03:02 pm, Axel IS Main wrote:
> That's not bad, but I found a way to do it simply using chr() and
> passing it a value. It turns out that if I go 0-31 almost nothing will
> get through. Even the simplest HTML has something in there from that
> list. However, by just looking between 14 and 26, one more than carriage
> return, and one less than escape, it worked really well. I crawled a
> site with a large number of jpg, gif, mp3, wav, and pdf files. Of the
> 100's of binaries there only one pdf got through. Not a bad record. I

It should be noted that a PDF isn't necessarily binary. It's just that
most people like to use compression and embed images, sounds, etc. But
if you want to, you can fire up emacs and create a PDF from scratch. So
really the record is better than you think ;)

> also found that in order for this to work I have to process the URLs.
> This makes things really slow so I'm going to have to use both this and
> the "check for extension" function together. Still, I can worry a lot
> less about getting my index weighted down by binary files. The code is
> pretty basic at this point, but here it is:
>
> // Check for binaries
> $ckbin = 14;
> while($ckbin <= 26){
>     $ck = chr($ckbin);
>     $cbin = substr_count($read, $ck);
>     if($cbin > 0){
>         echo "Killing off binary file URL: $url\n";
>         $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
>         continue 2;
>     }
>     ++$ckbin;
> }
> I know it looks kind of funky out of context, but it works really great.
>
> Nick
>
> Richard Davey wrote:
> >Hello Evan,
> >
> >Monday, February 23, 2004, 8:57:43 PM, you wrote:
> >>>It would be wise to check for characters from 0 to 31, if they appear
> >>>then it's almost certainly (but not guaranteed) binary.
> >
> >EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which
> >EN> are, respectively, tab, line feed, and carriage return. That's off
> >EN> the top of my head, which means two things: (1) I may be forgetting
> >EN> something, and (2) I need a life ;)
> >
> >Let me rephrase - check for the existence of characters 0 through 31
> >and count how many there are. Set a percentage weight yourself and
> >figure out in your script if you deem the quantity too many or too
> >few.
> >
> >The count_chars() function will be absolutely ideal for this.

--
Evan Nemerson
[EMAIL PROTECTED]
http://coeusgroup.com/en

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
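
P.S. For reference, here is a minimal sketch of the percentage idea Richard
describes above, using count_chars(). The function name looks_binary() and
the 5% default cutoff are my own assumptions, not anything from the thread,
and the cutoff has not been measured against real data. It counts bytes
0-31 but skips tab, line feed, and carriage return, since those show up in
ordinary text and HTML.

```php
<?php
// Hypothetical helper (not from the thread): guesses whether $data is
// binary from the share of control characters it contains.
// $threshold is an assumed cutoff; tune it against your own crawl data.
function looks_binary($data, $threshold = 0.05)
{
    $length = strlen($data);
    if ($length === 0) {
        return false; // nothing to judge, so treat empty input as text
    }

    // Mode 1 returns only the byte values that occur, as value => count.
    $counts = count_chars($data, 1);

    $control = 0;
    foreach ($counts as $byte => $count) {
        // Bytes 0-31 are control characters; skip tab (9), line feed (10)
        // and carriage return (13), which are normal in text and HTML.
        if ($byte < 32 && $byte !== 9 && $byte !== 10 && $byte !== 13) {
            $control += $count;
        }
    }

    // Flag the data when control characters exceed the chosen share.
    return ($control / $length) > $threshold;
}
```

Something like `if (looks_binary($read)) { ... }` could then stand in for
the chr() loop. Because it weighs the count against the length instead of
rejecting on the first hit, a stray control character in an otherwise
plain page would not be enough to drop it.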