You'd have to do this in Perl, C, or Java.  I wouldn't suggest ColdFusion.
If you can't get a list of the .gov, etc. domains, the only thing you can do
is seed the spider with an existing site you know of, hope it links to
others (it probably does), and visit those links too.  As for the word
matching, it wouldn't be too hard, but you'd have to come up with many
pattern variations to match.  You're on the right track with regex
matching, but I don't think this project is worth the time.  If this MUST be
done, you might get lucky and get it done faster with a spider, but I highly
doubt it.
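
To make the "seed it and follow the links" idea concrete, here is a rough
sketch of a breadth-first spider that only follows links staying inside the
target domains. It is written in Python purely for brevity (the same shape
works in Perl or Java). The names TARGET_SUFFIXES, LinkCollector, in_scope
and crawl are made up for this example, and fetch is an assumed callable
that returns a page's HTML or None, so the download code can be swapped in
separately.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: these suffixes stand in for the ".gov, etc." domains in scope.
TARGET_SUFFIXES = (".gov", ".mil", ".state.us")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope(url):
    """True if the URL is http(s) and its host ends in a target suffix."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and host.endswith(TARGET_SUFFIXES)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from one seed URL.

    fetch(url) is a hypothetical callable returning HTML text, or None
    when the page cannot be retrieved.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if in_scope(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A real run would pass a fetch function that wraps an HTTP client and
respects robots.txt and rate limits. Coverage still depends entirely on
whether the seed site actually links out to the other domains.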

: (
Blake Miller
[EMAIL PROTECTED]
www.crackheaded.com


----Original Message Follows----
From: "Hinojosa, Robert A" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
Subject: OT: Project help.
Date: Wed, 29 Aug 2001 14:53:36 -0400

I have a potential project that I will be working on and I need some advice
as to how to go about it, if at all.

This project is for a government agency looking to ensure that government
sites follow the Privacy Act of 1974.  So what they are asking is to spider
through all of the .gov, .mil, and .state.us sites and see whether or not
they request personally identifying information that the government could
use to track a single person by that identifier, which, according to the
act, is illegal.

REQUESTS:
1. All websites that ask for Social Security Numbers, Medical ID, EFT #, and
so forth as input.
2. All websites that check for persistent cookies on these sites.
3. All websites that use advanced marketing techniques to track the user
(e.g. DoubleClick).
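
Requests 2 and 3 can be approximated mechanically. The sketch below, in
Python, is only an illustration under stated assumptions: it treats a cookie
as persistent when its Set-Cookie header carries an Expires or Max-Age
attribute, and it flags a page when a src or href attribute points at a host
on a small tracker list. TRACKER_HOSTS, is_persistent_cookie and
tracker_references are names invented for this example, and the one-entry
tracker list is a placeholder, not a complete inventory.

```python
import re
from urllib.parse import urlparse

# Assumption: an illustrative list; a real audit would need a longer one.
TRACKER_HOSTS = ("doubleclick.net",)

# Pulls the value out of src=... or href=... attributes, quoted or not.
ATTR_URL = re.compile(r'(?:src|href)\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def is_persistent_cookie(set_cookie_header):
    """True if the header value has an Expires or Max-Age attribute.

    Simplification: an Expires date in the past (a cookie deletion)
    is still counted as persistent here.
    """
    attrs = [part.strip().lower() for part in set_cookie_header.split(";")[1:]]
    return any(a.startswith(("expires=", "max-age=")) for a in attrs)


def tracker_references(html):
    """Return the URLs in src/href attributes whose host is a listed tracker."""
    hits = []
    for url in ATTR_URL.findall(html):
        host = (urlparse(url).hostname or "").lower()
        if any(host == t or host.endswith("." + t) for t in TRACKER_HOSTS):
            hits.append(url)
    return hits
```

This only sees what is in the raw response. Cookies set from script, or
trackers loaded by script at run time, would be missed.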

MY QUESTIONS:
1. Is there a way, besides using Network Solutions, to find a list of all
the .mil, .gov, and .state.us domains?  Could I maybe use a DNS server's
database for this information?

2.  Will this even be a feasible task in your opinion, especially for the
information requested in #1?  With the number of forms, Flash, and
server-side validation on these sites, do you think there would be a way to
report *RELIABLE* statistics on these sites?  I think requests #2 and #3 are
easy to look for.  Request #1 is the one I'm unsure of, because SSNs can be
asked for in lots of ways.  Drilling down through a five-step form to where
the SSN is being asked for is nearly impossible with server-side validation.

3.  What would be the best technology to use in such a scenario?  I wish I
could use CF, but I truly think this has to be written in Java or C++ for
multithreading, and of those I'm only proficient in Java.  Unless you
think CF is the best for this, or a combination of both.

4.  Anyone know if Java has a Regular Expression Package?
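
On the point in question 2 that SSNs can be asked for in lots of ways, a
pattern along these lines gives a feel for how many variations are
involved. It is a minimal sketch in Python using Perl-style regex syntax.
SSN_PROMPT and looks_like_ssn_prompt are names made up for this example,
and the alternatives listed are guesses at common wordings, not a tested
inventory.

```python
import re

# Assumption: these alternatives are illustrative, not exhaustive.
SSN_PROMPT = re.compile(
    r"""
    (?<![a-z])                      # not in the middle of a longer word
    (?:
        social\s+security           # "Social Security", "Social Security Number"
      | soc\.?\s*sec\.?(?![a-z])    # "Soc. Sec.", "Soc Sec"
      | ss[\s_-]?n(?![a-z])         # "SSN", "SS-N", field names like "ssn_field"
      | ss\s*\#                     # "SS#", "SS #"
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def looks_like_ssn_prompt(text):
    """True if a label, field name, or page fragment appears to ask for an SSN."""
    return SSN_PROMPT.search(text) is not None
```

The letter checks on either side keep "className" from matching, but they
also skip run-together field names such as "txtSSN". Tuning those edges
against real pages is where most of the effort would go.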

Thanks for all your help,

Robert Hinojosa
[EMAIL PROTECTED]
972.243.4343 x7446


-------------------------------------------------------------------------
This email server is running an evaluation copy of the MailShield anti-
spam software. Please contact your email administrator if you have any
questions about this message. MailShield product info: www.mailshield.com

-----------------------------------------------
To post, send email to [EMAIL PROTECTED]
To subscribe / unsubscribe: http://www.dfwcfug.org


