Hi guys,
I have a very specific search engine need.
I have run across Nutch and it sounds very promising.
Thanks for your hard work on it.
Here is what I need to do:
I need to be able to search for all the variations of a domain
URL text string.
An example of what I need to do is found in my home business on
the Internet.
The URL is www.RetireQuickly.com/62798, with www.RetireQuickly.com
being the domain URL and the "/62798" being the variable part of the
URL that is an ID unique to me. All 50,000+ Retire Quickly reps have
this URL, www.RetireQuickly.com, followed by 2 to (currently) 5 digits
that make the replicated site unique to the representative who "owns"
it. When I search Google for www.RetireQuickly.com I get around 388
results (many duplicated), when I know there should be at least 50,000
www.RetireQuickly.com URLs with unique rep IDs.
Another example would be www.NewVision.net followed by a "/" and then
numbers and/or letters for the rep ID. My research indicates there are
around 600,000 representatives for the company, which provides
replicated sites for them. Yet a Google search only turns up a few
hundred, with many duplicate links in the search.
So my question is this:
Does there currently exist a search utility (prepackaged) that will find
ALL the variations of links to a search domain URL? Would Nutch be a
good candidate for this type of search? Would it take a lot of scripting
and/or programming to make it do this?
The search that I need is simple even though the programming behind it
may not be. I need to be able to search the entire Internet for all the
variations of a specified domain while rejecting duplicates of unique
links that have already been found. Obviously, this is an ongoing
"harvesting" project.
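To make the harvesting requirement concrete, here is a minimal Python
sketch. It is not tied to Nutch or any existing tool, and it assumes the
page text has already been fetched by some crawler. The pattern, the
function name harvest_rep_ids, and the sample strings are hypothetical;
they only illustrate matching the domain plus a 2-to-5 digit ID while
skipping IDs that have already been found.

```python
import re

# Hypothetical pattern: the domain, a "/", then a rep ID of 2 to 5 digits.
# The lookahead rejects longer digit runs and IDs that continue with letters.
REP_URL = re.compile(
    r"(?:https?://)?(?:www\.)?retirequickly\.com/(\d{2,5})(?![\w-])",
    re.IGNORECASE,
)


def harvest_rep_ids(texts, seen=None):
    """Return rep IDs not seen before, in the order they are first found.

    `seen` is a set that persists between calls, so an ongoing harvest
    can keep rejecting IDs collected in earlier runs.
    """
    seen = set() if seen is None else seen
    new_ids = []
    for text in texts:
        for match in REP_URL.finditer(text):
            rep_id = match.group(1)
            if rep_id not in seen:
                seen.add(rep_id)
                new_ids.append(rep_id)
    return new_ids


if __name__ == "__main__":
    pages = [
        "Visit www.RetireQuickly.com/62798 today",
        "http://retirequickly.com/62798 and www.retirequickly.com/417",
    ]
    print(harvest_rep_ids(pages))
```

Passing the same `seen` set to every call is what lets a repeated
harvest report only the links it has not found before.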
Any help or suggestions that you could give me would be most appreciated.
Thank you,
Sam Peeples
423-265-7038
