Is it possible to use the webdb (link db) to add rules that are applied as you recursively process your data based upon link analysis?
Such as adding a language field, so you can either score based on language (using language as the identifying "top" to build your link analysis from) or simply ignore a language, TLD, or some other field altogether in the generate process (essentially not refetching, pragmatically refetching, or scaling your fetch by some priority)?

Does Google have a PageRank per language, or is it based on the entire dataset? From what I can find, Teoma says there are 2 billion English-only pages and the rest are non-English. Is that a safe assumption, or just what Teoma has been able to process? I ask because it would mean simplifying processing into manageable per-language datasets, compared to trying the "whole internet" in a single dataset.

I also think it's more feasible for me to focus on English-language sites as a whole, since the cultural differences and laws are enough to shy me away from the legal messes other nations can potentially enforce or imply :)

-byron

--- Byron Miller <[EMAIL PROTECTED]> wrote:

> I know you can enable language detection during
> index-more; however, is there a method for doing this
> during the crawl?
>
> I'm interested in building an English-only index
> right now. What is the theory behind that? Does anyone
> have any experience?
>
> Would it be building a huge blacklist, ignoring TLDs
> until you find a computational method, or??? Thoughts,
> anyone?

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
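
The generate-time idea asked about above (skip some TLDs or languages entirely, and scale fetch priority for the rest) can be sketched roughly as follows. This is plain Python for illustration only, not Nutch's webdb API: `PageRecord`, `tld_of`, `generate`, the `language` field, and the weight table are all hypothetical names, and the sketch assumes a language has already been detected and stored per page.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical stand-in for a webdb page entry with an added language field."""
    url: str
    score: float        # link-analysis score
    language: str = ""  # e.g. "en"; empty means not yet detected


def tld_of(url: str) -> str:
    """Return the last label of the host name, e.g. 'de' for www.example.de."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def generate(pages, language_weights, blocked_tlds=frozenset(), top_n=10):
    """Build a fetch list: drop blocked TLDs and zero-weight languages,
    scale the remaining scores by language weight, return the top_n URLs."""
    candidates = []
    for page in pages:
        if tld_of(page.url) in blocked_tlds:
            continue
        if page.language:
            # Languages missing from the table are treated as weight 0 (ignored).
            weight = language_weights.get(page.language, 0.0)
        else:
            # Assumption: pages with no detected language are still fetched.
            weight = language_weights.get("", 1.0)
        if weight <= 0.0:
            continue
        candidates.append((page.score * weight, page.url))
    candidates.sort(reverse=True)
    return [url for _, url in candidates[:top_n]]


pages = [
    PageRecord("http://example.com/a", 1.0, "en"),
    PageRecord("http://example.de/b", 2.0, "de"),
    PageRecord("http://example.org/c", 0.5),
    PageRecord("http://example.fr/d", 3.0, "fr"),
]
fetch_list = generate(pages, {"en": 1.0, "fr": 0.1}, blocked_tlds={"de"})
print(fetch_list)
```

Here the `.de` page is dropped by the TLD blocklist, the French page is kept but pushed down by its low weight, and the page with no detected language is still eligible so it can be fetched and classified later. Whether a TLD blocklist is a good proxy for language is exactly the open question; the sketch only shows where such a rule would sit.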
