I added a few log statements to ParseOutputFormat.java: one after
'String toUrl =' and another before 'if (toUrl != null)'. Nutch came
across a URL that reached the first statement but never the second.

This means it is getting stuck between those two points (no exit or
error; eventually the process times out and is reattempted, only to fail
in exactly the same way).
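
As a rough sketch of how a silent hang like this could be turned into a
visible error, a suspect step can be run on a worker thread with a time
limit. This is not Nutch code: TimedStep, applyWithTimeout and the
UnaryOperator standing in for the normalize/filter calls are hypothetical
names used only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Nutch.
final class TimedStep {

    /**
     * Applies step to input on a worker thread.
     * Returns the result, or null if the step fails or does not finish
     * within timeoutMillis.
     */
    static String applyWithTimeout(UnaryOperator<String> step,
                                   String input,
                                   long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-step");
            t.setDaemon(true); // a stuck step must not keep the JVM alive
            return t;
        });
        Callable<String> task = () -> step.apply(input);
        Future<String> result = pool.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a step that ignores
            // interrupts will keep running in the background.
            result.cancel(true);
            System.err.println("Timed out after " + timeoutMillis
                + " ms on input of length " + input.length());
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return null;
        } catch (ExecutionException e) {
            System.err.println("Step failed: " + e.getCause());
            return null;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A timeout like this only reports the problem and lets the caller move on;
it does not stop a step that never checks for interruption.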

The URL it is trying to process at the time is very long and somewhat
convoluted. The thread is idle. Adding a restriction to skip URLs longer
than 512 characters seems to have solved it.
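
The length restriction can also be written as a standalone check, sketched
below. UrlLengthGuard, isAcceptable and dropOverlong are hypothetical names
and not part of Nutch; 512 is simply the limit I picked for the patch, not
a value Nutch defines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class UrlLengthGuard {

    // Limit used in the patch; an arbitrary choice, not a Nutch constant.
    static final int MAX_URL_LENGTH = 512;

    /** True if the URL is non-null and no longer than MAX_URL_LENGTH. */
    static boolean isAcceptable(String url) {
        return url != null && url.length() <= MAX_URL_LENGTH;
    }

    /** Returns the URLs that pass the length check, in their original order. */
    static List<String> dropOverlong(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isAcceptable(url)) {
                kept.add(url);
            } else {
                // Log only the length; the URL itself may be thousands of characters.
                System.err.println("Skipping URL of length "
                    + (url == null ? 0 : url.length()));
            }
        }
        return kept;
    }
}
```

Checking the length before normalizing or filtering means an overlong URL
never reaches the code that appears to hang on it.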

The offending URL, 4096 characters long:
http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::
1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::1111::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/:
:williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/


Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java      (revision 344015)
+++ ParseOutputFormat.java      (working copy)
@@ -56,7 +56,7 @@
 
         public void write(WritableComparable key, Writable value)
           throws IOException {
-          
+
           Parse parse = (Parse)value;
           
           textOut.append(key, new ParseText(parse.getText()));
@@ -73,6 +73,10 @@
           for (int i = 0; i < links.length; i++) {
             String toUrl = links[i].getToUrl();
             try {
+              if (toUrl.length() > 512) {
+                 throw new Exception("URL length too long: " + toUrl.length() + " characters");
+              }
+
               toUrl = urlNormalizer.normalize(toUrl); // normalize the url
               toUrl = URLFilters.filter(toUrl);   // filter the url
             } catch (Exception e) {

-- 
Rod Taylor <[EMAIL PROTECTED]>



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
