Re: Machine readable vs. human readable URLs.

2011-09-20 Thread lewis john mcgibbney
Hi Chip, If you have not had a look at the JIRA link provided by Julien then please do; it makes more sense. As you correctly identified, in many cases the pages are not owned by 'us', so it is unlikely a web administrator will add meta tags willy-nilly at our request. As far as I am

RE: Machine readable vs. human readable URLs.

2011-09-20 Thread Chip Calhoun
Hi Julien, Thanks for clarifying this! I've got it working now. Instead of seeding with a proper tab-delimited file created in Excel, I had been wrong-headedly seeding it with a text file that just had tabs in it. They look the same, but it makes a difference. Thanks! Chip
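
[For reference, a minimal sketch of a valid seed line for the 1.x Injector. Fields must be separated by real tab characters, not spaces that merely look like tabs in an editor. nutch.score and nutch.fetchInterval are the standard per-URL keys; "collection" is an illustrative custom metadata key, not something Nutch defines:]

  http://www.example.com/	nutch.score=1.0	nutch.fetchInterval=2592000	collection=archive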

Extract data from URL before normalization

2011-09-20 Thread Alexander Fahlke
Hi! I am trying to write additional metadata to my CrawlDB. This data has to be extracted from the URLs BEFORE they get normalized (via regex urlnormalizer). Here are some sample URLs: 1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2FSERVERID=ZF@@002@@ZF 2.
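
[For what it's worth, a minimal, self-contained Java sketch of that kind of extraction. The ServerIdExtractor class name and the decode-then-match approach are illustrative assumptions, not a Nutch API; in a real setup this logic would sit wherever the raw URL is still available, i.e. before the regex urlnormalizer runs:]

  import java.io.UnsupportedEncodingException;
  import java.net.URLDecoder;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ServerIdExtractor {

      // Capture everything after "SERVERID=" up to the next '&' or end of URL.
      private static final Pattern SERVER_ID = Pattern.compile("SERVERID=([^&]+)");

      /** Returns the SERVERID token from a raw (still-encoded) URL, or null. */
      public static String extract(String rawUrl) throws UnsupportedEncodingException {
          String decoded = URLDecoder.decode(rawUrl, "UTF-8");
          Matcher m = SERVER_ID.matcher(decoded);
          return m.find() ? m.group(1) : null;
      }

      public static void main(String[] args) throws Exception {
          String url = "https://www.example.com/customers/cards/index.php"
                     + "?n=%2Fcustomers%2Fcards%2FSERVERID=ZF@@002@@ZF";
          System.out.println(extract(url));   // prints ZF@@002@@ZF
      }
  }

[The point of working on the raw URL is that a normalizer rule will typically strip the SERVERID parameter entirely, so the value has to be captured and stored, e.g. as CrawlDatum metadata, before normalization runs.]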

restart a failed job

2011-09-20 Thread alxsss
Hello, I wondered if it is possible to restart a failed job in the nutch-1.3 version. After fetching for 5 days I got this error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/. I know the reason for the error, but do not want to restart the

Re: restart a failed job

2011-09-20 Thread Markus Jelsma
Hi, You just ran out of luck. A failed fetch cannot be resumed in 1.x; the segment files are not appendable (write once, read many). It's also a good idea to run smaller segments, as in the sketch below. Cheers,
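
[On the smaller-segments suggestion: in 1.x the usual knob is -topN at generate time, which caps how many URLs go into each segment. Paths here are illustrative:]

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000

[A five-day fetch then becomes a series of much shorter fetch cycles, so a failure like the one above costs hours of work instead of days.]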