Hi Chip,
If you have not had a look at the JIRA link Julien provided, please do; it makes
more sense with that context.
As you correctly identified, in many cases the pages are not owned by 'us',
so it is unlikely a web administrator will add meta tags willy-nilly just
because we ask. As far as I am
Hi Julien,
Thanks for clarifying this! I've got it working now. Instead of seeding with a
proper tab-delimited file created in Excel, I had been wrong-headedly seeding it
with a plain text file that just had tabs typed into it. The two look the same,
but the difference matters. Thanks!
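For reference, the kind of seed line that works is roughly this (the URL and the
keys are just examples, and <TAB> stands for a literal tab character):

    http://www.example.com/<TAB>nutch.score=10<TAB>customKey=someValue

If I understand the injector correctly, each key=value pair after the URL should
end up as metadata on the injected CrawlDatum.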
Chip
-----Original Message-----
Hi!
I am trying to write additional metadata to my CrawlDB. This metadata has to be
extracted from the URLs BEFORE they get normalized (via the regex
urlnormalizer).
Here are some sample URLs:
1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2FSERVERID=ZF@@002@@ZF
2.
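For context, the normalizer rule that strips such a token would be something
along these lines in conf/regex-normalize.xml (simplified, not our exact
pattern):

    <regex>
      <pattern>SERVERID=ZF@@\d+@@ZF</pattern>
      <substitution></substitution>
    </regex>

so the SERVERID value is gone by the time the URL reaches the CrawlDB, which is
why I need to grab it beforehand.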
Hello,
I wondered whether it is possible to restart a failed job in Nutch 1.3.
I have this error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/
after fetching for 5 days. I know the reason for the error, but do not want to
restart the whole fetch from scratch.
Hi,
You just ran out of luck, I'm afraid. A failed fetch cannot be resumed in 1.x:
the segment files are not appendable (write once/read many). It is also a good
idea to run smaller segments, so that a failure costs you less work.
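For example, capping each generated segment with -topN keeps any single fetch
run short (the paths and the count below are just placeholders):

    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    bin/nutch fetch crawl/segments/<segment>

That way a crash loses you hours of fetching rather than days.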
Cheers,