Hi Jaap,

It has worked for me with MySQL before. I am using HBase now, and that is working well too.
I am going to try working with MySQL to look into this issue, but I need a few more details:

Did you try crawling just the Nutch website, or anything more?
Did you define index.parse.md in nutch-site.xml and also add the fields to the Solr schema?
Did you restart Solr once you created the schema?
Which Nutch version are you using?
Did you check the Solr logs?

Thank you,
Kiran.

On Tue, Jan 1, 2013 at 1:22 PM, J. Gobel (JIRA) <[email protected]> wrote:

>
> [
> https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541896#comment-13541896]
>
> J. Gobel commented on NUTCH-1511:
> ---------------------------------
>
> Hi Kiran,
>
> I never got it to work in Solr4. No matter what I tried, the fields
> metadata never shows up in Solr4. Do you index using HBase or Mysql? If
> times allows, please try it with MYSQL.
>
> Just add the table below in MYSQL. Or alternatively for a more thorough
> explanation check the guide on http://nlp.solutions.asia/?p=180
>
> CREATE TABLE `webpage` (
> `id` varchar(767) NOT NULL,
> `headers` blob,
> `text` mediumtext DEFAULT NULL,
> `status` int(11) DEFAULT NULL,
> `markers` blob,
> `parseStatus` blob,
> `modifiedTime` bigint(20) DEFAULT NULL,
> `score` float DEFAULT NULL,
> `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
> `baseUrl` varchar(767) DEFAULT NULL,
> `content` longblob,
> `title` varchar(2048) DEFAULT NULL,
> `reprUrl` varchar(767) DEFAULT NULL,
> `fetchInterval` int(11) DEFAULT NULL,
> `prevFetchTime` bigint(20) DEFAULT NULL,
> `inlinks` mediumblob,
> `prevSignature` blob,
> `outlinks` mediumblob,
> `fetchTime` bigint(20) DEFAULT NULL,
> `retriesSinceFetch` int(11) DEFAULT NULL,
> `protocolStatus` blob,
> `signature` blob,
> `metadata` blob,
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB
> ROW_FORMAT=COMPRESSED
> DEFAULT CHARSET=utf8mb4;
>
> rgds,
>
> Jaap
>
>
> > Metadata in MYSQL updated with 'garbage'
> > ----------------------------------------
> >
> > Key: NUTCH-1511
> > URL: https://issues.apache.org/jira/browse/NUTCH-1511
> > Project: Nutch
> > Issue Type: Bug
> > Components: fetcher
> > Affects Versions: 2.1
> > Environment: Ubuntu 12.04
> > Reporter: J. Gobel
> > Labels: metadata, mysql, nutch
> >
> > After applying patch for Metadata parser (NUTCH-1478) I notice that the
> > metadata field just before the crawl ends is populated with the correct
> > information. However when the crawl is completely finished the metadata
> > field is populated with 'garbage' _csh_ �����
> > last few lines of my logfile:
> > p.s. I use : bin/nutch crawl urls -depth 1 -topN 5 ..
> > 013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
> > 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots index, follow
> > 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
> > 2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : description Registreer nu uw .com.nl of .net.nl extentie.
> > 2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> > 2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> > 2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> > 2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> > 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

--
Kiran Chitturi

