[jira] [Commented] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'

J. Gobel (JIRA) Tue, 01 Jan 2013 10:22:13 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541896#comment-13541896
 ]


J. Gobel commented on NUTCH-1511:
---------------------------------

Hi Kiran,

I never got it to work in Solr4. No matter what I tried, the metadata fields 
never show up in Solr4. Do you index using HBase or MySQL? If time allows, 
please try it with MySQL.

Just create the table below in MySQL. Alternatively, for a more thorough 
explanation, check the guide at http://nlp.solutions.asia/?p=180

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

rgds,

Jaap
                
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch
>
> After applying the patch for the metadata parser (NUTCH-1478), I noticed that 
> just before the crawl ends, the metadata field is populated with the correct 
> information. However, when the crawl is completely finished, the metadata field 
> is populated with 'garbage' _csh_����� 
> The last few lines of my logfile are below.
> P.S. I use: bin/nutch crawl urls -depth 1 -topN 5 ..
> 2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: 
> org.apache.nutch.crawl.MD5Signature
> 2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing 
> http://nutch.apache.com/
> 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots 
> index, follow
> 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords 
> .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, 
> nic, extention, icann
> 2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : 
> description Registreer nu uw .com.nl of .net.nl extentie.
> 2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for 
> scope 'outlink', using default
> 2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null 
> in cleanup
> 2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - 
> gora.buffer.read.limit = 10000
> 2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - 
> gora.buffer.write.limit = 10000
> 2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule 
> impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - 
> defaultInterval=2592000
> 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null 
> in cleanup

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
