[jira] [Updated] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'

J. Gobel (JIRA) Wed, 02 Jan 2013 07:44:12 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


J. Gobel updated NUTCH-1511:
----------------------------

    Description: 
After applying the patch for the Metadata parser (NUTCH-1478), I notice that 
the metadata field is populated with the correct information just before the 
crawl ends. However, once the crawl has completely finished, the metadata 
field contains 'garbage': _csh_����� 

Update: my SQL log file shows that the scoring plugin is updating the 
metadata field with '_csh_ \0\0\0\0\'. When I remove 'scoring-opic' from the 
'plugin.includes' property in nutch-site.xml, the metadata field is clean 
and readable.

I run the crawl with: bin/nutch crawl urls -depth 1 -topN 5 ..

The last few lines of my log file:


2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: 
org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots 
index, follow
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords 
.com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, 
extention, icann
2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : 
description Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for 
scope 'outlink', using default
2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null 
in cleanup
2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - 
gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null 
in cleanup
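
One possible reading of the '_csh_ \0\0\0\0' value reported in the update 
above, offered as an assumption rather than something this report confirms: 
if the scoring plugin stores a numeric value under a '_csh_' key as four raw 
bytes (for example a big-endian 32-bit float), then four zero bytes would 
simply be 0.0, and would only look like garbage when the column is viewed as 
text. A minimal Python sketch of that interpretation follows; the function 
name decode_cash and the byte layout are hypothetical.

```python
import struct

def decode_cash(value: bytes) -> float:
    """Decode a 4-byte value as a big-endian IEEE 754 float.

    Assumption: the bytes after '_csh_' are a serialized 32-bit float.
    The report does not confirm this layout.
    """
    if len(value) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(value))
    return struct.unpack(">f", value)[0]

# The four NUL bytes seen in the SQL log, under the assumption above.
raw = b"\x00\x00\x00\x00"
print(decode_cash(raw))
```

If this reading is right, the bytes themselves are well formed and the 
problem is that they end up in the same field as the parsed meta tags.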

  was:
After applying patch for Metadata parser (NUTCH-1478) I notice that the 
metadata field just before the crawl ends is populated with the correct 
information. However when the crawl is completely finished the metadata field 
is populated with 'garbage' _csh_����� 

last few lines of my logfile:
p.s. I use : bin/nutch crawl urls -depth 1 -topN 5 ..

2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: 
org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots 
index, follow
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords 
.com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, 
extention, icann
2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : 
description Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for 
scope 'outlink', using default
2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null 
in cleanup
2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - 
gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null 
in cleanup

    
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
