[
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
J. Gobel updated NUTCH-1511:
----------------------------
Description:
After applying the patch for the Metadata parser (NUTCH-1478), I notice that the
metadata field is populated with the correct information just before the crawl
ends. However, when the crawl is completely finished, the metadata field is
populated with 'garbage': _csh_�����
I notice in my SQL log file that the scoring plugin is updating the metadata
field with '_csh_ \0\0\0\0\0'. When I remove 'scoring-opic' from the
'plugin.includes' property in nutch-site.xml, the metadata field is crisp
and clear.
MYSQL LOG FILE: (I did a crawl on http://nutch.apache.org. Below you will see
fragments of my MYSQL log file, showing only the moments when data is written
to the METADATA field in the MYSQL table.)
First insertion: here I suppose scoring-opic writes its information, _csh_
?€\0\0\0
58 Query INSERT INTO webpage
(fetchInterval,fetchTime,id,markers,metadata,score )VALUES
(2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
_csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE
fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_
y\0',metadata='
_csh_ ?€\0\0\0',score=1.0
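The _csh_ values in these log fragments look like the raw bytes of a 4-byte float rather than random corruption. The following is a minimal sketch of that reading, not a confirmed diagnosis. It assumes that the scoring plugin stores its value under _csh_ as a big-endian IEEE-754 float and that the MySQL log renders those bytes as Windows-1252 text; neither assumption is confirmed by the log itself. The helper names are hypothetical.

```python
import struct


def float_to_bytes(value: float) -> bytes:
    """Serialize a float as 4 big-endian IEEE-754 bytes (assumed storage format)."""
    return struct.pack(">f", value)


def as_mysql_log_text(raw: bytes) -> str:
    """Render raw bytes the way a text log might: cp1252 characters, NUL shown as \\0."""
    text = raw.decode("cp1252", errors="replace")
    return text.replace("\x00", "\\0")


# Assumption: 1.0 is the value written at injection, 0.0 the value after the update step.
injected = float_to_bytes(1.0)
reset = float_to_bytes(0.0)

print(injected.hex(), as_mysql_log_text(injected))  # 3f800000 ?€\0\0
print(reset.hex(), as_mysql_log_text(reset))        # 00000000 \0\0\0\0
```

Under these assumptions, 1.0 becomes ?€\0\0 and 0.0 becomes \0\0\0\0, which matches the first and the final insertion apart from one extra trailing \0. That extra byte may come from how the surrounding metadata map is serialized, which is also an assumption.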
Second insertion: here the scraped metadata is inserted into the metadata field.
81 Query INSERT INTO webpage
(id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES
('org.apache.nutch:http/',
The final insertion: please note that here the metadata field is updated with
_csh_ \0\0\0\0\0
90 Query INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES
(1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508
__prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508
_ftcmrk_*1357122982-1745626508\0','
_csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks='
0http://nutch.apache.org/
was:
After applying the patch for the Metadata parser (NUTCH-1478), I notice that the
metadata field is populated with the correct information just before the crawl
ends. However, when the crawl is completely finished, the metadata field is
populated with 'garbage': _csh_�����
p.s. I use: bin/nutch crawl urls -depth 1 -topN 5 ..
Update: I notice in my SQL log file that the scoring plugin is updating the
metadata field with '_csh_ \0\0\0\0\0'. When I remove 'scoring-opic' from the
'plugin.includes' property in nutch-site.xml, the metadata field is crisp
and clear.
Last few lines of my logfile:
2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots
index, follow
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords
.com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic,
extention, icann
2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag :
description Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for
scope 'outlink', using default
2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null
in cleanup
2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null
in cleanup
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
> Key: NUTCH-1511
> URL: https://issues.apache.org/jira/browse/NUTCH-1511
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, injector, storage
> Affects Versions: 2.1
> Environment: Ubuntu 12.04
> Reporter: J. Gobel
> Labels: metadata, mysql, nutch, scoring-opic
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira