[
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575181#comment-13575181
]
Roland commented on NUTCH-1511:
-------------------------------
Hi Lewis,
I'm not absolutely sure what you mean.
I'm currently running a crawl job for 400k pages, so I can't reproduce this
right now, but I can try to describe it:
Running now without the scoring-opic plugin, I get this from Cassandra:
{code}
[default@webpage] get sc['de.welt.www:http/wams_print/article1466267/Agravis-sucht-Investoren.html'][mtdt];
=> (column=_csh_, value=<f�, timestamp=1360365320407004)
=> (column=author, value=Guido Hartmann , timestamp=1360365320408000)
=> (column=date, value=2007-12-16T04:00:00+01:00 ,
timestamp=1360365320408004)
=> (column=description, value=Der Handelskonzern aus Münster setzt im Jahr mit
landwirtschaftlichen Produkten und Landtechnik knapp vier Milliarden Euro um.
Doch die Gewinne sind bislang mager. Gleichzeitig steigen die Risiken im ,
timestamp=1360365320409000)
=> (column=last-modified, value=2011-11-16T07:35:26+01:00 ,
timestamp=1360365320408002)
=> (column=location, value=welt.de , timestamp=1360365320407000)
=> (column=robots, value=index,follow,noodp , timestamp=1360365320407002)
{code}
From an old run (Nutch with scoring-opic + HBase) I have this data:
{code}
hbase(main):006:0> get 'webpage', 'de.welt.www:http/wams_print/article1466267/Agravis-sucht-Investoren.html', 'mtdt'
COLUMN CELL
mtdt:_csh_ timestamp=1360335641708, value=1U#L
{code}
I can write up a more exact problem report with Cassandra later.
--Roland
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
> Key: NUTCH-1511
> URL: https://issues.apache.org/jira/browse/NUTCH-1511
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, injector, storage
> Affects Versions: 2.1
> Environment: Ubuntu 12.04
> Reporter: J. Gobel
> Labels: metadata, mysql, nutch, scoring-opic
> Fix For: 2.2
>
>
> After applying the patch for the metadata parser (NUTCH-1478), I notice that
> the metadata field is populated with the correct information just before the
> crawl ends. However, when the crawl is completely finished, the metadata field
> is populated with 'garbage' _csh_�����
> I notice in my SQL log file that the scoring plugin is overwriting the
> metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I
> remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml,
> the metadata field is clean and readable.
> MySQL log file: I did a crawl on http://nutch.apache.org. Below are
> fragments of my MySQL log file, showing only the moments when data is
> written to the metadata field in the MySQL table.
> First insertion: here I suppose scoring-opic writes its information, _csh_
> ?€\0\0\0
> 58 Query INSERT INTO webpage
> (fetchInterval,fetchTime,id,markers,metadata,score )VALUES
> (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
> _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE
> fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_
> y\0',metadata='
> _csh_ ?€\0\0\0',score=1.0
> Second insertion: here the scraped metadata is inserted into the metadata field.
> 81 Query INSERT INTO webpage
> (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES
> ('org.apache.nutch:http/',
> Final insertion: please note that here the metadata field is
> overwritten with _csh_ \0\0\0\0
> 90 Query INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata
> )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
> Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508
> __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508
> _ftcmrk_*1357122982-1745626508\0','
> _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks='
> 0http://nutch.apache.org/
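
A note on reading the _csh_ bytes in the log fragments above. The following is a minimal sketch, not taken from the Nutch source. It assumes that the value stored under _csh_ is a 4-byte big-endian IEEE 754 float, and that '?€' in the MySQL log is a Windows-1252 rendering of the bytes 0x3F 0x80. The helper name decode_cash is hypothetical. Under those assumptions the first insertion decodes to 1.0, the final insertion to 0.0, and the HBase value 1U#L to a very small positive number, which would suggest the _csh_ values themselves are ordinary floats that only look like garbage when printed as text.

```python
import struct


def decode_cash(raw: bytes) -> float:
    """Decode 4 raw bytes as a big-endian IEEE 754 single-precision float."""
    if len(raw) != 4:
        raise ValueError(f"expected 4 bytes, got {len(raw)}")
    return struct.unpack(">f", raw)[0]


# Assumption: '?€' in the MySQL log stands for bytes 0x3F 0x80 (cp1252),
# followed by two zero bytes of the same value.
first_insert = "?\u20ac".encode("cp1252") + b"\x00\x00"
# Assumption: the final insertion stores four zero bytes as the value.
final_insert = b"\x00\x00\x00\x00"
# The HBase shell output shows four printable bytes: 1U#L.
hbase_value = b"1U#L"

for label, raw in [
    ("first insertion", first_insert),
    ("final insertion", final_insert),
    ("hbase run", hbase_value),
]:
    print(label, raw.hex(), decode_cash(raw))
```

If this reading is right, the problem in the final insertion would be less the _csh_ value itself than the fact that it replaces the whole metadata field instead of being merged with the parsed metadata.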