[ 
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575317#comment-13575317
 ] 

Lewis John McGibbney commented on NUTCH-1511:
---------------------------------------------

In Cassandra 1.0.2 with Nutch 2.x HEAD and gora-cassandra 0.2, I observe the 
following, starting from an empty database and a brand new test.
After inject:
{code}
[default@unknown] use webpage;
Authenticated to keyspace: webpage
[default@webpage] list p;
Using default limit of 100
Using default column limit of 100

0 Row Returned.
Elapsed time: 1 msec(s).
[default@webpage] list f;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (column=fi, value=2592000, timestamp=1360462001119000)
=> (column=s, value=1.0, timestamp=1360462001120000)
=> (column=ts, value=1360461999790, timestamp=1360462001097000)

1 Row Returned.
Elapsed time: 3 msec(s).
[default@webpage] list sc;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (super_column=mk,
     (column=_injmrk_, value=y, timestamp=1360462001124000)
     (column=dist, value=0, timestamp=1360462001122000))
=> (super_column=mtdt,
     (column=_csh_, value=?�, timestamp=1360462001126000))

1 Row Returned.
Elapsed time: 2 msec(s).

{code}
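
For readers who want to check which URL the row above belongs to, here is a minimal sketch that decodes the hex RowKey printed by cassandra-cli. It assumes the key bytes are plain UTF-8 text. The class and method names (RowKeyDecoder, hexToUtf8) are hypothetical helpers, not part of Nutch or Gora.

```java
import java.nio.charset.StandardCharsets;

public class RowKeyDecoder {

    // Converts a hex string (two hex digits per byte) into UTF-8 text.
    // Assumption: the row key is a UTF-8 encoded string.
    static String hexToUtf8(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // RowKey copied from the "list f" / "list sc" output above.
        String rowKey = "6f72672e6170616368652e676f72613a687474702f";
        System.out.println(hexToUtf8(rowKey)); // prints "org.apache.gora:http/"
    }
}
```

The decoded key has the same reversed-host form as the 'org.apache.nutch:http/' id that appears in the MySQL log quoted in the issue description.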

The value for the opic-scoring CASH_KEY (_csh_) persisted into Cassandra during 
the initial inject stage is already garbage, even before I progress to fetching 
this particular URL. 
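
One possible reading of these bytes, offered as a sketch and not a confirmed diagnosis: if scoring-opic stores the cash value as a raw 4-byte big-endian IEEE 754 float (an assumption here), then 1.0 is encoded as 0x3F 0x80 0x00 0x00. Byte 0x3F is ASCII '?' and 0x80 is '€' in Windows-1252, which would match the leading '?€\0\0' seen in the MySQL log quoted below, while 0.0 is four zero bytes. The class and method names (CashValueDecoder, decodeCash, encodeCash) are hypothetical.

```java
import java.nio.ByteBuffer;

public class CashValueDecoder {

    // Assumption: the _csh_ value is a 4-byte big-endian IEEE 754 float.
    // ByteBuffer uses big-endian byte order by default.
    static float decodeCash(byte[] raw) {
        return ByteBuffer.wrap(raw).getFloat();
    }

    // Inverse of decodeCash: float to its 4-byte big-endian encoding.
    static byte[] encodeCash(float cash) {
        return ByteBuffer.allocate(4).putFloat(cash).array();
    }

    public static void main(String[] args) {
        // Bytes as they appear after inject ('?' = 0x3F, '€' = 0x80 in Windows-1252).
        byte[] afterInject = {0x3F, (byte) 0x80, 0x00, 0x00};
        // Bytes as they appear in the final insertion.
        byte[] afterUpdate = {0x00, 0x00, 0x00, 0x00};

        System.out.println(decodeCash(afterInject)); // prints 1.0
        System.out.println(decodeCash(afterUpdate)); // prints 0.0

        // Round trip: print the encoding of 1.0f as hex bytes.
        for (byte b : encodeCash(1.0f)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

If that assumption holds, the value only looks like garbage because a binary float is being rendered as text. That would not explain the reporter's separate observation that the parsed metadata is overwritten in the final insertion.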
                
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, injector, storage
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch, scoring-opic
>             Fix For: 2.2
>
>
> After applying the patch for the Metadata parser (NUTCH-1478), I notice that the 
> metadata field just before the crawl ends is populated with the correct 
> information. However when the crawl is completely finished the metadata field 
> is populated with 'garbage' _csh_����� 
> I notice in my SQL log file that the scoring plugin is overwriting the 
> metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I 
> remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, 
> the metadata field is crisp and clear.
> MYSQL LOG FILE: (I did a crawl on http://nutch.apache.org. Below you will see 
> fragments of my MYSQL log file, showing only the moments when data is written 
> to the METADATA field in the MYSQL table.)
> First Insertion .. here I suppose scoring-opic writes its information, _csh_ 
> ?€\0\0\0 
> 58 Query    INSERT INTO webpage 
> (fetchInterval,fetchTime,id,markers,metadata,score )VALUES 
> (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
> _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE 
> fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_ 
> y\0',metadata='
> _csh_ ?€\0\0\0',score=1.0
> Second Insertion - here the scraped metadata is inserted into the metadata field. 
>  81 Query    INSERT INTO webpage 
> (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES 
> ('org.apache.nutch:http/',
> The final insertion -  please note that here the metadata field is 
> overwritten with _CSH_\0\0\0\0
> 90 Query    INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata 
> )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
> Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 
> __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508 
> _ftcmrk_*1357122982-1745626508\0','
> _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 
> 0http://nutch.apache.org/

