[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

J. Gobel (JIRA) Tue, 01 Jan 2013 04:08:18 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541626#comment-13541626
 ]


J. Gobel commented on NUTCH-1478:
---------------------------------

Hi Kiran,

I have spent some time monitoring the updates to the metadata field in my MySQL 
database, and something odd is happening.
Just before the crawl finishes, the metadata field is updated with the correct 
information: I can see it being filled with robots (index, follow), 
description, etc. But as soon as the crawl finishes, the metadata field is 
overwritten with 
:_csh_�����
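
Since the overwritten value contains bytes that do not render as text, it can help to look at the raw bytes instead of the client's text rendering. The following is only a minimal sketch: the helper name `hexdump` is hypothetical, it assumes the field value has already been fetched as a Python `bytes` object, and the sample value is made up for illustration (it is not the actual content of the database).

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex pairs, and printable ASCII (dots elsewhere)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as '.' so the layout stays aligned.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Illustrative value only (assumption): a readable prefix followed by binary bytes.
sample = b"_csh_\x00\x01\xff"
print(hexdump(sample))
```

Dumping the field once just before the crawl ends and once afterwards would show exactly which bytes replaced the parsed meta tags.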

I have pasted the last lines of my log below. I am aware that there are still 
some issues with MySQL as a backend for Nutch 2.x.


2013-01-01 11:55:53,177 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO  parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN  parse.MetaTagsParser - Found meta tag : robots    index, follow
2013-01-01 11:55:54,589 WARN  parse.MetaTagsParser - Found meta tag : keywords  .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
2013-01-01 11:55:54,590 WARN  parse.MetaTagsParser - Found meta tag : description       Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2013-01-01 11:55:55,240 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-01 11:55:56,652 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-01 11:55:59,575 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
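
To compare what the parser reported with what ends up in the metadata field, the "Found meta tag" entries can be pulled out of the log. This is a rough sketch, not part of the plugin: the function name `extract_meta_tags` is hypothetical, and it assumes each log entry has been unwrapped onto a single line with the tag name and value separated by whitespace.

```python
import re

# Matches entries such as:
#   ... parse.MetaTagsParser - Found meta tag : robots    index, follow
META_TAG_LINE = re.compile(
    r"parse\.MetaTagsParser - Found meta tag\s*:\s*(\S+)\s+(.*)$"
)


def extract_meta_tags(log_text: str) -> dict:
    """Collect meta tag names and their values from parser log entries."""
    tags = {}
    for line in log_text.splitlines():
        match = META_TAG_LINE.search(line)
        if match:
            # A tag may occur more than once, so values are kept in a list.
            tags.setdefault(match.group(1), []).append(match.group(2).strip())
    return tags


# Two entries copied from the log in this comment, unwrapped onto single lines.
log = """2013-01-01 11:55:54,589 WARN  parse.MetaTagsParser - Found meta tag : robots    index, follow
2013-01-01 11:55:54,590 WARN  parse.MetaTagsParser - Found meta tag : description       Registreer nu uw .com.nl of .net.nl extentie.
"""

for name, values in extract_meta_tags(log).items():
    print(name, values)
```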
                
> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>         Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, 
> Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
> series. They accept multiple values of the same tag and index them in Solr, 
> as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no 
> need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>  
> This is only the first version and does not include a JUnit test. I will 
> post an updated version soon.
> The plugins parse the tags and index them in Solr. Make sure that every 
> field listed under 'index.parse.md' in nutch-site.xml is also defined in 
> schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
