[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541626#comment-13541626 ]
J. Gobel commented on NUTCH-1478:
---------------------------------
Hi Kiran,
I have spent some time checking and monitoring the updates to the metadata
field in my MySQL database, and something odd is happening.
Just before the crawl finishes, the metadata field is updated with the
correct information: I can see it being filled with robots (index, follow),
description, etc. But as soon as the crawl finishes, the metadata field is changed to
:_csh_�����
I have pasted the last lines of my log below. I am aware that there
are still some issues with MySQL as a backend for Nutch 2.x.
2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots index, follow
2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : description Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.1
> Reporter: kiran
> Attachments: metadata_parseChecker_sites.png, Nutch1478.patch,
> Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x
> series. They handle multiple values of the same tag and index them in Solr, as
> I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is
> no need to put the 'metatag' keyword before the metatag names. For example, my
> configuration looks like this
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>
> This is only the first version and does not include a JUnit test. I will
> upload a new version soon.
> The plugins parse the tags and index them in Solr. Make sure the fields listed
> under 'index.parse.md' in nutch-site.xml are also defined in Solr's schema.xml.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.
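For anyone following the setup described in the quoted issue, here is a rough sketch of what the two configuration pieces might look like. The property name index.parse.md comes from the issue text; metatags.names is the property used by the Nutch 1.x parse-metatags plugin and is only assumed to carry over to this port. The tag names (description, keywords), the comma separator, and the Solr field type are illustrative assumptions and are not taken from the patch, so check them against the nutch-site.xml linked in the issue.

```xml
<!-- nutch-site.xml (sketch; names and values are assumptions) -->
<property>
  <name>metatags.names</name>
  <!-- assumed separator; the wiki page for 1.x may use a different one -->
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- per the issue: plain tag names, without the 'metatag' keyword -->
  <value>description,keywords</value>
</property>

<!-- schema.xml in Solr (sketch): one field per name listed in index.parse.md -->
<field name="description" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
```

The fields are marked multiValued here because the issue says multiple values of the same tag are indexed; if only single values are expected, that attribute could probably be dropped.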
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira