[
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541428#comment-13541428
]
kiran edited comment on NUTCH-1478 at 12/31/12 5:43 PM:
--------------------------------------------------------
Hi Jaap,
I ran the same command as you did, and it looks like there are no metatags in
that page. Please check the attached
[screenshot|https://issues.apache.org/jira/secure/attachment/12562805/metadata_parseChecker_sites.png]
showing the different websites I parsed and the metadata extracted from each.
Once parsechecker is working, we should make sure the indexing is working. For
that, we need to list the fields we want indexed in the index.parse.md
property in nutch-site.xml. 1.x and 2.x differ in how this property should be
defined.
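As a rough sketch of that difference, assuming two illustrative metatags named
description and keywords (placeholder names, not taken from the page above),
the property in nutch-site.xml might look like this:

```xml
<!-- Sketch only: "description" and "keywords" are placeholder metatag names. -->
<!-- 2.x with this patch: plain metatag names, no prefix. -->
<property>
  <name>index.parse.md</name>
  <value>description,keywords</value>
</property>
<!-- The 1.x form, per the IndexMetatags wiki page, would instead use:
     <value>metatag.description,metatag.keywords</value> -->
```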
When I was working with this plugin, I was able to define the metatag fields as
they are (without the preceding 'metatag' keyword that 1.x requires) and the
same way in the schema, and it worked for me. This is my schema
(https://github.com/salvager/apache-solr-4.0.0-BETA/blob/master/example/solr/ejournals/conf/schema.xml).
The dc fields that I have defined are specific to the website I am crawling.
They might not be present on all websites.
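For the schema side, here is a minimal sketch of matching field definitions in
schema.xml, using the placeholder names description and keywords. The
text_general type and the attribute values are assumptions based on the stock
Solr 4.0 example schema, not copied from the schema linked above:

```xml
<!-- Sketch only: field names must match the entries in index.parse.md. -->
<!-- multiValued="true" lets a tag that appears several times keep all its values. -->
<field name="description" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true" multiValued="true"/>
```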
I hope this helps.
> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.1
> Reporter: kiran
> Attachments: metadata_parseChecker_sites.png, Nutch1478.patch,
> Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x
> series. The port handles multiple values of the same tag and indexes them in
> Solr, as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is
> no need to put the 'metatag' keyword before metatag names. For example, my
> configuration looks like this
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>
> This is only the first version and does not include a JUnit test. I will
> upload an updated version soon.
> This will parse the tags and index them in Solr. Make sure that every field
> listed in 'index.parse.md' in nutch-site.xml is also defined in schema.xml in
> Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.