[jira] [Comment Edited] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

kiran (JIRA) Mon, 31 Dec 2012 09:44:13 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541428#comment-13541428 ]


kiran edited comment on NUTCH-1478 at 12/31/12 5:43 PM:
--------------------------------------------------------

Hi Jaap,

I ran the same command as you did, and it looks like that page contains no 
metatags. Please check the attached 
[screenshot|https://issues.apache.org/jira/secure/attachment/12562805/metadata_parseChecker_sites.png]
 of several websites I parsed, along with the metadata extracted from each. 

Once parsechecker is working, we should make sure that indexing works too. For 
that, we need to list the fields we want indexed in the index.parse.md 
property in nutch-site.xml. 1.x and 2.x differ in how this property should be 
defined. 

When I was working with this plugin, I was able to define the metatag fields as 
they are (without the preceding 'metatag' keyword that 1.x uses), declare them 
under the same names in the Solr schema, and it worked for me. This is my schema 
(https://github.com/salvager/apache-solr-4.0.0-BETA/blob/master/example/solr/ejournals/conf/schema.xml).
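
A minimal sketch of how the match between the two files could be checked, assuming the index.parse.md value is a comma-separated list and that each listed name must appear as a <field> in the Solr schema.xml. The XML fragments and the dc.* field names below are hypothetical examples, not taken from the linked files, and the helper functions are illustrative only and not part of Nutch or Solr.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml fragment; the field names are examples only.
NUTCH_SITE = """
<configuration>
  <property>
    <name>index.parse.md</name>
    <value>description,keywords,dc.title,dc.creator</value>
  </property>
</configuration>
"""

# Hypothetical Solr schema.xml fragment; dc.creator is left out on purpose.
SOLR_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="description" type="string" indexed="true" stored="true"/>
    <field name="keywords" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="dc.title" type="string" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def indexed_metadata_fields(nutch_site_xml, prop="index.parse.md"):
    """Return the names listed in the given property, in file order."""
    root = ET.fromstring(nutch_site_xml)
    for p in root.iter("property"):
        if p.findtext("name", default="").strip() == prop:
            value = p.findtext("value", default="")
            return [name.strip() for name in value.split(",") if name.strip()]
    return []


def schema_field_names(schema_xml):
    """Return the set of names declared by <field> elements in the schema."""
    root = ET.fromstring(schema_xml)
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_from_schema(nutch_site_xml, schema_xml):
    """Return the names listed for indexing that the schema does not declare."""
    declared = schema_field_names(schema_xml)
    return [n for n in indexed_metadata_fields(nutch_site_xml) if n not in declared]


if __name__ == "__main__":
    print(missing_from_schema(NUTCH_SITE, SOLR_SCHEMA))  # ['dc.creator']
```

Any name the check reports would need a matching field definition in schema.xml before indexing is attempted.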
 

The dc fields that I have defined are specific to the website I am crawling. 
They might not be present on every website. 

I hope this helps. 

                
                  
> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>         Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, 
> Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
> series. The port handles multiple values of the same tag and indexes them in 
> Solr, as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no 
> need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>  
> This is only the first version and does not include the JUnit test. I will 
> post an updated version soon.
> The plugins parse the tags and index them in Solr. Make sure the fields listed 
> in 'index.parse.md' in nutch-site.xml are also defined in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
