[
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vangelis Karvounis updated NUTCH-1478:
--------------------------------------
Attachment: NUTCH-1478v5.1.patch
I have made a patch, but I don't know if I have done it correctly. :P
Anyway, my goal here was to pick up both property and rel tags. I would be
glad to be of any help!
Vangelis
If you want to patch this version, you need to alter
plugin/parse-metatags/MetaTagsParser.java from the latest v5 patch as follows.
Add the following code just before 'return parse' inside the method
ParseFilter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags,
DocumentFragment doc):
    // Store the property tags that the configured tag set asks for.
    Properties property = metaTags.getPropertyTags();
    Enumeration<?> properNames = property.propertyNames();
    while (properNames.hasMoreElements()) {
      String name1 = (String) properNames.nextElement();
      String value1 = property.getProperty(name1);
      // "*" selects every tag; otherwise match on the lowercased name.
      if (metatagset.contains("*") ||
          metatagset.contains(name1.toLowerCase())) {
        LOG.debug("Found meta tag : " + name1 + "\t" + value1);
        //System.out.println("Found meta tag : " + name1 + "\t" + value1);
        page.putToMetadata(new Utf8(PARSE_META_PREFIX +
            name1.toLowerCase()),
            ByteBuffer.wrap(value1.getBytes()));
      }
    }

    // Do the same for the rel tags.
    Properties relProp = metaTags.getRelTags();
    Enumeration<?> relNames = relProp.propertyNames();
    while (relNames.hasMoreElements()) {
      String name2 = (String) relNames.nextElement();
      String value2 = relProp.getProperty(name2);
      if (metatagset.contains("*") ||
          metatagset.contains(name2.toLowerCase())) {
        LOG.debug("Found meta tag : " + name2 + "\t" + value2);
        //System.out.println("Found meta tag : " + name2 + "\t" + value2);
        page.putToMetadata(new Utf8(PARSE_META_PREFIX +
            name2.toLowerCase()),
            ByteBuffer.wrap(value2.getBytes()));
      }
    }
    //System.out.println(" "+metaTags.toString());
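
The two loops above differ only in which Properties object they read, so the
selection logic could probably live in one helper. Below is a minimal,
standalone sketch of that idea, not the code in the patch. The class name
MetaTagCollector, the method collect(), and the "meta_" prefix value are all
hypothetical, and a plain Map stands in for the WebPage metadata so that the
example runs without Nutch on the classpath. It also encodes values as UTF-8
explicitly instead of relying on the platform default charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class MetaTagCollector {

  // Placeholder value (an assumption); the plugin has its own PARSE_META_PREFIX.
  static final String PREFIX = "meta_";

  /**
   * Copies every tag whose lowercased name is in 'wanted' (or every tag when
   * 'wanted' contains "*") into 'out', keyed by PREFIX + lowercased name.
   */
  static void collect(Properties tags, Set<String> wanted,
      Map<String, ByteBuffer> out) {
    for (String name : tags.stringPropertyNames()) {
      String key = name.toLowerCase(Locale.ROOT);
      if (wanted.contains("*") || wanted.contains(key)) {
        String value = tags.getProperty(name);
        out.put(PREFIX + key,
            ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
      }
    }
  }

  public static void main(String[] args) {
    // Stand-ins for metaTags.getPropertyTags() and metaTags.getRelTags().
    Properties propertyTags = new Properties();
    propertyTags.setProperty("og:Title", "Example page");
    propertyTags.setProperty("og:type", "article");

    Properties relTags = new Properties();
    relTags.setProperty("canonical", "http://example.com/");

    // Stand-in for the configured tag set (lowercased names).
    Set<String> wanted = new HashSet<>();
    wanted.add("og:title");
    wanted.add("canonical");

    Map<String, ByteBuffer> metadata = new HashMap<>();
    collect(propertyTags, wanted, metadata);
    collect(relTags, wanted, metadata);

    for (Map.Entry<String, ByteBuffer> e : metadata.entrySet()) {
      System.out.println(e.getKey() + " = "
          + StandardCharsets.UTF_8.decode(e.getValue().duplicate()));
    }
  }
}
```

With a helper like this, the body added to MetaTagsParser would shrink to two
calls, one for the property tags and one for the rel tags.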
> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
> Key: NUTCH-1478
> URL: https://issues.apache.org/jira/browse/NUTCH-1478
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.1
> Reporter: kiran
> Fix For: 2.3
>
> Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch,
> NUTCH-1478v4.patch, NUTCH-1478v5.1.patch, NUTCH-1478v5.patch,
> Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x
> series. They handle multiple values of the same tag and index them in Solr,
> as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is
> no need to give the 'metatag' keyword before metatag names. For example, my
> configuration looks like this
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>
> This is only the first version and does not include a JUnit test. I will
> upload a new version soon.
> This will parse the tags and index them in Solr. Make sure that the fields
> listed under 'index.parse.md' in nutch-site.xml are also defined in Solr's
> schema.xml.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.