[
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristof updated NUTCH-1406:
----------------------------
Description:
This improvement to the index-metatags plugin (sometimes also referred to as
the parse-metatags plugin) allows selected fields to be converted to the Solr
date format and prevents parsing/indexing of metatags that do not contain any
content.
To convert the values of selected metatags to the Solr date format, list them
in the metatags.convert property in nutch-site.xml. The example below uses the
simple Dublin Core element dc.date. Each listed tag must also be defined in the
metatags.names property.
{code}
<property>
<name>metatags.convert</name>
<value>dc.date</value>
<description>For plugin index-metatags: names of the HTML meta tags whose
values should be converted to date format, separated by semicolons.
</description>
</property>
{code}
I have read that SimpleDateFormat is not a robust solution, so this
improvement might have some problems.
So far it has worked well for me. More details about the changes follow below.
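As one possible alternative to SimpleDateFormat, here is a minimal sketch that uses java.time instead. It assumes Java 8 or later, and the class and method names (SolrDateSketch, toSolrDate) are made up for illustration; they are not part of the plugin or the attached jar. It returns the ISO 8601 UTC string that Solr date fields accept, or an empty Optional when the value cannot be parsed.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of index-metatags.
final class SolrDateSketch {

    // Converts a "yyyy-MM-dd" value to an ISO 8601 UTC timestamp.
    // Assumption: midnight UTC is an acceptable time for date-only metatags.
    static Optional<String> toSolrDate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            // LocalDate.parse uses ISO_LOCAL_DATE, which rejects invalid
            // dates such as 2012-02-30 instead of rolling them over.
            LocalDate date = LocalDate.parse(raw.trim());
            return Optional.of(date.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("2012-06-21")); // prints Optional[2012-06-21T00:00:00Z]
        System.out.println(toSolrDate("2012-02-30")); // prints Optional.empty
    }
}
```

The java.time formatters are immutable and thread-safe, so unlike SimpleDateFormat they can be shared between threads.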
Changes made to MetaTagsIndexer.java between lines 41 and 71:
{code}
// Skip metatags that are missing or contain only whitespace.
if (tagEntry != null && tagEntry.trim().length() > 0) {
  boolean added = true;
  if (checkDateConversion(metatag)) {
    try {
      // New instance per call: SimpleDateFormat is not thread-safe.
      Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
      doc.add(metatag, date);
    } catch (ParseException e) {
      // The field is dropped; log the failure instead of printing the stack trace.
      added = false;
      LOG.warn(url.toString() + " : date conversion failed for "
          + tagEntry + " in " + metatag + " field", e);
    }
  } else {
    doc.add(metatag, tagEntry);
  }
  // Report success only when a value was actually added.
  if (added && LOG.isTraceEnabled()) {
    LOG.trace(url.toString() + " : successfully added "
        + tagEntry + " to the " + metatag + " field");
  }
} else if (LOG.isTraceEnabled()) {
  LOG.trace(url.toString() + " : " + metatag + " and "
      + tagEntry + " not added as Metatag does not have any content");
}
{code}
Method added to MetaTagsIndexer.java:
{code}
public boolean checkDateConversion(String metatag) {
  // metatags.convert holds a semicolon-separated list of metatag names.
  // The default "*" is compared literally, not treated as a wildcard.
  String convertToDate = conf.get("metatags.convert", "*");
  for (String check : convertToDate.split(";")) {
    // Trim so that "dc.date; other" style values still match.
    if (check.trim().equals(metatag)) {
      return true;
    }
  }
  return false;
}
{code}
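The method above re-reads and re-splits the property for every metatag. A possible refinement, sketched here under the assumption that the property value is available as a plain String, is to parse the list once into a Set and do lookups against it. The names ConvertFieldsSketch, parseFieldList and shouldConvert are hypothetical and do not exist in the plugin.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of index-metatags.
final class ConvertFieldsSketch {

    // Splits a semicolon-separated property value into a set of names,
    // trimming whitespace and ignoring empty entries.
    static Set<String> parseFieldList(String raw) {
        if (raw == null) {
            return Collections.emptySet();
        }
        return Arrays.stream(raw.split(";"))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
    }

    // True when the given metatag name is in the configured set.
    static boolean shouldConvert(Set<String> fields, String metatag) {
        return fields.contains(metatag);
    }
}
```

In the indexer, the set could be built once when the configuration is set, and shouldConvert called for each metatag.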
> Metatags-index/-parse plugin: conversion to Solr date format and prevents
> parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-1406
> URL: https://issues.apache.org/jira/browse/NUTCH-1406
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, parser
> Reporter: Kristof
> Priority: Minor
> Labels: conversion, date
> Attachments: index-metatags.jar
>