MoreIndexingFilter refactor: move date formats used to parse "lastModified" to
a config file.
---------------------------------------------------------------------------------------------
Key: NUTCH-1190
URL: https://issues.apache.org/jira/browse/NUTCH-1190
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.4
Environment: jdk6
Reporter: Zhang JinYan
There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
Date formats can be diverse, so why not move them to a separate config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named
"date-styles.txt", which is loaded on startup.
{code}
public void setConf(Configuration conf) {
  this.conf = conf;
  MIME = new MimeUtil(conf);
  // Look up the date format list on the configuration classpath.
  URL res = conf.getResource("date-styles.txt");
  if (res == null) {
    LOG.error("Can't find resource: date-styles.txt");
  } else {
    try {
      List lines = FileUtils.readLines(new File(res.getFile()));
      for (int i = 0; i < lines.size(); i++) {
        String dateStyle = (String) lines.get(i);
        // Drop blank lines.
        if (StringUtils.isBlank(dateStyle)) {
          lines.remove(i);
          i--;
          continue;
        }
        dateStyle = StringUtils.trim(dateStyle);
        // Drop comment lines starting with '#'.
        if (dateStyle.startsWith("#")) {
          lines.remove(i);
          i--;
          continue;
        }
        lines.set(i, dateStyle);
      }
      dateStyles = new String[lines.size()];
      lines.toArray(dateStyles);
    } catch (IOException e) {
      // Pass the exception along so the cause is not lost.
      LOG.error("Failed to load resource: date-styles.txt", e);
    }
  }
}
{code}
"lastModified" is then parsed like this (sample):
{code}
private long getTime(String date, String url) {
  ......
  Date parsedDate = DateUtils.parseDate(date, dateStyles);
  time = parsedDate.getTime();
  ......
  return time;
}
{code}
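The load-then-parse flow above can be sketched in a self-contained form using only the JDK. This is an illustration under stated assumptions, not the code from the patch: the class name DateStylesSketch, the method names loadStyles and parseTime, the sample file content, the fixed UTC zone, and the -1 return value for unparseable input are all hypothetical. It uses SimpleDateFormat in place of DateUtils.parseDate so that it runs without Commons Lang.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch; names and sample data are assumptions, not from the patch.
final class DateStylesSketch {

  /** Reads one date pattern per line, skipping blank lines and '#' comments. */
  static List<String> loadStyles(Reader in) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.length() == 0 || style.startsWith("#")) {
        continue;
      }
      styles.add(style);
    }
    return styles;
  }

  /** Tries each pattern in order; returns epoch millis, or -1 if none matches. */
  static long parseTime(String date, List<String> styles) {
    for (String style : styles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.ENGLISH);
      format.setLenient(false);
      // Assumption: values without an explicit zone are read as UTC.
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the pattern consumed the whole input string.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) throws IOException {
    // Assumed sample content for a date-styles.txt file.
    String sample = "# one pattern per line\n"
        + "\n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n"
        + "yyyy-MM-dd HH:mm:ss\n";
    List<String> styles = loadStyles(new StringReader(sample));
    System.out.println(parseTime("Thu, 01 Jan 1970 00:00:01 GMT", styles));
    System.out.println(parseTime("1970-01-02 00:00:00", styles));
  }
}
```

With this layout, supporting a new "lastModified" format would only mean adding a line to the file, with no change to the filter code.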
This patch also contains the patch for
[NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
See the patch file for more details.