[
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170102#comment-17170102
]
ASF GitHub Bot commented on NUTCH-1190:
---------------------------------------
derhecht commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464478821
##########
File path:
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
}
}
+
+ URL res = conf.getResource("date-styles.txt");
Review comment:
OK, any suggestions? The only file-naming convention I can see is that
everyone else ships a template instead of using the file directly.
So what about:
index-more-last-modified-date-styles.txt
?
> MoreIndexingFilter refactor: move data formats used to parse "lastModified"
> to a config file.
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-1190
> URL: https://issues.apache.org/jira/browse/NUTCH-1190
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Affects Versions: 1.4
> Environment: jdk6
> Reporter: Zhang JinYan
> Priority: Major
> Fix For: 1.18
>
> Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch,
> date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> Date formats can be diverse, so why not move them to an
> extra config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
> public void setConf(Configuration conf) {
>   this.conf = conf;
>   MIME = new MimeUtil(conf);
>
>   URL res = conf.getResource("date-styles.txt");
>   if (res == null) {
>     LOG.error("Can't find resource: date-styles.txt");
>   } else {
>     try {
>       List<String> lines = FileUtils.readLines(new File(res.getFile()), "UTF-8");
>       List<String> styles = new ArrayList<String>();
>       for (String line : lines) {
>         String dateStyle = StringUtils.trim(line);
>         // Skip blank lines and "#" comment lines.
>         if (StringUtils.isBlank(dateStyle) || dateStyle.startsWith("#")) {
>           continue;
>         }
>         styles.add(dateStyle);
>       }
>       dateStyles = styles.toArray(new String[styles.size()]);
>     } catch (IOException e) {
>       // Pass the exception along so the cause shows up in the log.
>       LOG.error("Failed to load resource: date-styles.txt", e);
>     }
>   }
> }
> {code}
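One caveat with the snippet above: `new File(res.getFile())` only works when the resource is a plain file on disk, not an entry inside a jar. A minimal sketch of the same filtering logic reading from a stream instead, using only the JDK. The class name `DateStyleLoader` and the method `load` are hypothetical and not part of the patch:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: reads date styles from any Reader.
final class DateStyleLoader {

  /** Returns the non-blank, non-comment lines, trimmed, in file order. */
  static String[] load(Reader in) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue; // skip blank lines and "#" comments
      }
      styles.add(style);
    }
    return styles.toArray(new String[styles.size()]);
  }

  public static void main(String[] args) throws IOException {
    // Inline sample standing in for the contents of date-styles.txt.
    String sample = "# comment\n\n  yyyy-MM-dd  \nEEE, dd MMM yyyy HH:mm:ss zzz\n";
    for (String style : load(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

In the filter itself, the reader could presumably come from Hadoop's `Configuration.getConfResourceAsReader(String)`, which resolves the name on the classpath; whether that fits the plugin's setup is an assumption here.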
> Then parse "lastModified" like this (sample):
> {code}
> private long getTime(String date, String url) {
>   ......
>   Date parsedDate = DateUtils.parseDate(date, dateStyles);
>   time = parsedDate.getTime();
>   ......
>   return time;
> }
> {code}
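Commons Lang's `DateUtils.parseDate(String, String[])` tries each pattern in turn and throws a `ParseException` if none matches. A rough JDK-only sketch of that idea follows. `MultiPatternDateParser` and `parse` are hypothetical names, and the strict parsing, English locale, and GMT default zone are assumptions that may differ from the library's behaviour:

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration, not the Commons Lang implementation.
final class MultiPatternDateParser {

  /** Returns the date from the first pattern that consumes the whole input. */
  static Date parse(String text, String[] patterns) throws ParseException {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      format.setLenient(false);
      // Assumed default zone for inputs that carry no zone of their own.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      ParsePosition pos = new ParsePosition(0);
      Date date = format.parse(text, pos);
      if (date != null && pos.getIndex() == text.length()) {
        return date;
      }
    }
    throw new ParseException("Unable to parse date: " + text, -1);
  }

  public static void main(String[] args) throws ParseException {
    String[] styles = {"yyyy-MM-dd", "EEE, dd MMM yyyy HH:mm:ss zzz"};
    System.out.println(parse("Thu, 01 Jan 1970 00:00:00 GMT", styles).getTime());
  }
}
```

Because the patterns are tried in order, more specific styles are best listed before looser ones in the config file.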
> This patch also includes the patch for
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> See the patch file for more details.