[ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178568#comment-17178568
 ] 

ASF GitHub Bot commented on NUTCH-1190:
---------------------------------------

sebastian-nagel closed pull request #545:
URL: https://github.com/apache/nutch/pull/545

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> MoreIndexingFilter refactor: move data formats used to parse "lastModified" 
> to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
>         Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, 
> date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move those date formats to an 
> extra config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named 
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>     
>     // Load the date patterns from conf/date-styles.txt, one pattern per line.
>     URL res = conf.getResource("date-styles.txt");
>     if (res == null) {
>       LOG.error("Can't find resource: date-styles.txt");
>     } else {
>       try {
>         List<String> styles = new ArrayList<String>();
>         for (Object line : FileUtils.readLines(new File(res.getFile()))) {
>           String dateStyle = StringUtils.trim((String) line);
>           // Skip blank lines and '#' comment lines.
>           if (StringUtils.isBlank(dateStyle) || dateStyle.startsWith("#")) {
>             continue;
>           }
>           styles.add(dateStyle);
>         }
>         dateStyles = styles.toArray(new String[styles.size()]);
>       } catch (IOException e) {
>         // Pass the exception along so the cause shows up in the log.
>         LOG.error("Failed to load resource: date-styles.txt", e);
>       }
>     }
>   }
> {code}
> Then parse "lastModified" like this (sample):
> {code}
>   private long getTime(String date, String url) {
>     ......
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ......
>     return time;
>   }
> {code}
> This patch also contains the patch for 
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.
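
For readers without the patch at hand, the approach described above (filter a list of pattern lines, then try each pattern in turn) can be sketched with the JDK alone. This is only an illustration under stated assumptions, not the code from the attached patch: the class `DateStyles` and its methods `filter` and `parseTime` are hypothetical names, the lines are passed in as a list rather than read from `conf`, and `java.text.SimpleDateFormat` stands in for the commons-lang `DateUtils.parseDate` call shown in the description. Parsing in UTC with `Locale.US` is likewise an assumption of the sketch.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper (not part of Nutch): filters pattern lines and tries them in order. */
final class DateStyles {

  /** Keeps trimmed lines that are neither blank nor '#' comments. */
  static String[] filter(List<String> lines) {
    List<String> styles = new ArrayList<>();
    for (String line : lines) {
      String style = line.trim();
      if (!style.isEmpty() && !style.startsWith("#")) {
        styles.add(style);
      }
    }
    return styles.toArray(new String[0]);
  }

  /**
   * Returns epoch milliseconds for the first pattern that matches the whole
   * input, or -1 if no pattern matches. Assumption: inputs without a zone
   * are interpreted as UTC.
   */
  static long parseTime(String date, String[] styles) {
    for (String style : styles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.US);
      format.setLenient(false);
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the pattern consumed the entire string.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }
}
```

Because patterns are tried in file order, more specific patterns would normally be listed before more general ones in "date-styles.txt".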



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
