[ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178609#comment-17178609
 ] 

Hudson commented on NUTCH-1190:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #3 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/3/])
NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" 
to a config file (snagel: 
[https://github.com/apache/nutch/commit/2c3d864222ef79ed19f33399b5abcd392f27c82a])
* (edit) 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* (add) conf/date-styles.txt.template


> MoreIndexingFilter refactor: move data formats used to parse "lastModified" 
> to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
>         Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, 
> date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move them to a separate
> config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>
>     URL res = conf.getResource("date-styles.txt");
>     if (res == null) {
>       LOG.error("Can't find resource: date-styles.txt");
>       return;
>     }
>     try {
>       // Keep only the trimmed, non-blank, non-comment lines as date styles.
>       List<String> styles = new ArrayList<String>();
>       for (Object line : FileUtils.readLines(new File(res.getFile()))) {
>         String dateStyle = StringUtils.trim((String) line);
>         if (StringUtils.isBlank(dateStyle) || dateStyle.startsWith("#")) {
>           continue;
>         }
>         styles.add(dateStyle);
>       }
>       dateStyles = styles.toArray(new String[styles.size()]);
>     } catch (IOException e) {
>       LOG.error("Failed to load resource: date-styles.txt", e);
>     }
>   }
> {code}
> Then parse "lastModified" like this (sample):
> {code}
>   private long getTime(String date, String url) {
>     ......
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ......
>     return time;
>   }
> {code}
> This patch also includes the patch for 
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.
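
For readers without the patch at hand, the approach described in the issue (filter a list of date patterns, then try each one against the "lastModified" value) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the committed code: it uses the JDK's `SimpleDateFormat` in place of commons-lang `DateUtils.parseDate`, and the class name `DateStylesSketch`, the method names, the GMT default, and the sample patterns are all hypothetical. The actual contents of date-styles.txt are not shown in this message.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

public class DateStylesSketch {

  /** Trims each line and drops blank lines and '#' comment lines. */
  public static List<String> loadPatterns(List<String> lines) {
    List<String> patterns = new ArrayList<String>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      patterns.add(trimmed);
    }
    return patterns;
  }

  /**
   * Tries each pattern in order and returns the epoch milliseconds of the
   * first one that consumes the whole input, or -1 if none matches.
   * Dates without an explicit zone are read as GMT (an assumption of this sketch).
   */
  public static long parseTime(String date, List<String> patterns) {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      format.setLenient(false);
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only a full match, so trailing garbage is rejected.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    // Hypothetical file contents, one pattern per line.
    List<String> lines = Arrays.asList(
        "# date patterns, one per line",
        "",
        "  EEE, dd MMM yyyy HH:mm:ss zzz  ",
        "yyyy-MM-dd HH:mm:ss");
    List<String> patterns = loadPatterns(lines);
    System.out.println("patterns: " + patterns);
    System.out.println("time: " + parseTime("2011-10-27 08:00:00", patterns));
  }
}
```

Keeping the patterns in a plain list means a new format can be supported by adding a line to the config file rather than by editing and recompiling the plugin, which is the point of the change proposed in the issue.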



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
