[ 
https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170189#comment-17170189
 ] 

ASF GitHub Bot commented on NUTCH-1190:
---------------------------------------

jorgelbg commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464527051



##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");

Review comment:
       Right now the config file is optional. If this is the desired behavior 
then I would also add to the message (since it is a warning) something that the 
default list of dates will be used.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#

Review comment:
       The standard ASF license header should be included in all files [for 
example](https://github.com/apache/nutch/blob/master/conf/host-urlnormalizer.txt.template#L1-L14)

##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -82,6 +87,20 @@
   private HashMap<String, String> mimeMap = null;
   private boolean mapMimes = false;
   private String mapFieldName;
+  
+  /** Date-styles used to parse date. */
+  private String[] dateStyles = new String[] {

Review comment:
       Not sure if we want to keep these here? It gets overwritten with the 
content of the config file if the additional config file loading is successful. 
_If_ we want to keep them I would rename them to something like 
`defaultDateStyles` and potentially only use the config file for additional 
formats.

##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");
+    } else {
+      try {
+        List lines = FileUtils.readLines(new 
File(dateStylesResource.getFile()));
+        for (int i = 0; i < lines.size(); i++) {
+          String dateStyle = (String) lines.get(i);
+          
+          if (StringUtils.isBlank(dateStyle)) {
+            lines.remove(i);

Review comment:
       Since we don't need the line number for the next stages, perhaps could 
skip the line if it is empty, `\n`, or a comment (`#`). The idea would be to 
avoid modifying the counter and if we reach the end of the block then we know 
that the line has information that should be kept.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#
+# Set of date time formats
+# used by the plugin index-more when filling the index field `lastModified'.
+#
+# Format (line separated date time patterns following definition in 
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html, 
comment lines start with `#'):

Review comment:
       nit: perhaps we should trim this to ~120 columns?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> MoreIndexingFilter refactor: move data formats used to parse "lastModified" 
> to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
>         Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, 
> date-styles.txt
>
>
> There many issues about missing date format:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The data formats can be diverse, so why not move those data formats to a 
> extra config file?
> I move all the data formats from "MoreIndexingFilter.java" to a file named 
> "date-styles.txt"(place in "conf"), which will be load on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>     
>     URL res = conf.getResource("date-styles.txt");
>     if(res==null){
>       LOG.error("Can't find resource: date-styles.txt");
>     }else{
>       try {
>         List lines = FileUtils.readLines(new File(res.getFile()));
>         for (int i = 0; i < lines.size(); i++) {
>           String dateStyle = (String) lines.get(i);
>           if(StringUtils.isBlank(dateStyle)){
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           dateStyle=StringUtils.trim(dateStyle);
>           if(dateStyle.startsWith("#")){
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           lines.set(i, dateStyle);
>         }
>         dateStyles = new String[lines.size()];
>         lines.toArray(dateStyles);
>       } catch (IOException e) {
>         LOG.error("Failed to load resource: date-styles.txt");
>       }
>     }
>   }
> {code}
> Then parse "lastModified" like this(sample):
> {code}
>   private long getTime(String date, String url) {
>     ......
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ......
>     return time;
>   }
> {code}
> This path also contains the "path" of 
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to