jorgelbg commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464527051



##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");

Review comment:
       Right now the config file is optional. If this is the desired behavior 
then I would also add to the message (since it is a warning) something that the 
default list of dates will be used.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#

Review comment:
       The standard ASF license header should be included in all files [for 
example](https://github.com/apache/nutch/blob/master/conf/host-urlnormalizer.txt.template#L1-L14)

##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -82,6 +87,20 @@
   private HashMap<String, String> mimeMap = null;
   private boolean mapMimes = false;
   private String mapFieldName;
+  
+  /** Date-styles used to parse date. */
+  private String[] dateStyles = new String[] {

Review comment:
       Not sure if we want to keep these here? It gets overwritten with the 
content of the config file if the additional config file loading is successful. 
_If_ we want to keep them I would rename them to something like 
`defaultDateStyles` and potentially only use the config file for additional 
formats.

##########
File path: 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");
+    } else {
+      try {
+        List lines = FileUtils.readLines(new 
File(dateStylesResource.getFile()));
+        for (int i = 0; i < lines.size(); i++) {
+          String dateStyle = (String) lines.get(i);
+          
+          if (StringUtils.isBlank(dateStyle)) {
+            lines.remove(i);

Review comment:
       Since we don't need the line number for the next stages, perhaps could 
skip the line if it is empty, `\n`, or a comment (`#`). The idea would be to 
avoid modifying the counter and if we reach the end of the block then we know 
that the line has information that should be kept.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#
+# Set of date time formats
+# used by the plugin index-more when filling the index field `lastModified'.
+#
+# Format (line separated date time patterns following definition in 
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html, 
comment lines start with `#'):

Review comment:
       nit: perhaps we should trim this to ~120 columns?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to