Revision: 17362
          http://sourceforge.net/p/gate/code/17362
Author:   ian_roberts
Date:     2014-02-20 12:12:57 +0000 (Thu, 20 Feb 2014)
Log Message:
-----------
Re-instate the stripping of protocol, query and fragment which slipped quietly
away when we introduced support for remote ARC records with HTTP Range
requests.

Modified Paths:
--------------
    gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java

Modified: gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java
===================================================================
--- gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java      2014-02-20 12:02:24 UTC (rev 17361)
+++ gcp/trunk/src/gate/cloud/io/arc/ARCDocumentNamingStrategy.java      2014-02-20 12:12:57 UTC (rev 17362)
@@ -25,7 +25,7 @@
 import gate.util.GateException;
 
 /**
- * A naming strategy to convert document IDs suitable for use with
+ * <p>A naming strategy to convert document IDs suitable for use with
  * an {@link ArchiveInputHandler} to file paths suitable for saving the
  * results of their processing.  It assumes that the document IDs
  * use the record URL as the id text (see {@link DocumentID#getIdText()}), and
@@ -36,16 +36,17 @@
  * directories constructed by padding the document sequence number to the left 
  * with zeros and creating intermediate directories according to a configurable
  * pattern.  The default pattern is '3/3', which pads the numbers to a minimum 
- * of 6 digits and then splits them up into groups of three.  The remainder of 
- * the ID after the number is cleaned up to remove any URL protocol like
- * http:// and any query string or fragment.  Any sequences of non-ASCII
- * characters are removed and any remaining slashes or colons are replaced
- * with underscores.  For example with the default pattern, the document
- * ID '0001_http://example.org/file.html?param=value' maps to the file
+ * of 6 digits and then splits them up into groups of three.  The ID text
+ * is cleaned up to remove any URL protocol like http:// and any query string
+ * or fragment.  Any sequences of non-ASCII characters are removed and any
+ * remaining slashes or colons are replaced with underscores.</p>
+ * 
+ * <p>For example with the default pattern, the document
+ * ID with <code>recordPosition="1"</code> and URL 'http://example.org/file.html?param=value' maps to the file
  * 000/001_example.org_file.html (with any additional configured file
- * extension appended).  If the leading number has more digits than the
+ * extension appended).  If the numeric part has more digits than the
  * pattern allows then additional digits are used in the first place, so
- * the ID 1234567 maps to 1234/567 rather than 123/4567.
+ * the ID 1234567 maps to 1234/567 rather than 123/4567.</p>
  * @author ian
  *
  */
@@ -68,6 +69,12 @@
    */
   protected static final Pattern NON_FILENAME_PATTERN = Pattern.compile(
           "[/:\\\\]");
+  
+  /**
+   * Pattern to strip the protocol, query string and fragment from a URL.
+   */
+  protected static final Pattern STRIP_PROTOCOL_QUERY_FRAGMENT =
+          Pattern.compile("^(?:.*?://)?(.*?)(?:\\?.*)?(?:#.*)?$");
 
   public void config(boolean isOutput, Map<String, String> configData)
           throws IOException, GateException {
@@ -147,6 +154,13 @@
     // the rest of the output file path is constructed from the record URL
     String remaining = id.getIdText();
     if(remaining != null && remaining.length() > 0) {
+      // strip the protocol, query and fragment
+      Matcher stripQueryMatcher = STRIP_PROTOCOL_QUERY_FRAGMENT.matcher(remaining);
+      if(stripQueryMatcher.find()) {
+        // this matcher should never fail as every string can match the (.*) part,
+        // but be conservative anyway
+        remaining = stripQueryMatcher.group(1);
+      }
       // append an underscore and the cleaned-up remaining part of the name
       pathBuilder.append("_");
       Matcher nonAsciiMatcher = NON_ASCII_PATTERN.matcher(remaining);
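
To illustrate what the new STRIP_PROTOCOL_QUERY_FRAGMENT pattern does, here is a minimal standalone sketch. The regular expression is copied from the diff; the class name StripUrlSketch and the helper strip() are hypothetical and are not part of ARCDocumentNamingStrategy. It mirrors the find()/group(1) logic of the change, falling back to the input if the matcher were ever to fail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripUrlSketch {
  // Same expression as STRIP_PROTOCOL_QUERY_FRAGMENT in the diff.
  static final Pattern STRIP_PROTOCOL_QUERY_FRAGMENT =
      Pattern.compile("^(?:.*?://)?(.*?)(?:\\?.*)?(?:#.*)?$");

  // Hypothetical helper (an assumption for illustration only):
  // returns the URL without protocol, query string and fragment.
  static String strip(String url) {
    Matcher m = STRIP_PROTOCOL_QUERY_FRAGMENT.matcher(url);
    // group(1) is the lazily matched part between the optional
    // protocol prefix and the optional query/fragment suffix
    return m.find() ? m.group(1) : url;
  }

  public static void main(String[] args) {
    // prints: example.org/file.html
    System.out.println(strip("http://example.org/file.html?param=value"));
    // prints: example.org/page
    System.out.println(strip("http://example.org/page#section"));
    // prints: example.org/plain
    System.out.println(strip("example.org/plain"));
  }
}
```

The slashes left in the result are dealt with afterwards by the existing NON_FILENAME_PATTERN replacement, which is how the Javadoc example arrives at example.org_file.html.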

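The directory layout described in the class Javadoc (zero padding, groups taken from a pattern such as '3/3', surplus digits going into the first group) can be sketched as follows. This is only an illustration of the documented behaviour, not the code in ARCDocumentNamingStrategy: the class PaddedPathSketch and the method toPath() are hypothetical, and the sketch assumes a non-negative number and a pattern made of positive integers separated by slashes.

```java
public class PaddedPathSketch {
  // Hypothetical helper: turns a record number into a padded,
  // slash-separated path according to a pattern such as "3/3".
  static String toPath(long number, String pattern) {
    String[] parts = pattern.split("/");
    int[] widths = new int[parts.length];
    int total = 0;
    for (int i = 0; i < parts.length; i++) {
      widths[i] = Integer.parseInt(parts[i]);
      total += widths[i];
    }
    // left-pad with zeros to at least the total pattern width
    String digits = String.format("%0" + total + "d", number);
    StringBuilder path = new StringBuilder();
    int end = digits.length();
    // carve groups off from the right, so that any surplus
    // digits end up in the first (leftmost) group
    for (int i = widths.length - 1; i > 0; i--) {
      path.insert(0, "/" + digits.substring(end - widths[i], end));
      end -= widths[i];
    }
    path.insert(0, digits.substring(0, end));
    return path.toString();
  }

  public static void main(String[] args) {
    // prints: 000/001
    System.out.println(toPath(1, "3/3"));
    // prints: 1234/567
    System.out.println(toPath(1234567, "3/3"));
  }
}
```

Both outputs agree with the examples given in the Javadoc: record position 1 lands under 000/001, and 1234567 becomes 1234/567 rather than 123/4567.
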