[GitHub] [tomcat] martin-g commented on a change in pull request #352: Optimizing Resource Lookup using Bloom Filter

GitBox Thu, 03 Sep 2020 07:01:42 -0700


martin-g commented on a change in pull request #352:
URL: https://github.com/apache/tomcat/pull/352#discussion_r482996760




##########
File path: java/org/apache/catalina/webresources/JarContents.java
##########
@@ -0,0 +1,144 @@
+package org.apache.catalina.webresources;
+
+import java.util.BitSet;
+import java.util.Enumeration;
+import java.util.jar.JarEntry;
+import java.util.jar.JarFile;
+
+/**
+ * This class represents the contents of a jar by determining whether a given
+ * resource <b>might</b> be in the cache, based on a bloom filter. This is not 
a
+ * general-purpose bloom filter because it contains logic to strip out
+ * characters from the beginning of the key.
+ *
+ * The hash methods are simple but good enough for this purpose.
+ */
+public final class JarContents {
+    private final BitSet bits1;
+    private final BitSet bits2;
+    /**
+     * Constant used by a typical hashing method.
+     */
+    private static final int HASH_PRIME_1 = 31;
+
+    /**
+     * Constant used by a typical hashing method.
+     */
+    private static final int HASH_PRIME_2 = 17;
+
+    /**
+     * Size of the fixed-length bit table. Larger reduces false positives,
+     * smaller saves memory.
+     */
+    private static final int TABLE_SIZE = 2048;
+
+    /**
+     * Parses the passed-in jar and populates the bit array.
+     *
+     * @param jar
+     */
+    public JarContents(JarFile jar) {
+        Enumeration<JarEntry> entries = jar.entries();
+        bits1 = new BitSet(TABLE_SIZE);
+        bits2 = new BitSet(TABLE_SIZE);
+
+        // Enumerations. When will they update this API?!
+        while (entries.hasMoreElements()) {
+            JarEntry entry = entries.nextElement();
+            String name = entry.getName();
+            int startPos = 0;
+
+            // If the path starts with a slash, that's not useful information.
+            // Skipping it increases the significance of our key by
+            // removing an insignificant character.
+            boolean precedingSlash = name.charAt(0) == '/';
+            if (precedingSlash) {
+                startPos = 1;
+            }
+
+            // Find the correct table slot
+            int pathHash1 = hashcode1(name, startPos);
+            int pathHash2 = hashcode2(name, startPos);
+
+            bits1.set(pathHash1 % TABLE_SIZE);
+            bits2.set(pathHash2 % TABLE_SIZE);
+        }
+    }
+
+    /**
+     * Simple hashcode of a portion of the string. Typically we would use
+     * substring, but memory and runtime speed are critical.
+     *
+     * @param content
+     *            Wrapping String.
+     * @param startPos
+     *            First character in the range.
+     * @return hashcode of the range.
+     */
+    private int hashcode1(String content, int startPos) {
+        int h = 7;
+        int contentLength = content.length();
+        for (int i = startPos; i < contentLength; i++) {
+            h = HASH_PRIME_1 * h + content.charAt(i);
+        }
+
+        if (h < 0) {
+            h = h * -1;
+        }
+        return h;
+    }
+
+    /**
+     * Simple hashcode of a portion of the string. Typically we would use
+     * substring, but memory and runtime speed are critical.
+     *
+     * @param content
+     *            Wrapping String.
+     * @param startPos
+     *            First character in the range.
+     * @return hashcode of the range.
+     */
+    private int hashcode2(String content, int startPos) {
+        int h = 7;

Review comment:
       The method body looks the same as `hashcode1`. Why don't you add an 
additional parameter for the `HASH_PRIME` and remove one of them

##########
File path: java/org/apache/catalina/webresources/AbstractArchiveResourceSet.java
##########
@@ -211,6 +216,43 @@ public final WebResource getResource(String path) {
         String webAppMount = getWebAppMount();
         WebResourceRoot root = getRoot();
 
+
+        /*
+         * This initializes (when necessary) and checks the jarContents, which
+         * is a highly efficient index of the files stored in the jar. If
+         * jarContents reports that this resource definitely does not contain
+         * the path, we can end this method and move on to the next jar.
+         *
+         * Note: the initialization is thread-safe because multiple 
simultaneous
+         * threads will create a complete and valid copy, then set the shared
+         * pointer. This guarantees the shared pointer will always go to a
+         * valid object. The cost of multiple copies is small since only one of
+         * them will be long-lived.
+         */
+        try {
+            if ((root.getContext() != null) && (root.getContext().getParent()) 
!= null &&
+                    (((Host) 
root.getContext().getParent()).getFastClasspathScanning())) {
+
+                if (jarContents == null ||
+                        ((System.currentTimeMillis() - prevJarOpenTime > 
jarRefreshTime) &&
+                            !openJarFile().equals(jar))) {
+                    try {
+                        jar = openJarFile();

Review comment:
       You call `openJarFile()` twice here. This will increase 
`org.apache.catalina.webresources.AbstractArchiveResourceSet#archiveUseCount` 
twice and later it won't be GC-ed by 
`org.apache.catalina.webresources.AbstractArchiveResourceSet#gc`.
   I think you need to call 
`org.apache.catalina.webresources.AbstractArchiveResourceSet#closeJarFile` 
after creating the JarContents




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [tomcat] martin-g commented on a change in pull request #352: Optimizing Resource Lookup using Bloom Filter

Reply via email to