Nuno Santos created OAK-10671:
---------------------------------

             Summary: [Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal
                 Key: OAK-10671
                 URL: https://issues.apache.org/jira/browse/OAK-10671
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing
            Reporter: Nuno Santos

Regex path filtering is currently implemented with a condition like:

{noformat}
_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND _path in [^\Q/foo/bar/\E.*$])
{noformat}

The second condition is necessary to deal with long path documents, whose {{_id}} is a hash instead of the path of the document, and which have an additional {{_path}} property with the full path of the document. The {{_id}} field is part of the index used by the query, but {{_path}} is not indexed. The performance of this query is therefore very sensitive to how many times the condition can be resolved without looking up the value of {{_path}}, which requires retrieving the full document from the column store. If the condition can be evaluated using only the {{_id}} value, then a document that does not match should never be retrieved from the column store.

Unfortunately, Mongo does not seem to optimize this query properly: it retrieves the document from the column store even when {{_id}} does not match the path /foo/bar and is not in the hash format. This leads to very poor performance, as both the index and the column store have to be fully read by this query.

We can instead use the following condition:

{noformat}
_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$ , ^[0-9]{1,3}:h.*$]
{noformat}

That is, download the document if the {{_id}} matches the path or if it is a hash. This has the disadvantage that it downloads all long path documents from the repository, many of which might not be needed. However, this condition uses only the {{_id}} field, so it is guaranteed to be evaluated fully using only the data in the index. The number of long path documents is usually very small, and some environments don't have any at all, so downloading them should not take much time. The indexing job reapplies the path filter locally anyway, which eliminates the long path documents that it does not need.
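
The proposed two-step filtering can be illustrated with a small in-memory sketch. This is not the Oak implementation and does not query Mongo: the helper names ({{id_patterns}}, {{server_side_match}}, {{local_filter}}) and the sample documents are hypothetical, and because Python's {{re}} module has no {{\Q...\E}} quoting, {{re.escape}} is used in its place.

```python
import re

# Prefix of every _id: "<1 to 3 digits>:" (taken from the regexes in the description).
DEPTH_PREFIX = r"^[0-9]{1,3}:"


def id_patterns(path_prefix):
    """Build the two _id-only patterns: ids under the path, and any hashed id."""
    # re.escape stands in for \Q...\E, which Python's re does not support.
    under_prefix = re.compile(DEPTH_PREFIX + re.escape(path_prefix) + r".*$")
    hashed_id = re.compile(DEPTH_PREFIX + r"h.*$")
    return [under_prefix, hashed_id]


def server_side_match(doc, patterns):
    """Simulate the index-only condition: only the _id field is inspected."""
    return any(p.match(doc["_id"]) for p in patterns)


def local_filter(doc, path_prefix):
    """Re-apply the path filter locally, using _path for long path documents."""
    path = doc.get("_path") or doc["_id"].split(":", 1)[1]
    return path.startswith(path_prefix)


# Invented sample documents: two regular ids and two long path (hashed) ids.
docs = [
    {"_id": "3:/foo/bar/a"},
    {"_id": "3:/foo/baz/a"},
    {"_id": "40:h1234abcd", "_path": "/foo/bar/very/long/path"},
    {"_id": "40:h5678ef01", "_path": "/other/very/long/path"},
]

patterns = id_patterns("/foo/bar/")
# Step 1: what the _id-only query would download (includes every hashed id).
downloaded = [d for d in docs if server_side_match(d, patterns)]
# Step 2: the local filter drops long path documents outside /foo/bar/.
kept = [d for d in downloaded if local_filter(d, "/foo/bar/")]
```

In this sketch the first step downloads three of the four documents, including both hashed ones, and the second step discards the long path document whose {{_path}} lies outside /foo/bar/.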

--
This message was sent by Atlassian Jira
(v8.20.10#820010)