Nuno Santos created OAK-10671:
---------------------------------

             Summary: [Indexing job] - Improve Mongo regex query: remove 
condition on non-indexed _path field to speedup traversal
                 Key: OAK-10671
                 URL: https://issues.apache.org/jira/browse/OAK-10671
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing
            Reporter: Nuno Santos


Regex path filtering is currently implemented with a condition like:
{noformat}
_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND 
_path in [^\Q/foo/bar/\E.*$])
{noformat}
The second condition is necessary to handle long path documents, whose 
{{_id}} is a hash instead of the path of the document, and which carry an 
additional {{_path}} property with the full path of the document. The {{_id}} 
field is part of the index used by the query, but {{_path}} is not indexed. The 
performance of this query is therefore very sensitive to how many times the 
condition can be resolved without looking up the value of {{_path}}, which 
requires retrieving the full document from the column store. If the condition 
can be evaluated using only the {{_id}} value, then when there is no match the 
document should not be retrieved from the column store.
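
The two branches of the condition can be mimicked locally to see which 
documents would force a {{_path}} lookup. This is only a sketch in Python, not 
Oak or Mongo code: the sample documents, the hash-style ids and the helper name 
are made up for illustration, and {{re.escape}} stands in for the \Q...\E 
quoting, which Python's {{re}} module does not support.

```python
import re

PATH_PREFIX = "/foo/bar/"  # example path taken from the condition above

# re.escape replaces the \Q...\E quoting used in the original patterns.
ID_PATH = re.compile(r"^[0-9]{1,3}:" + re.escape(PATH_PREFIX) + r".*$")
ID_HASH = re.compile(r"^[0-9]{1,3}:h.*$")
PATH = re.compile(r"^" + re.escape(PATH_PREFIX) + r".*$")


def matches_current(doc):
    """Return (matched, needed_path) for the current two-branch condition."""
    if ID_PATH.match(doc["_id"]):
        return True, False   # decided from _id alone
    if not ID_HASH.match(doc["_id"]):
        return False, False  # decided from _id alone, no document fetch needed
    # Only hash-style ids require the non-indexed _path field.
    return bool(PATH.match(doc.get("_path", ""))), True


# Hypothetical documents; ids, hashes and paths are invented for this sketch.
docs = [
    {"_id": "3:/foo/bar/a"},
    {"_id": "2:/other/x"},
    {"_id": "12:h0123abcd/leaf", "_path": "/foo/bar/very/long/leaf"},
    {"_id": "12:h4567cdef/leaf", "_path": "/other/very/long/leaf"},
]
results = [matches_current(d) for d in docs]
```

In this sketch only the two hash-style documents need {{_path}}; an ideal 
query plan would fetch the full document for those and for nothing else.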

Unfortunately, Mongo does not seem to optimize this query properly: it 
retrieves the document from the column store even when {{_id}} neither matches 
the path /foo/bar nor is in the hash format. This leads to very poor 
performance, as both the index and the column store have to be fully read by 
this query.

We can instead use the following condition:
{noformat}
_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$ , ^[0-9]{1,3}:h.*$]
{noformat}
That is, download the document if the {{_id}} matches the path or if it is a 
hash. The disadvantage is that this downloads all long path documents from the 
repository, many of which might not be needed. However, this condition only 
uses the {{_id}} field, so it is guaranteed to be evaluated fully using only 
the data in the index. The number of long path documents is usually very small 
(some environments do not have any), so downloading them should not take much 
time. The indexing job will in any case reapply the path filter locally to 
eliminate the long path documents that it does not require.
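
The proposed flow, an {{_id}}-only filter on the server followed by a local 
path filter, might look roughly like the sketch below. It is plain Python over 
an in-memory list rather than a real Mongo query. The function names and sample 
documents are hypothetical, and {{path_of}} assumes that for regular documents 
the path is the part of {{_id}} after the numeric prefix.

```python
import re

PATH_PREFIX = "/foo/bar/"  # example path taken from the condition above

# Both patterns look only at _id, so an index on _id is enough to evaluate them.
ID_PATTERNS = [
    re.compile(r"^[0-9]{1,3}:" + re.escape(PATH_PREFIX) + r".*$"),
    re.compile(r"^[0-9]{1,3}:h.*$"),
]


def download(docs):
    """Stand-in for the query: keep documents whose _id matches any pattern."""
    return [d for d in docs if any(p.match(d["_id"]) for p in ID_PATTERNS)]


def path_of(doc):
    """Use _path for long path documents, else the _id after the prefix."""
    return doc.get("_path") or doc["_id"].split(":", 1)[1]


def filter_locally(docs):
    """Reapply the path filter on the client to drop unneeded long paths."""
    return [d for d in docs if path_of(d).startswith(PATH_PREFIX)]


# Hypothetical documents; ids, hashes and paths are invented for this sketch.
docs = [
    {"_id": "3:/foo/bar/a"},
    {"_id": "2:/other/x"},
    {"_id": "12:h0123abcd/leaf", "_path": "/foo/bar/very/long/leaf"},
    {"_id": "12:h4567cdef/leaf", "_path": "/other/very/long/leaf"},
]
downloaded = download(docs)
kept = filter_locally(downloaded)
```

Here one unneeded long path document is downloaded and then discarded locally, 
which is the trade-off described above.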



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
