[ https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149570#comment-17149570 ]

Ashish Chopra edited comment on OAK-8523 at 7/1/20, 5:16 PM:
-------------------------------------------------------------

Thanks for reviewing the use case and providing suggestions [~thomasm]!
bq. If you update the list of references a lot, then storing it as a binary or 
a multi-valued property is problematic, if the list is large. In this case, the 
whole list is stored over and over again.
I don't expect the list of references to be updated very frequently - perhaps 
once every 15-20 minutes for a given reference. The whole subtree (containing 
the paths of resources whose references are updated in the sites - i.e., the 
reference-cache's "primary key") could see more frequent updates, though - but 
still on the order of tens of minutes, no sooner.
bq. It might be better to store a multi-valued property at the place where the 
reference is (the source). And not in the target.
IIUC, you mean a structure like:
{noformat}
/var
  + my.site.example.com
    + my
      + page
        + 1
          - resources = {/a/b/my-img.jpg, ...}
        + 2
          - resources = {/a/b/my-img.jpg, ...}
  + myother.site.example.com
    + my
      + page
        + 11
          - resources = {/a/b/my-img.jpg, ...}
{noformat}
which is a webpage-as-primary-key-cache, right?
bq. Because usually (I assume) a source only references a small set of targets. 
On the other hand, a target can be referenced by a huge number of sources. 
Right? And then have a index on that property. That's it. Indexes are updated 
efficiently.
That's not entirely true IME (i.e., the number wouldn't be "small" in an 
absolute sense) - but yes, it is _expected_ to be _smaller than_ in the 
previous scenario.
However, as you've already touched upon, we didn't go down this road because a 
lookup in the webpage-as-primary-key-cache would need a (full-text?) query 
(irrespective of whether we do it via MV string props or (potentially inlined) 
binary props). OTOH, with the resource-as-primary-key-cache, we just traverse 
right to the resource we need the references for and read the prop - no search 
required.
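For contrast, a resource-as-primary-key-cache would look roughly like this (the {{/var/cache}} root and the {{referencedBy}} property name are illustrative assumptions, not the actual content model):
{noformat}
/var
  + cache
    + a
      + b
        + my-img.jpg
          - referencedBy = {https://my.site.example.com/my/page/1.html,
                            https://my.site.example.com/my/page/2.html,
                            https://myother.site.example.com/my/page/11.html}
{noformat}
Looking up the references for {{/a/b/my-img.jpg}} is then just path resolution plus a single property read.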
Another issue was updating the 'deletions'. We can only get the webpages where 
the resource is referenced at _any given time_. If we created a 
webpage-as-primary-key-cache, we'd need to issue a full-text search through 
the whole cache to 'invalidate' the pages which _earlier_ had the 
resource-reference but no longer do. With the resource-as-primary-key-cache we 
just delete the existing cache-entry and recreate it with the new data.
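The asymmetry can be sketched with a toy in-memory model (plain Python dicts; names and shapes are purely illustrative, not Oak APIs):

```python
# Toy model (illustrative only): compare the update cost of the two cache
# shapes when the set of webpages referencing a resource changes.

def update_resource_keyed(cache, resource, new_referrers):
    # resource-as-primary-key: drop the entry and recreate it with new data -
    # a single-key operation, no search over the rest of the cache
    cache.pop(resource, None)
    cache[resource] = set(new_referrers)

def update_webpage_keyed(cache, resource, new_referrers):
    # webpage-as-primary-key: must scan the whole cache to 'invalidate'
    # pages that earlier referenced the resource but no longer do
    for page, resources in cache.items():
        if page in new_referrers:
            resources.add(resource)
        else:
            resources.discard(resource)

# resource-keyed: one delete + one put
r_cache = {"/a/b/my-img.jpg": {"page/1", "page/2"}}
update_resource_keyed(r_cache, "/a/b/my-img.jpg", ["page/1", "page/3"])

# webpage-keyed: every entry has to be visited
w_cache = {"page/1": {"/a/b/my-img.jpg"},
           "page/2": {"/a/b/my-img.jpg"},
           "page/3": set()}
update_webpage_keyed(w_cache, "/a/b/my-img.jpg", ["page/1", "page/3"])
```

In the real repository the "scan" in the second function would be the full-text query described above, which is exactly the cost we wanted to avoid.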
bq. In [case of moves, copies, or deletes a source or target?] the references 
can easily get outdated. What about using UUIDs instead of paths (jcr:uuid)?
Sadly, using jcr:uuid is not an option because we're dealing with (somewhat) 
legacy stuff - designed when IDs were considered evil [0] - and thus the 
webpages contain paths as resource-references instead of IDs.

To answer the question, though:
Considering the resource ({{/a/b/my-img.jpg}}):
* for moves:
** the references (webpages) will be adjusted and cache entries will be 
relocated to align with the new resource path (i.e., the key will be 'renamed').
* for copies:
** this is effectively a new resource - references will be created as content 
is authored and cache entries will be built accordingly.
* for deletes:
** the cache entry will be invalidated.
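To make the move case concrete (the cache location and target path here are hypothetical): if {{/a/b/my-img.jpg}} were moved to {{/a/b2/my-img.jpg}}, only the key is renamed; the value carries over unchanged:
{noformat}
before:  /var/cache/a/b/my-img.jpg   - referencedBy = {...}
after:   /var/cache/a/b2/my-img.jpg  - referencedBy = {...}
{noformat}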

Considering the webpage ({{https://my.site.example.com/my/page/1.html}}):
* for moves/copies/deletes:
** the cache entries for resources referenced within the affected webpage will 
be invalidated and a rebuild issued.


[0] https://www.slideshare.net/jukka/content-storage-with-apache-jackrabbit/19-Content_modeling_Davids_model1_Data (see rule #7)



> Best Practices - Property Value Length Limit
> --------------------------------------------
>
>                 Key: OAK-8523
>                 URL: https://issues.apache.org/jira/browse/OAK-8523
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core, jcr
>            Reporter: Thomas Mueller
>            Priority: Major
>
> Right now, Oak supports very large properties (e.g. String). But 1 MB (or 
> larger) properties are problematic in multiple areas like indexing. It is 
> more important for software-as-a-service, where we need to guarantee SLOs, 
> but it also helps other cases. So we should:
> * (1) Document best practices, e.g. "Property values should be smaller than 
> 100 KB".
> * (2) Introduce "softLimit" and "hardLimit", where softLimit is e.g. 100 KB 
> and hardLimit is configurable, and (initially) by default Integer.MAX_VALUE. 
> Setting the hard limits to a lower value by default is problematic, because 
> it can break existing applications. With default value infinity, customers 
> can set lower limits e.g. in tests first, and once they are happy, in 
> production as well.
> * (3) Log a warning if a property is larger than "softLimit". To avoid 
> logging many warnings (if there are many such properties) we then set 
> softLimit = softLimit * 1.1 (reset to 100 KB in the next repository start). 
> Logging is needed to know what _exactly_ is broken (path, stack trace of the 
> actual usage...)
> * (4) Add a metric (monitoring) for detected large properties. Just logging 
> warnings might not be enough.
> * (5) Throttling: we could add flow control (pauses; Thread.sleep) after 
> violations, to improve isolation (to prevent affecting other threads that 
> don't violate the contract).
> * (6) We could expose the violation info in the session, so a framework could 
> check that data after executing custom code, and add more info (e.g. log).
> * (7) If larger than the configurable hardLimit, fail the commit or reject 
> setProperty (throw an exception).
> * (8) At some point, in a new Oak version, change the default value for 
> hardLimit to some reasonable number, e.g. 1 MB.
> The "property length" is just one case. There are multiple candidates:
>         
> * Number of properties for a node
> * Number of elements for multi-valued properties
> * Total size of a node (including inlined properties)
> * Number of direct child nodes for orderable child nodes
> * Number of direct child nodes for non-orderable child nodes
> * Size of transaction
> * Adding observations listeners that listen for all changes (global listeners)
> For those cases, new Jira issues should be made.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
