[ 
https://issues.apache.org/jira/browse/OAK-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150079#comment-17150079
 ] 

Ashish Chopra edited comment on OAK-8523 at 7/2/20, 8:43 AM:
-------------------------------------------------------------

{quote}
bq. webpage-as-primary-key-cache
Sorry I don't know what that means.
{quote}
That was my way of differentiating between the two "types" of structures that 
can store the reference-cache in question.
The following is what I called the "resource-as-primary-key-cache":
{noformat}
/var
  + ref-cache
    + a
      + b
        + my-img.jpg
          + my.site.example.com
            - references = {/my/page/1, /my/page/2}
          + myother.site.example.com
            - references = {/my/page/11}
{noformat}

What follows next is what I called the "webpage-as-primary-key-cache":
{noformat}
/var
  + ref-cache
    + my.site.example.com
      + my
        + page
          + 1
            - resources = {/a/b/my-img.jpg, ...}
          + 2
            - resources = {/a/b/my-img.jpg, ...}
    + myother.site.example.com
      + my
        + page
          + 11
            - resources = {/a/b/my-img.jpg, ...}
{noformat}

(I didn't use "source"/"target" terminology because I think it is easy to get 
confused - sorry).
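
To make the difference between the two layouts concrete, here is a minimal sketch in plain Java. It only models the two trees above as in-memory maps; the class and method names are hypothetical and nothing here is Oak or JCR API. It assumes site names and page paths are concatenated for display.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the two cache layouts; not Oak or JCR API.
class RefCacheLayouts {

    // Resource-keyed layout: resource path -> site -> referencing pages.
    private final Map<String, Map<String, Set<String>>> byResource = new LinkedHashMap<>();

    // Page-keyed layout: site -> page path -> referenced resources.
    private final Map<String, Map<String, Set<String>>> byPage = new LinkedHashMap<>();

    // Records one reference in both layouts so the lookups can be compared.
    void addReference(String site, String page, String resource) {
        byResource.computeIfAbsent(resource, r -> new LinkedHashMap<>())
                  .computeIfAbsent(site, s -> new TreeSet<>())
                  .add(page);
        byPage.computeIfAbsent(site, s -> new LinkedHashMap<>())
              .computeIfAbsent(page, p -> new TreeSet<>())
              .add(resource);
    }

    // Resource-keyed lookup: a single direct read, but the stored
    // set grows with every new reference to the resource.
    List<String> pagesViaResourceKey(String resource) {
        List<String> result = new ArrayList<>();
        Map<String, Set<String>> sites =
                byResource.getOrDefault(resource, Collections.emptyMap());
        sites.forEach((site, pages) -> pages.forEach(p -> result.add(site + p)));
        return result;
    }

    // Page-keyed lookup: every page entry is examined here; in a
    // repository, an indexed query would take the place of this scan.
    List<String> pagesViaPageKey(String resource) {
        List<String> result = new ArrayList<>();
        byPage.forEach((site, pages) -> pages.forEach((page, resources) -> {
            if (resources.contains(resource)) {
                result.add(site + page);
            }
        }));
        return result;
    }
}
```

Both lookups return the same pages; what differs is where the cost sits: in the size of a single multi-value property for the first layout, and in the search for the second.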

bq. if you use a JCR "reference", e.g. using [Node.setProperty(String, 
Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)],
 then you can get the list of references using 
[Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()]...
 Internally it is running a query btw.
Thanks for this suggestion!
AIUI these APIs rely on the jcr:uuid of the resource (i.e., the "source"), right? If 
yes, do you think we should worry about the potential bloat of the UUID index?
bq. The second option is to use a query (not a fulltext query, just a regular 
query).
As in a query on the "resources" property in the webpage-as-primary-key-cache 
(defined above), right? Something like:
{noformat}
select * from [nt:base] as a where a.[resources]='/a/b/my-img.jpg' and 
isdescendantnode(a, '/var/ref-cache')
{noformat}
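
A minimal sketch of how such a statement could be assembled in Java, assuming the resource path comes from application code and may contain a single quote. The class and method names are hypothetical; the resulting string would then be handed to the standard JCR call QueryManager.createQuery(statement, Query.JCR_SQL2), which is not shown here.

```java
// Hypothetical helper for building the query above; not part of the JCR or Oak API.
class RefCacheQuery {

    // JCR-SQL2 string literals are single-quoted; an embedded
    // single quote is escaped by doubling it.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Builds the statement that finds cache entries referencing resourcePath
    // below cacheRoot (e.g. "/var/ref-cache").
    static String referencesTo(String resourcePath, String cacheRoot) {
        return "select * from [nt:base] as a"
             + " where a.[resources] = " + quote(resourcePath)
             + " and isdescendantnode(a, " + quote(cacheRoot) + ")";
    }

    public static void main(String[] args) {
        System.out.println(referencesTo("/a/b/my-img.jpg", "/var/ref-cache"));
    }
}
```

Restricting the query to the cache root rather than a broader path keeps the set of candidate nodes small; whether the query is fast still depends on an index covering the "resources" property.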

{quote}
bq. with resource-as-primary-key-cache, we just traverse right to the resource 
we need to find the reference for and read the prop - no search required.
On the other hand, you have many downsides, like the risk of having too many 
references.
...
you can also store paths. In which case you can't use the out-of-the-box 
feature and have to run the query yourself.
{quote}
It seems there's no easy answer here, as there are tradeoffs involved: some 
approaches are better if the limits of the system/repository aren't expected 
to be reached or breached; others might scale better but come at the cost of 
a slight increase in complexity and an increased dependence on the 
query subsystem (and, of course, an additional index definition plus index 
data if JCR references can't be used).

bq. in summary, I think the warning for large string properties and many 
entries in a multi-value property are justified... In my view, the added 
complexity to run a query, or to use uuids, is worth the trouble to avoid those 
problems.
Thanks for the discussion and recommendations [~thomasm]!
I think I'd like to go back and perhaps get a more concrete idea of the scale 
we'd be operating at, and perhaps see if the warnings start to appear while 
operating at that scale.

----
I have just one additional question (which I asked above as well):
If I start using [Node.setProperty(String, 
Node)|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#setProperty(java.lang.String,%20javax.jcr.Node)],
 and retrieve the list of references via 
[Node.getReferences|https://docs.adobe.com/docs/en/spec/jsr170/javadocs/jcr-2.0/javax/jcr/Node.html#getReferences()]
 as you recommended, should I worry about potential UUID index bloat?



> Best Practices - Property Value Length Limit
> --------------------------------------------
>
>                 Key: OAK-8523
>                 URL: https://issues.apache.org/jira/browse/OAK-8523
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core, jcr
>            Reporter: Thomas Mueller
>            Priority: Major
>
> Right now, Oak supports very large properties (e.g. String). But 1 MB (or 
> larger) properties are problematic in multiple areas like indexing. It is 
> more important for software-as-a-service, where we need to guarantee SLOs, 
> but it also helps other cases. So we should:
> * (1) Document best practises, e.g. "Property values should be smaller than 
> 100 KB".
> * (2) Introduce "softLimit" and "hardLimit", where softLimit is e.g. 100 KB 
> and hardLimit is configurable, and (initially) by default Integer.MAX_VALUE. 
> Setting the hard limits to a lower value by default is problematic, because 
> it can break existing applications. With default value infinity, customers 
> can set lower limits e.g. in tests first, and once they are happy, in 
> production as well.
> * (3) Log a warning if a property is larger than "softLimit". To avoid 
> logging many warnings (if there are many such properties) we then set 
> softLimit = softLimit * 1.1 (reset to 100 KB in the next repository start). 
> Logging is needed to know what _exactly_ is broken (path, stack trace of the 
> actual usage...)
> * (4) Add a metric (monitoring) for detected large properties. Just logging 
> warnings might not be enough.
> * (5) Throttling: we could add flow control (pauses; Thread.sleep) after 
> violations, to improve isolation (to prevent affecting other threads that 
> don't violate the contract).
> * (6) We could expose the violation info in the session, so a framework could 
> check that data after executing custom code, and add more info (e.g. log).
> * (7) If larger than the configurable hardLimit, fail the commit or reject 
> setProperty (throw an exception).
> * (8) At some point, in a new Oak version, change the default value for 
> hardLimit to some reasonable number, e.g. 1 MB.
> The "property length" is just one case. There are multiple candidates:
>         
> * Number of properties for a node
> * Number of elements for multi-valued properties
> * Total size of a node (including inlined properties)
> * Number of direct child nodes for orderable child nodes
> * Number of direct child nodes for non-orderable child nodes
> * Size of transaction
> * Adding observation listeners that listen for all changes (global listeners)
> For those cases, new Jira issues should be created.
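
Items (3) and (7) of the description can be sketched as follows. This is only an illustration under stated assumptions: the class name, method names, and the use of IllegalArgumentException are hypothetical, it is not how Oak implements the check, and logging goes to stderr for brevity.

```java
// Hypothetical sketch of the softLimit/hardLimit proposal; not Oak API.
class PropertySizeLimits {

    // 100 KB, the example soft limit from the description.
    static final long INITIAL_SOFT_LIMIT = 100L * 1024;

    private long softLimit = INITIAL_SOFT_LIMIT;
    private final long hardLimit;
    private int warnings;

    // The description proposes Integer.MAX_VALUE as the initial default hard limit.
    PropertySizeLimits(long hardLimit) {
        this.hardLimit = hardLimit;
    }

    // Returns true if a warning was logged for this property.
    boolean check(String path, long sizeInBytes) {
        if (sizeInBytes > hardLimit) {
            // Item (7): reject the value outright.
            throw new IllegalArgumentException("Property " + path + " has "
                    + sizeInBytes + " bytes, above hardLimit " + hardLimit);
        }
        if (sizeInBytes > softLimit) {
            // Item (3): warn, then raise the soft limit by 10% so that
            // many similar properties do not flood the log.
            warnings++;
            System.err.println("WARN large property " + path + ": "
                    + sizeInBytes + " bytes > softLimit " + softLimit);
            softLimit = (long) (softLimit * 1.1);
            return true;
        }
        return false;
    }

    long softLimit() {
        return softLimit;
    }

    int warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        PropertySizeLimits limits = new PropertySizeLimits(Integer.MAX_VALUE);
        limits.check("/content/a", 101L * 1024); // logs a warning
        limits.check("/content/b", 101L * 1024); // silent: soft limit was raised
    }
}
```

Creating a fresh instance corresponds to the "reset to 100 KB in the next repository start" part of item (3); the warning counter is a stand-in for the metric mentioned in item (4).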


