On 10 Nov 2010, at 00:09, Jos Snellings wrote:

> Thank you for your prompt answer, Ian.
> You mean "the natural way".
> That would be true for a citizen.
> That would be true for a community, so a path could be Stockholm/234987488.
> But what about extracting a regional indicator, like 'how many applications 
> were handled on time during the first half of 2014'? This is something that 
> is not requested in the first place,
> but I know it *will* come up.  ==> then the user performing this query would 
> have read access on all files. Would the query scale better?

'how many applications were handled on time during the first half of 2014' 
implies a date range.
IIRC date ranges are problematic in Lucene, and although the query might be OK 
from a sparse search point of view, the date range might cause a problem. Again, 
experimenting before committing to an implementation is going to remove more of 
the risk.
Ian


> 
> Thanks,
> Jos
> 
> On 11/09/2010 07:56 PM, Ian Boston wrote:
>> 
>> On 9 Nov 2010, at 13:11, Jos Snellings wrote:
>> 
>>   
>>> You are right, Ian,
>>> 
>>> This question deserves a new thread.
>>> Currently I am drawing up an architecture for a file handling system for 
>>> e-government:
>>> permissions are scattered up to:
>>> - the citizen : one active file for a citizen (= folder, infoholder in xml, 
>>> attachments)
>>> - the community :  visibility and handling for the citizens of one community
>>> - the regional authority : regional indicators
>>> 
>>> This worries me for it is a typical case where you would run into 
>>> scalability problems.
>>> Think of 50 000 open applications via that system. With 10 documents per 
>>> application
>>> you would have 500 000.
>>>     
>> If one user only has access to 10 applications, then doing a search that finds 
>> 500,000 applications only to return 10 readable ones would not scale, just 
>> as a table scan on an RDBMS table containing 0.5M rows with no index would 
>> also not scale.
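
To make the numbers above concrete, here is a minimal Python sketch of filtering search hits by read permission after the search has run. It is not Jackrabbit code: `post_filtered_search`, `can_read` and the evenly spread readable IDs are assumptions made purely for illustration.

```python
def post_filtered_search(hits, can_read, limit):
    """Walk the hits in order, keep the readable ones, stop at `limit`.

    Returns (results, checks), where `checks` counts permission tests.
    """
    results = []
    checks = 0
    for node_id in hits:
        checks += 1  # every hit costs one access check, readable or not
        if can_read(node_id):
            results.append(node_id)
            if len(results) == limit:
                break
    return results, checks


# Hypothetical data: 500,000 hits, of which this user may read only 10.
TOTAL_HITS = 500_000
readable = set(range(0, TOTAL_HITS, 50_000))

results, checks = post_filtered_search(
    range(TOTAL_HITS), readable.__contains__, limit=10
)
print(len(results), "readable results after", checks, "access checks")
```

With the readable items spread out like this, the loop performs 450,001 access checks to return 10 results. Following 10 stored paths directly would touch only those 10 items.
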
>> 
>> 
>> 
>>   
>>> Is that a nogo for Sling? Would be a pity. I wanted to come up with an 
>>> elegant solution :-)
>>>     
>> 
>> Sling is not the issue here, it's Jackrabbit. Knowing that the above 
>> situation does not scale, you would do two things:
>> Never use that type of search.
>> 
>> Access all data via pointers and paths into the data based on something that 
>> was not a search, e.g. if the application ID was 2919100291
>> you might find the application and all the information in
>> /applications/29/19/10/2919100291
>> 
>> and if the user had an ID of e31231231432
>> they might have a folder
>> /users/e3/12/31/23/1432
>>      with a sub folder
>>          2919100291
>> 
>> containing a property
>>               egov:application-path : /applications/29/19/10/2919100291
>> 
>> 
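
The path layout above could be derived from the IDs along these lines. This is only a sketch: `sharded_path` and its parameters are hypothetical names, and the shard width (2 characters) and depth are read off the two example paths rather than from any Sling or Jackrabbit convention.

```python
def sharded_path(root, ident, levels=3, width=2, keep_full_leaf=True):
    """Build a path such as /applications/29/19/10/2919100291 from an ID.

    The first `levels` chunks of `width` characters become intermediate
    folders. The leaf is either the full ID or only the remaining characters.
    """
    if len(ident) <= levels * width:
        raise ValueError("ID is too short for the requested shard depth")
    shards = [ident[i * width:(i + 1) * width] for i in range(levels)]
    leaf = ident if keep_full_leaf else ident[levels * width:]
    return "/".join([root.rstrip("/")] + shards + [leaf])


# Application folder: three shard levels, full ID as the leaf.
app_path = sharded_path("/applications", "2919100291")

# User folder: four shard levels, the remaining characters as the leaf.
user_path = sharded_path("/users", "e31231231432", levels=4, keep_full_leaf=False)

# Pointer node under the user folder, holding the path as a property.
pointer_node = user_path + "/2919100291"
pointer_props = {"egov:application-path": app_path}

print(app_path)
print(pointer_node, pointer_props)
```

Given only a user ID, the code can compute the user folder, list its sub folders and follow each `egov:application-path` value, so no query is needed to find that user's applications.
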
>> 
>> i.e. you have to model your data to avoid searches and indirect access 
>> pathways,
>> 
>> but......
>> 
>> Please
>> ask on [email protected], as the committers there will be able to give you 
>> a complete and honest answer as to whether Jackrabbit is a no-go,
>> and
>> do some tests to prove to yourself that it will work at the scale that you 
>> want.
>> 
>> (bash + curl + Sling is a good way of doing this sort of test)
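
One way to prepare such a test is to generate the curl commands with a small script and run them from bash. The sketch below assumes a local Sling instance at `http://localhost:8080` with the default `admin:admin` account and the default POST servlet, which creates a node from form fields. `sling_post_command` and `pointer_commands` are hypothetical helper names, and the paths are the ones used as examples earlier in this thread.

```python
import shlex


def sling_post_command(base_url, path, props, user="admin", password="admin"):
    """Return a shell-safe curl command that POSTs `props` to `path`."""
    args = ["curl", "-u", f"{user}:{password}"]
    for name, value in props.items():
        args += ["-F", f"{name}={value}"]  # one form field per property
    args.append(base_url.rstrip("/") + path)
    return " ".join(shlex.quote(a) for a in args)


def pointer_commands(base_url, user_folder, application_paths):
    """Yield one command per pointer node under the user's folder."""
    for app_path in application_paths:
        app_id = app_path.rsplit("/", 1)[-1]  # last path segment is the ID
        yield sling_post_command(
            base_url,
            f"{user_folder}/{app_id}",
            {"egov:application-path": app_path},
        )


# Assumed instance URL and example paths; adjust them to the system under test.
for cmd in pointer_commands(
    "http://localhost:8080",
    "/users/e3/12/31/23/1432",
    ["/applications/29/19/10/2919100291"],
):
    print(cmd)
```

Redirecting the output to a file gives a script that can be timed with a growing number of applications, to see where access times start to degrade.
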
>> 
>> 
>> 
>>   
>>> Jos
>>> 
>>> 
>>> 
>>> 
>>> On 11/09/2010 09:22 AM, Ian Boston wrote:
>>>     
>>>> Jos,
>>>> If by result you mean a search result, then that's a separate issue from 
>>>> the dynamic ACL itself, and not the direct subject of this thread. When I 
>>>> said performance I was referring to the atomic act of determining if the 
>>>> ACE was active for any attempt to access an item, not just search results.
>>>> 
>>>> 
>>>> However,
>>>> that's the way Jackrabbit works.
>>>> JCR searches are "compiled" into Lucene queries that generate Lucene hits 
>>>> where the Lucene document contains a node ID, which is extracted in the 
>>>> normal manner from JCR (IIRC). If the current user can't read the item, it's 
>>>> discarded.
>>>> 
>>>> This is fine for dense searches where most items can be read by the user, 
>>>> but problematic for sparse searches.
>>>> It's also problematic for sorts that can't be performed inside Lucene, as 
>>>> this results in all the items being loaded into memory before searching.
>>>> One way to avoid sorts of this form is to ban "order by" clauses that 
>>>> reference any items other than properties of the node found.
>>>> 
>>>> 
>>>> BTW, problematic == non scalable, vertically or horizontally.
>>>> Ian
>>>> 
>>>>       
>>>     
>> 
>>   
> 
