Thank you, Ian!
I will write the proposal with these warnings in mind.
Jos
On 11/10/2010 10:05 AM, Ian Boston wrote:
On 10 Nov 2010, at 00:09, Jos Snellings wrote:
Thank you for your prompt answer, Ian.
You mean "the natural way".
That would be true for a citizen.
That would be true for a community, so a path could be Stockholm/234987488.
But what about extracting a regional indicator, such as 'how many applications
were handled on time during the first half of 2014'? This is not something that
is requested in the first place,
but I know it *will* come up. ==> The user performing this query would then
have read access on all files. Would the query scale better?
'how many applications were handled on time during the first half of 2014'
implies a date range.
IIRC date ranges are problematic in Lucene, and although the query might be OK
from a sparse search point of view, the date range might cause a problem. Again,
experimenting before committing to an implementation will remove more of
the risk.
Ian
Thanks,
Jos
On 11/09/2010 07:56 PM, Ian Boston wrote:
On 9 Nov 2010, at 13:11, Jos Snellings wrote:
You are right, Ian,
This question deserves a new thread.
Currently I am drawing up an architecture for a file handling system for
e-government:
permissions are spread across three levels:
- the citizen : one active file for a citizen (= folder, infoholder in xml,
attachments)
- the community : visibility and handling for the citizens of one community
- the regional authority : regional indicators
This worries me, because it is a typical case where you would run into
scalability problems.
Think of 50 000 open applications in that system. With 10 documents per
application
you would have 500 000 documents.
If 1 user only has access to 10 applications, then doing a search that finds
500,000 items only to return the 10 readable ones would not scale, just as a
table scan on an RDBMS table containing 0.5M rows with no index would also not
scale.
Is that a no-go for Sling? That would be a pity. I wanted to come up with an
elegant solution :-)
Sling is not the issue here, it's Jackrabbit, and knowing that the above
situation does not scale, you would do 2 things.
Never use that type of search.
Access all data via pointers and paths into the data, based on something that
was not a search. E.g. if the application was 2919100291
you might find the application and all the information in
/applications/29/19/10/2919100291
and if the user had an ID of e31231231432
they might have a folder
/users/e3/12/31/23/1432
with a sub folder
2919100291
containing a property
egov:application-path : /applications/29/19/10/2919100291
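
A minimal sketch of the layout above, in Python for illustration only. The
function names and the pairs/keep_full_id parameters are hypothetical; the
two example paths and the egov:application-path property are the ones given
above. Note that the two examples shard slightly differently (the application
path keeps the full ID as the last segment, the user path keeps only the
remainder), and the sketch simply reproduces both.

```python
def shard_path(root, item_id, pairs=3, keep_full_id=True):
    """Build a sharded path by splitting the leading characters of an ID
    into two-character segments.

    keep_full_id=True  -> last segment is the whole ID
    keep_full_id=False -> last segment is what remains after the shards
    """
    if len(item_id) < 2 * pairs:
        raise ValueError("ID too short for the requested number of shards")
    segments = [item_id[i:i + 2] for i in range(0, 2 * pairs, 2)]
    leaf = item_id if keep_full_id else item_id[2 * pairs:]
    if not leaf:
        raise ValueError("nothing left of the ID to use as the last segment")
    return "/".join([root.rstrip("/")] + segments + [leaf])


def user_pointer(user_id, application_id):
    """Describe the pointer node stored under the user's folder.

    Returns the path of the sub folder plus the property that points at
    the application, so no search is needed to get from user to application.
    """
    user_folder = shard_path("/users", user_id, pairs=4, keep_full_id=False)
    return {
        "node": user_folder + "/" + application_id,
        "egov:application-path": shard_path("/applications", application_id),
    }


print(shard_path("/applications", "2919100291"))
print(user_pointer("e31231231432", "2919100291"))
```

Listing a user's applications then means listing the children of the user's
folder and following each pointer, which touches only the nodes that user is
meant to see.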
i.e. you have to model your data to avoid searches and indirect access pathways,
but...
Please
ask on [email protected], as the committers there will be able to give you a
complete and honest answer as to whether Jackrabbit is a no-go,
and
do some tests to prove to yourself that it will work at the scale that you want.
(bash + curl + sling is a good way of doing this sort of test)
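
For what it is worth, the same kind of test can be sketched in Python instead
of bash + curl. This is only a rough harness under assumptions that are not
in this thread: a Sling instance at http://localhost:8080, content readable
without authentication, and the usual Sling behaviour of rendering a node as
JSON when ".json" is appended to its path. The function names are made up
for illustration.

```python
import time
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: local test instance


def fetch_json(path, base_url=BASE_URL):
    """GET <base_url><path>.json and return the raw response body."""
    with urllib.request.urlopen(base_url + path + ".json") as response:
        return response.read()


def time_lookups(paths, fetch=fetch_json):
    """Fetch every path once and return the elapsed seconds per request.

    `fetch` is injectable so the harness can be tried without a server.
    """
    timings = []
    for path in paths:
        start = time.perf_counter()
        fetch(path)
        timings.append(time.perf_counter() - start)
    return timings


# Example run against a populated instance (commented out because it
# needs a running server):
# timings = time_lookups(["/applications/29/19/10/2919100291"])
# print(max(timings), sum(timings) / len(timings))
```

Populating the repository with a realistic number of nodes first, then
comparing these direct lookups against the equivalent search, is the
comparison that matters.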
Jos
On 11/09/2010 09:22 AM, Ian Boston wrote:
Jos,
If by result you mean a search result, then that's a separate issue from the
dynamic ACL itself, and not the direct subject of this thread. When I said
performance I was referring to the atomic act of determining if the ACE was
active for any attempt to access an item, not just search results.
However,
that's the way Jackrabbit works.
JCR searches are "compiled" into Lucene Queries that generate Lucene Hits, where
each Lucene document contains a node ID; the node is then loaded from JCR in the
normal manner (IIRC). If the current user can't read the item, it's discarded.
This is fine for dense searches where most items can be read by the user, but
problematic for sparse searches.
It's also problematic for sorts that can't be performed inside Lucene, as this
results in all the items being loaded into memory before sorting.
One way to avoid sorts of this form is to ban "order by" clauses that reference
any items other than properties of the node found.
BTW, problematic == non-scalable, vertically or horizontally.
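
To make the dense vs. sparse point concrete, here is a toy model in Python.
It is not Jackrabbit code and the function name is hypothetical; it only
mimics the shape of the process described above (every hit is checked against
what the user can read, unreadable hits are discarded) and counts the checks.
The numbers are the ones from this thread: 500,000 hits, 10 readable.

```python
def post_filtered_search(hits, readable):
    """Filter search hits by read access, one hit at a time.

    Returns the readable hits and the number of access checks performed,
    which always equals the number of hits, not the number of results.
    """
    results = []
    checks = 0
    for node_id in hits:
        checks += 1
        if node_id in readable:
            results.append(node_id)
    return results, checks


all_hits = range(500_000)       # every item matching the query
readable = set(range(10))       # the few items this user may read

results, checks = post_filtered_search(all_hits, readable)
print(len(results), "results after", checks, "access checks")
```

In the sparse case the cost is driven by the 500,000 hits; in a dense search,
where most hits are readable, the same number of checks yields roughly as many
results, so the work is not wasted.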
Ian