On 10 Nov 2010, at 00:09, Jos Snellings wrote:

> Thank you for your prompt answer, Ian.
> You mean "the natural way".
> That would be true for a citizen.
> That would be true for a community, so a path could be Stockholm/234987488.
> But to extract a regional indicator, like 'how many applications were handled
> on time during the first half of 2014'. This is something that is not
> requested in the first place,
> but I know it *will* come up. ==> then the user performing this query would
> have read access on all files. Would the query scale better?
>

'how many applications were handled on time during the first half of 2014' implies a date range. IIRC date ranges are problematic in Lucene, and although the query might be OK from a sparse search point of view, the date range might cause a problem. Again, experimentation before committing to implementation is going to remove more of the risk.

Ian

>
> Thanks,
> Jos
>
> On 11/09/2010 07:56 PM, Ian Boston wrote:
>>
>> On 9 Nov 2010, at 13:11, Jos Snellings wrote:
>>
>>> You are right, Ian,
>>>
>>> This question deserves a new thread.
>>> Currently I am drawing up an architecture for a file handling system for
>>> e-government:
>>> permissions are scattered up to:
>>> - the citizen : one active file for a citizen (= folder, infoholder in xml,
>>>   attachments)
>>> - the community : visibility and handling for the citizens of one community
>>> - the regional authority : regional indicators
>>>
>>> This worries me for it is a typical case where you would run into
>>> scalability problems.
>>> Think of 50 000 open applications via that system. With 10 documents per
>>> application
>>> you would have 500 000.
>>>
>> If 1 user only has access to 10 applications, then doing a search that finds
>> 500,000 applications only to return 10 readable ones would not scale, just
>> as a table scan on a RDBMS table containing .5M rows with no index would
>> also not scale.
>>
>>> Is that a nogo for Sling? Would be a pity. I wanted to come up with an
>>> elegant solution :-)
>>>
>>
>> Sling is not the issue here, it's Jackrabbit, and knowing that the above
>> situation does not scale you would do 2 things.
>> Never use that type of search.
>>
>> Access all data via pointers and paths into the data based on something that
>> was not a search.
>> eg if the application was 2919100291
>> you might find the application and all the information in
>> /applications/29/19/10/2919100291
>>
>> and if the user had an ID of e31231231432
>> they might have a folder
>> /users/e3/12/31/23/1432
>> with a sub folder
>> 2919100291
>>
>> containing a property
>> egov:application-path : /applications/29/19/10/2919100291
>>
>> ie you have to model your data to avoid searches and non direct access
>> pathways,
>>
>> but......
>>
>> Please
>> ask on [email protected] as the committers there will be able to give you
>> a complete and honest answer to if Jackrabbit is a No Go.
>> and
>> do some tests to prove to yourself that it will work at the scale that you
>> want.
>>
>> (bash + curl + sling is a good way of doing these sorts of tests)
>>
>>> Jos
>>>
>>> On 11/09/2010 09:22 AM, Ian Boston wrote:
>>>
>>>> Jos,
>>>> If by result you mean a search result, then that's a separate issue from
>>>> the dynamic ACL itself, and not the direct subject of this thread. When I
>>>> said performance I was referring to the atomic act of determining if the
>>>> ACE was active for any attempt to access an item, not just search results.
>>>>
>>>> However,
>>>> that's the way jackrabbit works.
>>>> JCR searches are "compiled" into Lucene Queries that generate Lucene Hits
>>>> where the Lucene document contains a node ID, which is extracted in the
>>>> normal manner from JCR (IIRC). If the current user can't read the item, it's
>>>> discarded.
>>>>
>>>> This is fine for dense searches where most items can be read by the user,
>>>> but problematic for sparse searches.
>>>> It's also problematic for sorts that can't be performed inside Lucene, as
>>>> this results in all the items being loaded into memory before searching.
>>>> One way to avoid sorts of this form is to ban "order by" clauses that
>>>> reference any items other than properties of the node found.
>>>>
>>>> BTW, problematic == non scalable, vertically or horizontally.
>>>> Ian
>>>>
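
The path layout described in the thread (/applications/29/19/10/2919100291 and /users/e3/12/31/23/1432 with a pointer sub folder) can be sketched roughly as below. This is only an illustration in Python: the function names sharded_path and application_pointer are hypothetical, and the shard depth and width are inferred from the two example paths quoted above, not taken from any Sling or Jackrabbit API.

```python
def sharded_path(root, identifier, levels=3, width=2, keep_full_id=True):
    """Build a sharded node path from an ID (hypothetical helper).

    The first `levels` chunks of `width` characters become intermediate
    folders. The leaf is either the full ID (as in the /applications example)
    or the remaining characters (as in the /users example).
    """
    if len(identifier) <= levels * width:
        raise ValueError("identifier too short for the requested sharding")
    shards = [identifier[i * width:(i + 1) * width] for i in range(levels)]
    leaf = identifier if keep_full_id else identifier[levels * width:]
    return "/".join([root.rstrip("/")] + shards + [leaf])


def application_pointer(user_id, application_id):
    """Return the pointer node path under the user's folder and its properties.

    The property name egov:application-path comes from the thread; the
    dict representation is an assumption made for this sketch.
    """
    user_folder = sharded_path("/users", user_id, levels=4, keep_full_id=False)
    pointer_node = f"{user_folder}/{application_id}"
    properties = {
        "egov:application-path": sharded_path("/applications", application_id)
    }
    return pointer_node, properties


if __name__ == "__main__":
    node, props = application_pointer("e31231231432", "2919100291")
    print(node)
    print(props)
```

With this layout a user's applications are found by listing the children of the user's folder and following each egov:application-path property, so no search (and no per-hit ACL filtering) is involved. Whether this holds up at the intended scale would still need the kind of testing suggested above.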
