RE: indexing documents (or pieces of a document) by access controls
> Hello,
>
> When I had those kinds of problems (less complex) with Lucene, the only idea was to filter from the front end, according to the ACL policy. Lucene docs and fields weren't protected, but tagged. Searching was always applied with an audience field, with hierarchical values like public, reserved, protected, secret, so that a public document also has the secret value and is found by audience:secret, according to the rights of the user who searches. As for the fields, the ones not allowed for a given user were stripped.

Yes, I know this is a possibility... but we happen to want faceted navigation combined with authorisation. I am attacking the problem by keeping data derived from Lucene in memory, all translated into byte/int values. The hardest part is keeping the derived data in sync with Lucene *and* with the different Jackrabbit users (some have changes in their session that are not yet saved).

Anyway, I can do faceted authorisation plus counting in less than 20 ms for 1,000,000 documents (normal PC), so hopefully I can succeed. I must admit, OTOH, that I did not find some sort of ingenious algorithm, but merely depend on the speed of the processor: doubling the number of documents means doubling the response time and the memory needed (though 1,000,000 docs fitted in 25 MB, so 40,000,000 in a GB... that is fine by me).

> Maybe you can have a look at the XML database eXist? The search engine, XQuery-based, is not focused on the same goals as Lucene, but I can promise you that no query will ever return results from documents you are not allowed to read.

I did not look at it, but my feeling is that it is not fast enough.

Regards
Ard

> --
> Frédéric Glorieux
> École nationale des chartes
> direction des nouvelles technologies et de l'informatique
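A minimal sketch of the in-memory idea described in this reply, not Ard's actual implementation: each document is reduced to a few small integers, and authorised facet counts come from one linear pass. The column names, the one-ACL-id-per-document layout, and the sample data are all assumptions made for illustration.

```python
from array import array
from collections import Counter

# Hypothetical derived data: one small integer per document and per column.
# facet_ids[i] is the facet value of document i, acl_ids[i] its ACL class.
facet_ids = array("B", [0, 1, 1, 2, 0, 1])
acl_ids = array("B", [0, 0, 1, 2, 1, 2])


def faceted_counts(facet_ids, acl_ids, allowed_acls):
    """Count documents per facet value, skipping unauthorised documents.

    One linear pass: time and memory grow in proportion to the number
    of documents, which is the scaling behaviour the message reports.
    """
    allowed = [False] * 256          # lookup table indexed by ACL id (0..255)
    for acl in allowed_acls:
        allowed[acl] = True
    counts = Counter()
    for facet, acl in zip(facet_ids, acl_ids):
        if allowed[acl]:
            counts[facet] += 1
    return dict(counts)


counts = faceted_counts(facet_ids, acl_ids, allowed_acls={0, 1})
print(counts)

# Memory estimate from the message: 25 MB per 1,000,000 documents,
# taking 1 GB as 1000 MB.
docs_per_gb = 1_000_000 * 1000 // 25
print(docs_per_gb)
```

The sketch leaves out the part the message calls hardest: keeping these arrays in sync with the index and with unsaved per-user session changes.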
RE: indexing documents (or pieces of a document) by access controls
Hello,

> Given the requirement to break down a document into separately controlled pieces, I'd create a servlet that fronts the Solr servlet and handles this conversion. I could think of ways to do it using Solr, but they feel like unnatural acts.
>
> As a general comment on ACLs, one relatively easy way to handle this is via group ids that you use to restrict the query. Each document has a groupid field with a list of group ids that are authorized to access it. Each user query is converted into (query) AND (groupid:xx OR groupid:yy), where xx/yy (and so on) are the groups that the user belongs to.

With all due respect, I really think the problem is largely underestimated here, and is far more complex than these suggestions... unless we are talking about 100,000 documents, a couple of users, and updating once a day. If you want millions of documents, faceted authorised navigation including counting, a new document indexed every second that should be reflected in the results instantly, and changing authorisations, the problem isn't relatively easy to solve anymore :-)

Regards
Ard

> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> If you can't find it, you can't fix it
RE: indexing documents (or pieces of a document) by access controls
Hello,

> Hi
>
> And about the fields, if they are/aren't going to be present in the responses based on the user group, you can do it in many different ways (using an XML transformation to remove the undesirable fields, implementing your own RequestHandler able to process your group information, filtering the data and showing only what should be shown to the user, ...)

So suppose you want to see 10 documents, but on average you are authorised to see 1 in 100 docs. Then on average you need to fetch 1,000 docs to find 10 results... 1,000 XML transformations; that will be slow. And I left out the fact that you still do not know the number of pages the user is allowed to see, the counting if you want faceted navigation, etc.

Regards
Ard

> Regards,
> Daniel
>
> On 12/6/07 16:14, Ken Krugler [EMAIL PROTECTED] wrote:
>
> > > Hi all,
> > >
> > > Can anyone give me some advice on breaking a document up and indexing it by access control lists? What we have are XML documents that are transformed based on the user viewing them. Some users might see all of the document, while others may see a few fields, and yet others see nothing at all. The access control lists may be a role the user belongs to, a list of groups, or even a combination of the two. I can transform the XML to the plain text that I want to index, key it off the ACLs, and then pass along a list of ACLs that the user issuing a query belongs to when searching. But I guess I'm not really sure how to do this the best way. Anyone have any thoughts?
> >
> > Given the requirement to break down a document into separately controlled pieces, I'd create a servlet that fronts the Solr servlet and handles this conversion. I could think of ways to do it using Solr, but they feel like unnatural acts.
> >
> > As a general comment on ACLs, one relatively easy way to handle this is via group ids that you use to restrict the query. Each document has a groupid field with a list of group ids that are authorized to access it. Each user query is converted into (query) AND (groupid:xx OR groupid:yy), where xx/yy (and so on) are the groups that the user belongs to.
> >
> > -- Ken
>
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
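A rough sketch of the post-filtering cost raised in this reply. It assumes a hypothetical ranked hit list and an authorisation check applied after the search; the function names are invented for illustration. If only 1 hit in N is readable, collecting k results means examining about k × N hits, so 10 results at 1 in 100 means about 1,000 fetches.

```python
def expected_fetches(wanted, one_in):
    """Expected number of hits to examine when 1 hit in `one_in` is readable."""
    return wanted * one_in


def post_filter(hits, is_authorized, wanted):
    """Scan ranked hits in order, keep authorised ones, stop at `wanted`.

    Returns the kept hits and the number of hits that had to be examined.
    """
    kept, examined = [], 0
    for hit in hits:
        examined += 1
        if is_authorized(hit):
            kept.append(hit)
            if len(kept) == wanted:
                break
    return kept, examined


# Hypothetical ranked hit list in which every 100th document is readable.
hits = range(1, 5001)
page, examined = post_filter(hits, lambda doc_id: doc_id % 100 == 0, wanted=10)
print(len(page), examined)
print(expected_fetches(10, 100))
```

The total number of readable hits stays unknown until the whole list has been scanned, which is why page counts and facet counts are not available with this approach.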
Re: indexing documents (or pieces of a document) by access controls
Hello,

> With all due respect, I really think the problem is largely underestimated here, and is far more complex than these suggestions... unless we are talking about 100,000 documents, a couple of users, and updating once a day. If you want millions of documents, faceted authorised navigation including counting, a new document indexed every second that should be reflected in the results instantly, and changing authorisations, the problem isn't relatively easy to solve anymore :-)

When I had those kinds of problems (less complex) with Lucene, the only idea was to filter from the front end, according to the ACL policy. Lucene docs and fields weren't protected, but tagged. Searching was always applied with an audience field, with hierarchical values like public, reserved, protected, secret, so that a public document also has the secret value and is found by audience:secret, according to the rights of the user who searches. As for the fields, the ones not allowed for a given user were stripped.

Maybe you can have a look at the XML database eXist? The search engine, XQuery-based, is not focused on the same goals as Lucene, but I can promise you that no query will ever return results from documents you are not allowed to read.

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique
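The audience tagging described in this message could look roughly like the sketch below. The level names come from the message; the function names and the query syntax are assumptions, not Frédéric's actual code. A document is tagged with its own level and every more restricted one, so a user searching with their own clearance level matches every document at or below that level.

```python
# Levels from least to most restricted, as listed in the message.
LEVELS = ["public", "reserved", "protected", "secret"]


def audience_tags(doc_level):
    """Values to index in the audience field: the document's own level
    and every more restricted level above it."""
    return LEVELS[LEVELS.index(doc_level):]


def restrict(query, user_level):
    """Append the audience clause matching the user's clearance."""
    return f"({query}) AND audience:{user_level}"


def visible(doc_level, user_level):
    """True if a search restricted to `user_level` would match the document."""
    return user_level in audience_tags(doc_level)


print(audience_tags("public"))
print(restrict("title:lucene", "reserved"))
print(visible("protected", "reserved"))
```

This only works for a single ordered scale of levels; group- or role-based ACLs do not fit it.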
RE: indexing documents (or pieces of a document) by access controls
Hello Nate,

IMHO, you will not be able to do this in Solr unless you accept pretty hard constraints on your ACLs (I will get back to this in a moment). IMO, it is not possible to index documents along with ACLs. ACLs can be very fine-grained, and the thing you describe, ACL-specific parts of a document... well, I wouldn't know how you would index this. (Imagine you change the ACL for a specific user. How do you know what to re-index and what not? Suppose you add a user? I really do not think it is possible based on fine-grained ACLs.)

You should also realise you are trying to find an answer to an extremely complex problem: authorisation in an index. (I am trying to develop faceted navigation in combination with authorisation in a Lucene index in Jackrabbit, but I think this is not the place to discuss it.)

So, in your case, if you want to use Solr with some form of ACLs, I think you can basically only manage this if:

1) your ACLs are some sort of paths in a hierarchical structure, where you index the hierarchical structure along with the content. Then when querying you have to include the folders that the user is allowed to see.
2) you keep a bitset for each user recording which documents are allowed (but you even have ACLs inside documents). Also, keeping bitsets up to date for many users is almost impossible, because
   - Lucene document ids may change after merging segments
   - updating documents might mean updating many, many bitsets if you have many users

For these reasons, I do not think you can achieve with solar what you want, unless you are going to work with something like updating the index and the ACL bitsets once a day.

Regards
Ard

> Can anyone give me some advice on breaking a document up and indexing it by access control lists? What we have are XML documents that are transformed based on the user viewing them. Some users might see all of the document, while others may see a few fields, and yet others see nothing at all. The access control lists may be a role the user belongs to, a list of groups, or even a combination of the two. I can transform the XML to the plain text that I want to index, key it off the ACLs, and then pass along a list of ACLs that the user issuing a query belongs to when searching. But I guess I'm not really sure how to do this the best way. Anyone have any thoughts?
>
> Thanks!
> Nate
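Option 1 in the reply above (ACLs expressed as paths in a hierarchy) might be sketched as follows. This is only an illustration under assumptions: the `folder` field name, the path layout, and the helper names are hypothetical, and the query string is generic Lucene-style syntax rather than a tested Solr request. Each document is indexed with all of its ancestor folders, and the query is restricted to the folders the user may see.

```python
def ancestor_paths(path):
    """All folder prefixes of a document path.

    '/news/sport/doc1.xml' -> ['/news', '/news/sport']
    """
    parts = path.strip("/").split("/")[:-1]   # drop the document name
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def folder_filter(allowed_folders):
    """Clause matching documents that live below any allowed folder."""
    if not allowed_folders:
        raise ValueError("user may see no folder; nothing can be returned")
    clauses = " OR ".join(f'folder:"{f}"' for f in sorted(allowed_folders))
    return f"({clauses})"


def is_visible(doc_path, allowed_folders):
    """True if one of the document's ancestor folders is allowed."""
    return any(f in allowed_folders for f in ancestor_paths(doc_path))


print(ancestor_paths("/news/sport/doc1.xml"))
print(folder_filter({"/news/sport", "/intranet"}))
print(is_visible("/news/sport/doc1.xml", {"/news"}))
```

Granting a user a new folder changes only the query, not the index, which is what makes this constraint workable; moving a document to another folder still requires re-indexing it.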
RE: indexing documents (or pieces of a document) by access controls
Excuse me, I meant Solr of course :-)

> For these reasons, I do not think you can achieve with solar
Re: indexing documents (or pieces of a document) by access controls
> Hi all,
>
> Can anyone give me some advice on breaking a document up and indexing it by access control lists? What we have are XML documents that are transformed based on the user viewing them. Some users might see all of the document, while others may see a few fields, and yet others see nothing at all. The access control lists may be a role the user belongs to, a list of groups, or even a combination of the two. I can transform the XML to the plain text that I want to index, key it off the ACLs, and then pass along a list of ACLs that the user issuing a query belongs to when searching. But I guess I'm not really sure how to do this the best way. Anyone have any thoughts?

Given the requirement to break down a document into separately controlled pieces, I'd create a servlet that fronts the Solr servlet and handles this conversion. I could think of ways to do it using Solr, but they feel like unnatural acts.

As a general comment on ACLs, one relatively easy way to handle this is via group ids that you use to restrict the query. Each document has a groupid field with a list of group ids that are authorized to access it. Each user query is converted into (query) AND (groupid:xx OR groupid:yy), where xx/yy (and so on) are the groups that the user belongs to.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it
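The query rewriting described above can be sketched in a few lines. The `groupid` field and the `(query) AND (groupid:xx OR groupid:yy)` shape come from the message; the function name and the rule for what counts as a safe group id are assumptions. A fronting servlet would do the equivalent before passing the query on.

```python
import re

# Assumption: group ids are plain tokens; anything else is rejected so that
# a crafted id cannot alter the structure of the query.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def restrict_to_groups(query, group_ids):
    """Wrap a user query so it only matches documents of the user's groups."""
    if not group_ids:
        raise ValueError("user belongs to no group; nothing may be returned")
    for gid in group_ids:
        if not _SAFE_ID.fullmatch(gid):
            raise ValueError(f"unsafe group id: {gid!r}")
    clause = " OR ".join(f"groupid:{gid}" for gid in group_ids)
    return f"({query}) AND ({clause})"


print(restrict_to_groups("title:solr", ["xx", "yy"]))
```

In Solr the group clause could probably also be sent separately as a filter query (the `fq` parameter) instead of being merged into the main query string.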
Re: indexing documents (or pieces of a document) by access controls
Hi

And about the fields, if they are/aren't going to be present in the responses based on the user group, you can do it in many different ways (using an XML transformation to remove the undesirable fields, implementing your own RequestHandler able to process your group information, filtering the data and showing only what should be shown to the user, ...)

Regards,
Daniel

On 12/6/07 16:14, Ken Krugler [EMAIL PROTECTED] wrote:

> > Hi all,
> >
> > Can anyone give me some advice on breaking a document up and indexing it by access control lists? What we have are XML documents that are transformed based on the user viewing them. Some users might see all of the document, while others may see a few fields, and yet others see nothing at all. The access control lists may be a role the user belongs to, a list of groups, or even a combination of the two. I can transform the XML to the plain text that I want to index, key it off the ACLs, and then pass along a list of ACLs that the user issuing a query belongs to when searching. But I guess I'm not really sure how to do this the best way. Anyone have any thoughts?
>
> Given the requirement to break down a document into separately controlled pieces, I'd create a servlet that fronts the Solr servlet and handles this conversion. I could think of ways to do it using Solr, but they feel like unnatural acts.
>
> As a general comment on ACLs, one relatively easy way to handle this is via group ids that you use to restrict the query. Each document has a groupid field with a list of group ids that are authorized to access it. Each user query is converted into (query) AND (groupid:xx OR groupid:yy), where xx/yy (and so on) are the groups that the user belongs to.
>
> -- Ken

http://www.bbc.co.uk/
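One of the options mentioned above, removing undesirable fields from a response with an XML transformation, might look like this minimal sketch. The document layout (`<doc>` with `<field name="...">` children), the group names, and the visibility policy are all hypothetical; a real Solr response is laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: which fields each group is allowed to see.
VISIBLE_FIELDS = {
    "staff": {"title", "summary", "body"},
    "guest": {"title"},
}


def strip_fields(doc_xml, groups):
    """Return the document with every field the user's groups may not see removed."""
    allowed = set()
    for group in groups:
        allowed |= VISIBLE_FIELDS.get(group, set())
    doc = ET.fromstring(doc_xml)
    # Iterate over a copy of the child list because we remove while looping.
    for field in list(doc.findall("field")):
        if field.get("name") not in allowed:
            doc.remove(field)
    return ET.tostring(doc, encoding="unicode")


sample = '<doc><field name="title">T</field><field name="body">B</field></doc>'
print(strip_fields(sample, ["guest"]))
```

This runs once per returned document, so its cost is multiplied by however many documents have to be fetched to fill a page.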