Hi Ryan (and all),

Had a moment this morning to dig a little deeper here...

From what I can tell, it looks like this *may* be the result of a 
flaw/bug in the logic of the Discovery "Access Rights Awareness" feature 
(which is supposed to respect access restrictions on Items).

I believe what may be going on is the following:

1. Discovery sees the Item as being "Anonymous READ" so it makes the 
Item metadata searchable to anonymous users (which is appropriate in 
most scenarios obviously)

2. However, it looks like Discovery may not *check* to see if any Files 
(Bitstreams) are more tightly restricted. So, based on my skimming the 
code, it looks like it assumes that: "If the Item is Anonymous READ, 
then all its Files should just be indexed & searchable". In your 
scenario, this is an obviously wrong assumption as it results in your 
restricted PDF being publicly searchable (and a snippet of that 
restricted PDF's text appears in the search results).
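
To make the suspected difference concrete, here is a rough Java sketch. 
It does NOT use DSpace's classes: FulltextIndexingSketch, TextFile and 
Item are hypothetical stand-ins I made up for illustration, and the real 
indexing code may well be structured differently. The first method shows 
what I think is happening, the second what I would have expected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT DSpace classes.
final class FulltextIndexingSketch {

    /** An extracted-text file plus whether anonymous users may READ it. */
    static final class TextFile {
        final boolean anonymousRead;
        final String extractedText;

        TextFile(boolean anonymousRead, String extractedText) {
            this.anonymousRead = anonymousRead;
            this.extractedText = extractedText;
        }
    }

    /** An Item with its own anonymous READ flag and its extracted-text files. */
    static final class Item {
        final boolean anonymousRead;
        final List<TextFile> files;

        Item(boolean anonymousRead, List<TextFile> files) {
            this.anonymousRead = anonymousRead;
            this.files = files;
        }
    }

    /** Suspected behaviour: only the Item-level policy is consulted. */
    static List<String> fulltextByItemPolicyOnly(Item item) {
        List<String> indexed = new ArrayList<>();
        if (item.anonymousRead) {
            for (TextFile file : item.files) {
                indexed.add(file.extractedText);
            }
        }
        return indexed;
    }

    /** Expected behaviour: each file's own policy is consulted as well. */
    static List<String> fulltextRespectingFilePolicy(Item item) {
        List<String> indexed = new ArrayList<>();
        if (item.anonymousRead) {
            for (TextFile file : item.files) {
                if (file.anonymousRead) {
                    indexed.add(file.extractedText);
                }
            }
        }
        return indexed;
    }
}
```

With a public Item holding one public and one restricted file, the first 
method returns the text of both files, while the second returns only the 
public file's text.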

Again though, this is me just *skimming the code* (links below for 
interested developers). I might be misunderstanding something here.

I'm copying in @mire staff (since they helped build this new Discovery 
"Access Rights Awareness" feature into DSpace 4.x). Kevin or Bram, am I 
understanding the code here properly? Have you ever encountered this 
before or know of a workaround/fix?

Thanks,

Tim


Relevant Code Links:
---------------------
* SolrServiceResourceRestrictionPlugin seems to be the class that applies 
access restrictions to objects in Discovery/Solr queries, but it only 
seems to be used at the ITEM level (and never for individual Bitstreams): 
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceResourceRestrictionPlugin.java

* SolrServiceImpl is what actually indexes the extracted TEXT 
bitstreams. But, from what I can tell, it NEVER checks to see if the 
extracted TEXT is access restricted (i.e. it just assumes the extracted 
text has the same permissions as the overall item). Here's that area of 
the code: 
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceImpl.java#L1370
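
For developers following the first link, here is my rough mental model of 
why an Item-level restriction cannot protect individual files. This is a 
sketch under assumptions, not the plugin's actual code: the class 
ReadFilterSketch is hypothetical, and the "read" field name and the 
"g"/"e" token prefixes (group / user) are assumptions for illustration. 
The idea is that each indexed document carries the tokens allowed to 
READ it, and every search is filtered by the caller's tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the field name and token format are assumptions.
final class ReadFilterSketch {

    /** Tokens for the caller: anonymous group, then user, then groups. */
    static List<String> callerTokens(Integer userId, List<Integer> groupIds) {
        List<String> tokens = new ArrayList<>();
        tokens.add("g0"); // assumed token for the anonymous group
        if (userId != null) {
            tokens.add("e" + userId);
        }
        for (Integer groupId : groupIds) {
            tokens.add("g" + groupId);
        }
        return tokens;
    }

    /** Renders the tokens as a Solr-style filter, e.g. read:(g0 OR e7). */
    static String readFilter(Integer userId, List<Integer> groupIds) {
        return "read:(" + String.join(" OR ", callerTokens(userId, groupIds)) + ")";
    }

    /** A document matches if it carries at least one of the caller's tokens. */
    static boolean matches(List<String> documentTokens, List<String> callerTokens) {
        for (String token : callerTokens) {
            if (documentTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

If the tokens on a document come only from the Item's policies, then a 
public Item matches the anonymous filter as a whole, including any 
fulltext that was indexed into it from a restricted file.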



On 1/12/2015 3:59 PM, Steans, Ryan J wrote:
> Hi all,
>
> I mailed earlier today, but it doesn’t look like the mail went through.
> Apologies if you hear from me twice in one day on this same topic.
>
> Here’s our situation –
>
> We have a user placing PDFs into DSpace as part of a complex item with
> multiple files and file types.  She has set some of the bitstreams to be
> hidden, but once the text has been extracted, despite the fact the
> actual PDF and TXT file are hidden, a search will turn up the extracted
> text.
>
> So – If the name “Ryan Steans” was in a PDF, but that PDF was hidden –
> the PDF might be hidden, but my search result in DSpace would turn up
> “Ryan Steans” and about 500 characters of text surrounding that name.
>
> Some additional details on our use case -
>
>   * The item itself is public and is set to anonymous READ.
>   * This particular item has 1 Mp4 and 3 PDF's as bitstreams, all except
>     for 1 PDF are set to anonymous READ.
>   * None of the bitstreams are set as the primary bitstream for the item.
>   * the 1 PDF that is set to restricted READ is the one that the media
>     filter is parsing and inserting into the "fulltext" value in solr...
>     the other 2 PDF's are not being indexed as fulltext and their
>     contents are not searchable through Discovery (or searchable in SOLR).
>   * The generated TXT file from the PDF has the same permissions
>     (restricted) as the original source bitstream.
>
> The problem is that if you happen to search for anything in the fulltext
> of the restricted item, it will show up in the results and the first
> ~500 chars of the parsed-restricted-text file are displayed in the
> search results.
>
> Looking to see if this is something anyone else has seen.
>
> Is this an indexing problem?  Have we found a bug?
>
> thanks
>
> *Ryan Steans*
>
> Director of Operations
>
> Texas Digital Library
>
> 512-495-4403
>
> Web: http://www.tdl.org/
>
> Twitter: @TxDigLibrary <http://twitter.com/TXDigLibrary>
>
> Facebook: http://www.facebook.com/texasdigitallibrary
>
> Join the e-mail list: http://tdl.org/news/newsletters/newsletter-signup/
>
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> List Etiquette: 
> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>
