A followup to this thread. I've logged a bug for this discussion, as I'm 
fairly certain it *is* a bug.

https://jira.duraspace.org/browse/DS-2442

We still do need a volunteer to help us resolve this, hopefully even in 
time for 5.1.
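
For anyone thinking about volunteering, here is a rough sketch of the kind of check that appears to be missing, based on my reading of the code in the message quoted below. It is plain Python rather than DSpace's Java, and everything in it (the Item and Bitstream classes, fulltext_to_index, the group names) is made up for illustration. None of it is the real DSpace API, and a real fix would need to go through DSpace's own authorization checks.

```python
from dataclasses import dataclass, field

ANONYMOUS = "Anonymous"  # illustrative name for the anonymous group


@dataclass
class Bitstream:
    """Hypothetical stand-in for one extracted-text (TXT) bitstream."""
    name: str
    text: str
    read_groups: set = field(default_factory=set)  # groups holding READ on this file


@dataclass
class Item:
    """Hypothetical stand-in for an Item plus its extracted-text bitstreams."""
    read_groups: set
    text_bitstreams: list


def fulltext_to_index(item):
    """Return only the extracted text that is no more restricted than the item.

    Simplifying assumption: a file is safe to index only if every group
    that can READ the item can also READ the file (a plain subset test).
    """
    return [
        b.text
        for b in item.text_bitstreams
        if item.read_groups <= b.read_groups
    ]


# Ryan's scenario in miniature: a public item with one restricted file.
item = Item(
    read_groups={ANONYMOUS},
    text_bitstreams=[
        Bitstream("public.pdf.txt", "public text", {ANONYMOUS}),
        Bitstream("hidden.pdf.txt", "restricted text", {"Staff"}),
    ],
)
print(fulltext_to_index(item))  # only the public file's text is kept
```

The behavior I think we have today amounts to skipping the "if" entirely, so the restricted file's text lands in the same fulltext field as everything else.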

- Tim

On 1/13/2015 10:42 AM, Tim Donohue wrote:
> Hi Ryan (and all),
>
> Had a moment this morning to dig a little deeper here...
>
> From what I can tell, it looks like this *may* be the result of a
> flaw/bug in the logic of the Discovery "Access Rights Awareness" feature
> (which is supposed to respect access restrictions on Items).
>
> I believe what may be going on is the following:
>
> 1. Discovery sees the Item as being "Anonymous READ" so it makes the
> Item metadata searchable to anonymous users (which is appropriate in
> most scenarios, obviously).
>
> 2. However, it looks like Discovery may not *check* to see if any Files
> (Bitstreams) are more tightly restricted. So, based on my skimming the
> code, it looks like it assumes that: "If the Item is Anonymous READ,
> then all its Files should just be indexed & searchable". In your
> scenario, this is an obviously wrong assumption as it results in your
> restricted PDF being publicly searchable (and a snippet of that
> restricted PDF's text appears in the search results).
>
> Again though, this is me just *skimming the code* (links below for
> interested developers). I might be misunderstanding something here.
>
> I'm copying in @mire staff (since they helped build this new Discovery
> "Access Rights Awareness" feature into DSpace 4.x). Kevin or Bram, am I
> understanding the code here properly? Have you ever encountered this
> before or know of a workaround/fix?
>
> Thanks,
>
> Tim
>
>
> Relevant Code Links:
> ---------------------
> * SolrServiceResourceRestrictionPlugin seems to be the class that applies
> access restrictions to objects in Discovery/Solr queries, but it only seems
> to be used at the ITEM level (and never for individual Bitstreams):
> https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceResourceRestrictionPlugin.java
>
>
> * SolrServiceImpl is what actually indexes the extracted TEXT
> bitstreams. But, from what I can tell, it NEVER checks to see if the
> extracted TEXT is access restricted (i.e. it just assumes the extracted
> text has the same permissions as the overall item). Here's that area of
> the code:
> https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceImpl.java#L1370
>
>
>
>
> On 1/12/2015 3:59 PM, Steans, Ryan J wrote:
>> Hi all,
>>
>> I mailed earlier today, but it doesn’t look like the mail went through.
>> Apologies if you hear from me twice in one day on this same topic.
>>
>> Here’s our situation –
>>
>> We have a user placing PDFs into DSpace as part of a complex item with
>> multiple files and file types.  She has set some of the bitstreams to be
>> hidden, but once the text has been extracted, despite the fact the
>> actual PDF and TXT file are hidden, a search will turn up the extracted
>> text.
>>
>> So – If the name “Ryan Steans” was in a PDF, but that PDF was hidden –
>> the PDF might be hidden, but my search result in DSpace would turn up
>> “Ryan Steans” and about 500 characters of text surrounding that name.
>>
>> Some additional details on our use case -
>>
>>   * The item itself is public and is set to anonymous READ.
>>   * This particular item has 1 MP4 and 3 PDFs as bitstreams; all except
>>     for 1 PDF are set to anonymous READ.
>>   * None of the bitstreams are set as the primary bitstream for the item.
>>   * The 1 PDF that is set to restricted READ is the one that the media
>>     filter is parsing and inserting into the "fulltext" value in Solr...
>>     the other 2 PDFs are not being indexed as fulltext and their
>>     contents are not searchable through Discovery (or searchable in
>>     Solr).
>>   * The generated TXT file from the PDF has the same permissions
>>     (restricted) as the original source bitstream.
>>
>> The problem is that if you happen to search for anything in the fulltext
>> of the restricted item, it will show up in the results and the first
>> ~500 chars of the parsed-restricted-text file are displayed in the
>> search results.
>>
>> Looking to see if this is something anyone else has seen.
>>
>> Is this an indexing problem?  Have we found a bug?
>>
>> thanks
>>
>> *Ryan Steans*
>>
>> Director of Operations
>>
>> Texas Digital Library
>>
>> 512-495-4403
>>
>> Web: http://www.tdl.org/
>>
>> Twitter: @TxDigLibrary <http://twitter.com/TXDigLibrary>
>>
>> Facebook: http://www.facebook.com/texasdigitallibrary
>>
>> Join the e-mail list: http://tdl.org/news/newsletters/newsletter-signup/
>>
>>
>> _______________________________________________
>> DSpace-tech mailing list
>> DSpace-tech@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>> List Etiquette:
>> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>>
