Hi all,

I mailed earlier today, but it doesn't look like the mail went through.  
Apologies if you hear from me twice in one day on this same topic.

Here's our situation -

We have a user placing PDFs into DSpace as part of a complex item with multiple 
files and file types.  She has set some of the bitstreams to be hidden, but 
once the text has been extracted, a search will still turn up the extracted 
text, even though both the actual PDF and the TXT file are hidden.

So - if the name "Ryan Steans" was in a PDF, and that PDF was hidden, the PDF 
itself would stay hidden, but my search result in DSpace would turn up 
"Ryan Steans" and about 500 characters of the text surrounding that name.

Some additional details on our use case -

  *   The item itself is public and is set to anonymous READ.
  *   This particular item has 1 MP4 and 3 PDFs as bitstreams; all except 
1 PDF are set to anonymous READ.
  *   None of the bitstreams is set as the primary bitstream for the item.
  *   The 1 PDF that is set to restricted READ is the one that the media filter 
is parsing and inserting into the "fulltext" value in Solr.  The other 2 PDFs 
are not being indexed as fulltext, and their contents are not searchable through 
Discovery (or in Solr).
  *   The generated TXT file from the PDF has the same permissions (restricted) 
as the original source bitstream.

The problem is that if you happen to search for anything in the full text of the 
restricted PDF, the item will show up in the results, and the first ~500 
characters of the extracted restricted text are displayed in the search results.
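
To make the behavior concrete, here is a toy sketch in Python.  It is not 
DSpace or Solr code; the class, function, and group names are all made up for 
illustration, and it is simplified (for example, it assumes a public PDF also 
has extracted text).  It only contrasts what we are seeing (extracted text 
indexed regardless of the READ policy) with what we expected (only 
anonymously readable text ends up in "fulltext").

```python
from dataclasses import dataclass

ANONYMOUS = "Anonymous"  # hypothetical group name for this sketch


@dataclass
class Bitstream:
    name: str
    read_groups: frozenset    # groups allowed to READ this bitstream
    extracted_text: str = ""  # text pulled out by the media filter, if any


def build_fulltext(bitstreams, respect_read_policy):
    """Concatenate extracted text into one 'fulltext' value for the item."""
    parts = []
    for b in bitstreams:
        if not b.extracted_text:
            continue
        if respect_read_policy and ANONYMOUS not in b.read_groups:
            continue  # skip text that anonymous users may not read
        parts.append(b.extracted_text)
    return " ".join(parts)


def search_snippet(fulltext, term, width=500):
    """Return up to `width` characters around the first hit, or None."""
    pos = fulltext.find(term)
    if pos == -1:
        return None
    start = max(0, pos - width // 2)
    return fulltext[start:start + width]


item = [
    Bitstream("video.mp4", frozenset({ANONYMOUS})),
    Bitstream("public.pdf", frozenset({ANONYMOUS}), "Public report text."),
    Bitstream("hidden.pdf", frozenset({"Staff"}),
              "Confidential note by Ryan Steans."),
]

observed = build_fulltext(item, respect_read_policy=False)
expected = build_fulltext(item, respect_read_policy=True)

print(search_snippet(observed, "Ryan Steans"))  # restricted text leaks out
print(search_snippet(expected, "Ryan Steans"))  # None
```

In the "observed" case the snippet contains text from the hidden PDF, which is 
what our search results look like.  In the "expected" case the search finds 
nothing, because the restricted text never reached the index.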

Looking to see if this is something anyone else has seen.

Is this an indexing problem?  Have we found a bug?

thanks


Ryan Steans
Director of Operations
Texas Digital Library
512-495-4403

Web: http://www.tdl.org/
Twitter: @TxDigLibrary<http://twitter.com/TXDigLibrary>
Facebook: http://www.facebook.com/texasdigitallibrary
Join the e-mail list: http://tdl.org/news/newsletters/newsletter-signup/


_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
