Hi all,
I mailed earlier today, but it doesn't look like the mail went through.
Apologies if you hear from me twice in one day on this same topic.
Here's our situation -
We have a user placing PDFs into DSpace as part of a complex item with multiple
files and file types. She has set some of the bitstreams to be hidden, but once
the text has been extracted, a search will still turn up the extracted text,
even though both the PDF and the generated TXT file are hidden.
So - if the name "Ryan Steans" appeared in a hidden PDF, the PDF itself would
stay hidden, but a search in DSpace would still turn up "Ryan Steans" along
with about 500 characters of the text surrounding that name.
Some additional details on our use case -
* The item itself is public and is set to anonymous READ.
* This particular item has 1 MP4 and 3 PDFs as bitstreams; all except for
1 PDF are set to anonymous READ.
* None of the bitstreams are set as the primary bitstream for the item.
* The 1 PDF that is set to restricted READ is the one that the media filter
is parsing and inserting into the "fulltext" value in Solr. The other 2 PDFs
are not being indexed as fulltext, and their contents are not searchable
through Discovery (or in Solr directly).
* The generated TXT file from the PDF has the same permissions (restricted)
as the original source bitstream.
The problem is that if you search for anything in the fulltext of the
restricted PDF, the item shows up in the results and the first ~500 characters
of the text extracted from the restricted file are displayed in the search
results.
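For anyone who wants to check the same thing on their own instance, here is a
minimal sketch of building a Solr query that looks for a phrase in the
"fulltext" field. The base URL, the core name ("search"), and the "handle"
field used for filtering are assumptions about a typical Discovery setup, not
details confirmed from our configuration; adjust them to match your install.

```python
from urllib.parse import urlencode

# Assumption: Solr location and core name; change to match your install.
SOLR_BASE = "http://localhost:8080/solr/search"


def fulltext_query_url(term, handle=None, base=SOLR_BASE):
    """Build a Solr select URL that searches the "fulltext" field for a phrase.

    `handle` is an optional item handle used to narrow the search to one item;
    the "handle" field name is an assumption about the Solr schema.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'fulltext:"{escaped}"'),
        ("wt", "json"),
        ("rows", "10"),
    ]
    if handle:
        params.append(("fq", f'handle:"{handle}"'))
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Paste the printed URL into a browser or curl on the Solr host.
    print(fulltext_query_url("Ryan Steans"))
```

If that query returns the item even though the only bitstream containing the
phrase is restricted, the restricted text has been indexed into "fulltext".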
Looking to see if this is something anyone else has seen.
Is this an indexing problem? Have we found a bug?
Thanks,
Ryan Steans
Director of Operations
Texas Digital Library
512-495-4403
Web: http://www.tdl.org/
Twitter: @TxDigLibrary<http://twitter.com/TXDigLibrary>
Facebook: http://www.facebook.com/texasdigitallibrary
Join the e-mail list: http://tdl.org/news/newsletters/newsletter-signup/
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette