Re: [Dspace-tech] Filter-media on TEXT files

helix84 Tue, 18 Sep 2012 06:33:20 -0700

On Tue, Sep 18, 2012 at 2:48 PM, Tom Autry <tom.au...@coffingco.com> wrote:
> Unless I’ve missed something, it seems that the filter-media only works on
> files that need to be extracted first (i.e., from word, pdf, etc.) and
> doesn’t do any work on files that are already TEXT.  Therefore, the
> information from these files does not get put into the search indices and is
> not found.  I believe I’ve seen some configurations that use the HTML
> extractor on TEXT files but this seems to fail on larger TEXT files.  Has
> anyone else had this problem or suggestions on a different media filter
> plugin to put a copy of these into the TEXT bundle so that they are properly
> searched?


Interesting problem; that hadn't occurred to me.

However, although it's not intuitive, according to the documentation the
MS Word filter should be able to filter plain text files:
"Word Text Extractor
org.dspace.app.mediafilter.WordFilter
extracts the full text of Microsoft Word or Plain Text documents for
full text indexing. (Uses the "Microsoft Word Text Mining" tools.)"

Can you verify that?

If it doesn't work, we should probably write such an "identity
transformation" filter (please file a Jira issue for it).
As a workaround, you can just copy the text file to the "TEXT" bundle
and run index-update.
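
To make the "identity transformation" idea concrete, here is a rough,
standalone sketch of what such a filter could do. It does not depend on
DSpace and I haven't tried it inside DSpace; the class name
PlainTextIdentityFilter is made up, and the method names and return values
(".txt" suffix, "Text" format, "Extracted text" description) are only
assumptions modelled on how the existing media filters appear to behave.
A real filter would need to plug into the media filter API and be
registered in the configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: passes plain text through unchanged so that a copy
// can be stored in the TEXT bundle. Not an actual DSpace class.
class PlainTextIdentityFilter {

    // Assumption: derived bitstreams get a ".txt" suffix, like other text extractors.
    public String getFilteredName(String oldFilename) {
        return oldFilename + ".txt";
    }

    // The bundle that full-text indexing reads from, as discussed in this thread.
    public String getBundleName() {
        return "TEXT";
    }

    // Assumption: short name of the plain-text bitstream format.
    public String getFormatString() {
        return "Text";
    }

    // Assumption: description attached to the derived bitstream.
    public String getDescription() {
        return "Extracted text";
    }

    // The "identity transformation": the output bytes equal the input bytes.
    // Simplification: the whole file is buffered in memory, which may not be
    // suitable for very large text files.
    public InputStream getDestinationStream(InputStream source) {
        return new ByteArrayInputStream(readAll(source));
    }

    // Helper that drains a stream into a byte array.
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        PlainTextIdentityFilter filter = new PlainTextIdentityFilter();
        byte[] original = "some plain text".getBytes(StandardCharsets.UTF_8);
        byte[] filtered = readAll(
                filter.getDestinationStream(new ByteArrayInputStream(original)));
        System.out.println(filter.getBundleName() + "/"
                + filter.getFilteredName("notes.txt") + ": "
                + new String(filtered, StandardCharsets.UTF_8));
    }
}
```

The only interesting part is getDestinationStream, which returns the input
bytes untouched; everything else just tells the caller where to put the
copy and what to call it.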

Regards,
~~helix84

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech