On Tue, Sep 18, 2012 at 2:48 PM, Tom Autry <tom.au...@coffingco.com> wrote:
> Unless I’ve missed something, it seems that the filter-media only works on
> files that need to be extracted first (i.e., from word, pdf, etc.) and
> doesn’t do any work on files that are already TEXT. Therefore, the
> information from these files do not get put into the search indices and are
> not found. I believe I’ve seen some configurations that use the HTML
> extractor on TEXT files but this seems to fail on larger TEXT files. Has
> anyone else had this problem or suggestions on a different media filter
> plugin to put a copy of these into the TEXT bundle so that they are properly
> searched?

Interesting problem; that hadn't occurred to me. However, although it's not
intuitive, according to the documentation the MS Word filter should be able
to filter plain text files:

"Word Text Extractor org.dspace.app.mediafilter.WordFilter extracts the full
text of Microsoft Word or Plain Text documents for full text indexing. (Uses
the "Microsoft Word Text Mining" tools.)"

Can you verify that? If it doesn't work, we should probably write such an
"identity transformation" filter, i.e. one that puts an unchanged copy of
the text into the TEXT bundle (please file a Jira issue).

As a workaround, you can just copy the text file to the "TEXT" bundle and
run index-update.

Regards,
~~helix84

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
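
To make the "identity transformation" idea above a bit more concrete, here is a
minimal standalone sketch in Java. It is not a working DSpace plugin: the class
name IdentityTextFilterSketch, its method names and the ".txt" naming
convention are all assumptions made for illustration, and the class does not
extend any DSpace class. It only shows the core behaviour such a filter would
need, namely handing back the source bytes unchanged and naming "TEXT" as the
target bundle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Standalone sketch of an "identity transformation" filter.
 * ASSUMPTION: method names are illustrative only; this class does not
 * extend or implement any real DSpace type.
 */
public class IdentityTextFilterSketch {

    /** Name for the derived copy (hypothetical naming convention). */
    public String getFilteredName(String sourceName) {
        return sourceName + ".txt";
    }

    /** Bundle that should receive the copy; "TEXT" is the bundle named in this thread. */
    public String getBundleName() {
        return "TEXT";
    }

    /** Identity transformation: the result carries exactly the source bytes. */
    public InputStream getDestinationStream(InputStream source) throws IOException {
        return new ByteArrayInputStream(readFully(source));
    }

    /** Reads a stream to its end in fixed-size chunks and returns all bytes. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IdentityTextFilterSketch filter = new IdentityTextFilterSketch();
        byte[] original = "already plain text".getBytes(StandardCharsets.UTF_8);

        InputStream copy = filter.getDestinationStream(new ByteArrayInputStream(original));
        byte[] result = readFully(copy);

        System.out.println("bundle: " + filter.getBundleName());
        System.out.println("name:   " + filter.getFilteredName("notes.txt"));
        System.out.println("bytes copied: " + result.length);
    }
}
```

Note that this sketch buffers the whole file in memory for simplicity. Given
the report of failures on larger TEXT files, a real filter would probably
want to pass the stream through without buffering it.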