On Tue, Sep 18, 2012 at 03:32:09PM +0200, helix84 wrote:
> On Tue, Sep 18, 2012 at 2:48 PM, Tom Autry <tom.au...@coffingco.com> wrote:
> > Unless I’ve missed something, it seems that the filter-media only works on
> > files that need to be extracted first (i.e., from word, pdf, etc.) and
> > doesn’t do any work on files that are already TEXT. Therefore, the
> > information from these files do not get put into the search indices and are
> > not found. I believe I’ve seen some configurations that use the HTML
> > extractor on TEXT files but this seems to fail on larger TEXT files. Has
> > anyone else had this problem or suggestions on a different media filter
> > plugin to put a copy of these into the TEXT bundle so that they are properly
> > searched?
>
> Interesting problem, that didn't occur to me.
>
> However, although it's not intuitive, according to documentation the
> MS Word filter should be able to filter plain text files:
> "Word Text Extractor
> org.dspace.app.mediafilter.WordFilter
> extracts the full text of Microsoft Word or Plain Text documents for
> full text indexing. (Uses the "Microsoft Word Text Mining" tools.)"
>
> Can you verify that?
>
> If it doesn't work, we should probably write such "identity
> transformation" filter (file a Jira issue).
> As a workaround, you can just copy the text file to the "TEXT" bundle
> and run index-update.
I don't understand: why would there be any need to extract plain text
from a bitstream that's already plain text? Just index it. The point
of text extraction is to create a plain-text bitstream for the indexer
to digest.

--
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Asking whether markets are efficient is like asking whether people are smart.
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech