On Tue, Sep 18, 2012 at 03:32:09PM +0200, helix84 wrote:
> On Tue, Sep 18, 2012 at 2:48 PM, Tom Autry <tom.au...@coffingco.com> wrote:
> > Unless I’ve missed something, it seems that filter-media only works on
> > files whose text needs to be extracted first (e.g., Word, PDF, etc.) and
> > doesn’t do any work on files that are already TEXT.  Therefore, the
> > information from these files does not get put into the search indices and
> > is not found.  I believe I’ve seen some configurations that use the HTML
> > extractor on TEXT files, but this seems to fail on larger TEXT files.  Has
> > anyone else had this problem, or does anyone have suggestions for a
> > different media filter plugin that puts a copy of these files into the
> > TEXT bundle so that they are properly searched?
> 
> Interesting problem, that didn't occur to me.
> 
> However, although it's not intuitive, according to the documentation
> the MS Word filter should be able to filter plain text files:
> "Word Text Extractor
> org.dspace.app.mediafilter.WordFilter
> extracts the full text of Microsoft Word or Plain Text documents for
> full text indexing. (Uses the "Microsoft Word Text Mining" tools.)"
> 
> Can you verify that?
> 
> If it doesn't work, we should probably write such an "identity
> transformation" filter (please file a Jira issue).
> As a workaround, you can just copy the text file to the "TEXT" bundle
> and run index-update.

I don't understand:  why would there be any need to extract plain text
from a bitstream that's already plain text?  Just index it.  The point
of text extraction is to create a plain-text bitstream for the indexer
to digest.
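
For illustration only, the "identity transformation" mentioned above
might look roughly like the sketch below.  It is written in Python
rather than DSpace's Java, and identity_text_filter and filtered_name
are hypothetical names, not part of the DSpace media filter API.  The
".txt" suffix on the derivative name and the UTF-8 encoding are
likewise assumptions.

```python
import io


def identity_text_filter(source, encoding="utf-8"):
    """Return a new byte stream holding the same text as `source`.

    Hypothetical helper: it only illustrates the idea of copying an
    already-plain-text bitstream into a text derivative unchanged.
    """
    raw = source.read()
    # Decode with replacement so stray bytes do not abort the run;
    # valid input in the given encoding passes through untouched.
    text = raw.decode(encoding, errors="replace")
    return io.BytesIO(text.encode("utf-8"))


def filtered_name(original_name):
    # Assumed naming scheme for the derivative: original name + ".txt".
    return original_name + ".txt"
```

A caller would pass in the original bitstream's contents and store
the returned stream under filtered_name(...) in the TEXT bundle; how
that storage step is done is outside this sketch.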

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Asking whether markets are efficient is like asking whether people are smart.


_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
