Hi,

We have binary documents that we want to index (e.g. Word, Excel, PowerPoint, 
PDF, etc.), and we know we can index them with Verity using CFINDEX with the 
type=file attribute.  However, we want to index their content along with other 
content from our database.  So we are looking at extracting the indexable text 
from these binary documents, putting it into a query along with the other text 
we want to index, and then indexing THAT query with CFINDEX.

Does anyone know the best way to convert these binary documents to text using 
ColdFusion?  If so, does it require a third-party tool, or can it be done with 
native CF tags/functions?

FYI - I have tried using CFFILE action=readbinary and then converting that 
value using toString().  This gives me some text, but I also get a lot of junk 
along with it (unreadable characters, etc.), which I assume is part of the file 
format for Word, or whatever binary document I'm converting.  I'm not sure 
whether I can include this 'junk' in my index without harming its 
searchability, nor am I sure that I'm getting ALL of the text, ALL the time, so 
I would prefer to extract JUST the text so it can be indexed properly.
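
For illustration only, the crudest way to strip that junk would be to keep 
just the runs of printable ASCII bytes, much like the Unix `strings` utility 
does.  The sketch below is in Java, since ColdFusion MX runs on the JVM; the 
class name PrintableRuns and its extract() method are hypothetical, not part of 
ColdFusion or any library.  It is a heuristic rather than real text extraction: 
it would miss any text stored as 16-bit characters or inside compressed 
streams, so it does not settle the "ALL of the text" concern above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a ColdFusion or library API.
class PrintableRuns {

    // Returns every run of at least minLen consecutive printable ASCII bytes
    // (0x20..0x7E, plus tab). Any other byte is treated as a separator.
    static List<String> extract(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // read the byte as an unsigned value
            if ((c >= 0x20 && c <= 0x7E) || c == '\t') {
                current.append((char) c);
            } else {
                flush(current, runs, minLen);
            }
        }
        flush(current, runs, minLen); // keep a run that ends at end of data
        return runs;
    }

    // Stores the pending run if it is long enough, then resets the buffer.
    private static void flush(StringBuilder current, List<String> runs, int minLen) {
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Made-up bytes standing in for a binary document.
        byte[] sample = {
            0x00, 0x01, 'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd',
            (byte) 0xD0, (byte) 0xCF, 'a', 'b', 0x00,
            'I', 'n', 'd', 'e', 'x', ' ', 'm', 'e'
        };
        for (String run : extract(sample, 4)) {
            System.out.println(run);
        }
    }
}
```

The minLen threshold is the only tuning knob: a higher value drops more stray 
fragments but also drops short genuine words that sit between binary bytes.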
