Extracting text from various file-types
Hello again to all.

I need a way to extract text from Word, Excel, text, PDF, and PPT files with ColdFusion, as the files are each submitted via a form. The output does not have to be particularly pretty or nicely formatted -- just plain text that can be stored and searched later.

Any ideas?

--RR

~| Order the Adobe Coldfusion Anthology now! http://www.amazon.com/Adobe-Coldfusion-Anthology/dp/1430272155/?tag=houseoffusion
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:352103
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/groups/cf-talk/unsubscribe.cfm
Re: Extracting text from various file-types
Check out the CFFILE tag. That offers this type of functionality.

Bruce

On Aug 10, 2012, at 4:07 PM, Robert Rhodes rrhode...@gmail.com wrote:
Hello again to all. I need a way to extract text from Word, Excel, text, PDF, and PPT files with ColdFusion, as the files are each submitted via a form. The output does not have to be particularly pretty or nicely formatted -- just plain text that can be stored and searched later. Any ideas? --RR

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:352104
Re: Extracting text from various file-types
Hi Bruce. Thanks for the reply.

I did, but no luck. On text files, I got the text just fine. On Word docs, I got the text but with a whole bunch of garbage in the return. On PPT, PDF, and Excel docs, everything comes out as unreadable garbage. I tried both the read and readbinary actions, and neither worked.

Maybe I am doing something wrong? I am using CF9.

-RR

On Fri, Aug 10, 2012 at 6:11 PM, Bruce Sorge sor...@gmail.com wrote:
Check out the CFFILE tag. That offers this type of functionality. Bruce

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:352110
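A note on the garbage output described above: it is what one would expect if the files are binary containers rather than plain text. As general background (not something stated in the thread), legacy .doc/.xls/.ppt files are OLE2 compound files, the newer .docx/.xlsx/.pptx files are ZIP archives, and PDF text is usually stored in compressed streams, so reading the raw bytes as characters cannot yield clean text. The Java sketch below only illustrates that point by checking a file's leading bytes; the class and method names are hypothetical, and it is not an extraction routine.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormatSniffer {
    // Well-known leading byte signatures ("magic numbers").
    private static final byte[] PDF = "%PDF".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04}; // "PK\3\4": docx, xlsx, pptx
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1          // legacy doc, xls, ppt
    };

    // True when data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classify a file by its first few bytes; "unknown" covers plain text and anything else.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        byte[] textHead = "Hello again to all.".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead));
        System.out.println(sniff(textHead));
    }
}
```

A result of "ole2", "zip", or "pdf" means the file needs a format-aware reader; only the "unknown" plain-text case is safe to read directly as characters.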
Re: Extracting text from various file-types
For Word, did you add the attribute in cffile action=readbinary? For Excel, there is a cfspreadsheet tag that will read a spreadsheet; you can put a query attribute on it and output the result. For PDFs, there is a cfpdf tag that you can use. Obviously you will have to get the file type, then use cfif to tell the page which tag to use for which file.

Hope this helps.

Bruce

On Aug 10, 2012, at 4:48 PM, Robert Rhodes rrhode...@gmail.com wrote:
Hi Bruce. Thanks for the reply. I did, but no luck. On text files, I got the text just fine. On Word docs, I got the text but with a whole bunch of garbage in the return. On ppt, pdf, and excel docs, they all come out as unreadable garbage. I tried both the read and readbinary actions and they both did not work. Maybe I am doing something wrong? I am using CF9. -RR

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:352113
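The "get the file type, then branch" step suggested above can be sketched as a small lookup. This is a minimal Java illustration, assuming the type is taken from the uploaded file's extension. The class name, method names, and route labels are hypothetical; the labels simply echo the tags and library named in this thread, and which extensions each one actually handles on CF9 is an assumption to verify.

```java
import java.util.Locale;
import java.util.Map;

public class ExtractorDispatch {
    // Hypothetical routing table: extension -> extraction approach named in the thread.
    // The pairing of extensions to approaches is an assumption, not confirmed behavior.
    static final Map<String, String> ROUTES = Map.of(
        "txt", "cffile",
        "pdf", "cfpdf",
        "xls", "cfspreadsheet",
        "xlsx", "cfspreadsheet",
        "doc", "poi",
        "docx", "poi",
        "ppt", "poi",
        "pptx", "poi"
    );

    // Lower-cased extension without the dot, or "" when there is none.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Pick a route for the file, falling back to "unsupported".
    static String routeFor(String fileName) {
        return ROUTES.getOrDefault(extension(fileName), "unsupported");
    }

    public static void main(String[] args) {
        String[] uploads = {"notes.txt", "Report.PDF", "budget.xlsx", "slides.ppt", "archive.zip"};
        for (String name : uploads) {
            System.out.println(name + " -> " + routeFor(name));
        }
    }
}
```

A lookup table keeps the branching in one place, so supporting another file type means adding one entry rather than another cfif-style branch. Extensions supplied by a form upload can be wrong or spoofed, so treating unknown or mismatched files as "unsupported" is the safer default.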
Re: Extracting text from various file-types
I do not have the URL handy, but take a look at Raymond Camden's blog. He wrote an entry on extracting text from MS Office documents using POI. For PDF, use cfpdf's extract text option.

-Leigh

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:352115