I wrote a plugin that requires attachment_fu and some unixy utilities behind the scenes for this several years back:
https://github.com/kete/convert_attachment_to

It works reliably in Rails 2.x apps. I haven't tried it with Rails 3 yet. You could fork it and update it (to make it work with Paperclip or Rails 3) if you like, or just have a gander for example code.

Cheers,
Walter

On Tue, Feb 1, 2011 at 6:36 AM, Garrett Lancaster < [email protected]> wrote:

> pdftk, pdfbox (java), pdfkit
>
> Garrett Lancaster
>
> ------------------------------
>
> Walter Lee Davis <[email protected]>
> January 31, 2011 11:32 AM
>
>
> On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:
>
>
> I did this using Paperclip and defining a processor for Paperclip as
> follows:
>
> #lib/paperclip_processors/text.rb
> module Paperclip
>   # Handles extracting plain text from PDF file attachments
>   class Text < Processor
>
>     attr_accessor :whiny
>
>     # Creates a Text extract from PDF
>     def make
>       src = @file
>       dst = Tempfile.new([@basename, 'txt'].compact.join("."))
>       command = <<-end_command
>         "#{ File.expand_path(src.path) }"
>         "#{ File.expand_path(dst.path) }"
>       end_command
>
>       begin
>         success = Paperclip.run("/usr/bin/pdftotext -nopgbrk",
>                                 command.gsub(/\s+/, " "))
>         Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
>       rescue PaperclipCommandLineError
>         raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
>       end
>       dst
>     end
>   end
> end
>
> #app/models/document.rb
> has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } },
>                   :processors => [:text]
> after_post_process :extract_text
>
> private
> def extract_text
>   file = File.open("#{pdf.queued_for_write[:text].path}","r")
>   plain_text = ""
>   while (line = file.gets)
>     plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
>   end
>   self.plain_text = plain_text #text column to hold the extracted text for searching
> end
>
> I had to find and install the creaky-old pdftotext library on my server
> (happily, there was an apt-get bundle for it) and configure the path
> correctly.
> When Paperclip accepts a PDF upload, it creates a text extraction
> of that file and saves it in system/pdfs/:id/text/filename.pdf. Note that
> while it has a .pdf extension, the file itself is actually just the plain
> text extracted from the original pdf. After quite a lot of googling and
> begging my local Ruby group, I got the recipe for ripping open that text
> file and reading it into a variable to store on the record. The text you get
> out of pdftotext will vary wildly in quality and comprehensiveness, but
> since all I needed was a way to get a simple search system fed, it works
> fine for my needs. I never show this text to anyone, just use it as the
> "keywords" for search. You may want/need to present an editing field for the
> administrator to clean up these extracted texts.
>
> Walter
>
> ------------------------------
>
> Tushar Gandhi <[email protected]>
> January 31, 2011 11:12 AM
>
> Hi,
> In my upcoming application we are uploading the pdf files.
> After uploading the pdf file I have to extract the text from pdf and
> display it to user.
> can anyone tell me how to extract text from pdf file?
> Is there any plugin or gem present for this?
> Thanks,
> Tushar
>
> --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected]<rubyonrails-talk%[email protected]>
> .
> For more options, visit this group at
> http://groups.google.com/group/rubyonrails-talk?hl=en.
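
If you only need the extracted text and don't want to go through a Paperclip processor, the pdftotext step from the quoted message can probably be reduced to something like the rough sketch below. It is not taken from either plugin. The names pdftotext_command, extract_pdf_text and PdfTextError are made up for illustration, and it assumes pdftotext is on the PATH rather than at /usr/bin/pdftotext. It passes "-" as the output file so pdftotext writes to stdout instead of a Tempfile, and it uses String#scrub (Ruby 2.1+) in place of Iconv, which newer Rubies no longer ship. Unlike the Iconv call, scrub keeps valid non-ASCII characters and only drops invalid bytes.

```ruby
require "open3"

# Hypothetical error class for this sketch; not part of Paperclip or any plugin.
PdfTextError = Class.new(StandardError)

# Builds the argument list for pdftotext. -nopgbrk is the flag used in the
# quoted processor; "-" as the output file asks pdftotext to write to stdout.
def pdftotext_command(pdf_path, binary: "pdftotext")
  [binary, "-nopgbrk", File.expand_path(pdf_path), "-"]
end

# Runs pdftotext without a shell and returns the extracted text as a String.
# Assumption: the binary is installed; a missing binary or a non-zero exit
# status is reported as PdfTextError.
def extract_pdf_text(pdf_path, binary: "pdftotext")
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, binary: binary))
  raise PdfTextError, "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout.scrub("") # drop invalid byte sequences, keep everything else
rescue Errno::ENOENT
  raise PdfTextError, "#{binary} is not installed or not on the PATH"
end
```

In a model you could then assign the result to the text column in a callback, e.g. self.plain_text = extract_pdf_text(path_to_uploaded_pdf), where path_to_uploaded_pdf is whatever your attachment library gives you for the stored file.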

