Re: [Rails] Extract text from PDF file

Walter Lee Davis Mon, 31 Jan 2011 09:32:15 -0800


On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:

Hi,
In my upcoming application we are uploading PDF files.
After uploading a PDF file I have to extract the text from it and
display it to the user.
Can anyone tell me how to extract text from a PDF file?
Is there any plugin or gem for this?
Thanks,
Tushar

I did this with Paperclip, defining a custom Paperclip processor as follows:


#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

#app/models/document.rb

  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]

  after_post_process :extract_text

  private
  def extract_text
    file = File.open("#{pdf.queued_for_write[:text].path}","r")
    plain_text = ""
    while (line = file.gets)
      plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
    end

    # plain_text is a text column that holds the extracted text for searching
    self.plain_text = plain_text
  end
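
One caveat on the Iconv call above: Iconv was removed from the Ruby standard library in Ruby 2.0, so on a newer Ruby the same "drop anything that isn't ASCII" step would need String#encode instead. A minimal sketch, assuming the extracted file is UTF-8; the helper names ascii_only and read_extracted_text are made up for illustration and are not part of Paperclip or the code above:

```ruby
# Hypothetical helper (not from the original post): reduce UTF-8 text to
# plain ASCII, dropping any character or byte that cannot be represented.
def ascii_only(text)
  text.encode('ASCII', 'UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Hypothetical helper: read an extracted text file line by line and return
# its ASCII-only content. The block form of File.open closes the file.
def read_extracted_text(path)
  File.open(path, 'r:UTF-8') do |file|
    file.each_line.map { |line| ascii_only(line) }.join
  end
end
```

In the model this would replace the while loop, something like self.plain_text = read_extracted_text(pdf.queued_for_write[:text].path). Whether throwing away non-ASCII characters is acceptable depends on your documents; for a keyword search over English text it was fine for me.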

I had to find and install the creaky-old pdftotext library on my
server (happily, there was an apt-get bundle for it) and configure the
path correctly. When Paperclip accepts a PDF upload, it creates a text
extraction of that file and saves it in
system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf
extension, the file itself is actually just the plain text extracted
from the original PDF. After quite a lot of googling and begging my
local Ruby group, I got the recipe for ripping open that text file and
reading it into a variable to store on the record.

The text you get out of pdftotext will vary wildly in quality and
comprehensiveness, but since all I needed was a way to get a simple
search system fed, it works fine for my needs. I never show this text
to anyone, just use it as the "keywords" for search. You may want/need
to present an editing field for the administrator to clean up these
extracted texts.
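
Since the quality varies so much, it can help to run pdftotext on a few sample files outside Rails before wiring it into Paperclip. Here is a rough standalone sketch using Open3 from the standard library. The PDFTOTEXT path is an assumption (it matches the path in my processor, but yours may differ), and the method names are invented for this example:

```ruby
require 'open3'
require 'tempfile'

# Assumed location of the binary; adjust to wherever pdftotext lives.
PDFTOTEXT = '/usr/bin/pdftotext'

# Build the argument list as an array so paths containing spaces need no
# shell quoting. -nopgbrk suppresses the form feeds between pages.
def pdftotext_command(src_path, dst_path)
  [PDFTOTEXT, '-nopgbrk', File.expand_path(src_path), File.expand_path(dst_path)]
end

# Run pdftotext and return the extracted text, or nil if the command
# fails or the binary is missing.
def extract_pdf_text(src_path)
  Tempfile.create(['extract', '.txt']) do |dst|
    _output, status = Open3.capture2e(*pdftotext_command(src_path, dst.path))
    status.success? ? File.read(dst.path) : nil
  end
rescue Errno::ENOENT
  nil # no pdftotext binary at PDFTOTEXT
end
```

Calling extract_pdf_text("sample.pdf") from irb and eyeballing the result gives a quick idea of whether the output is good enough for your search index.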


Walter
