Re: [Rails] Extract text from PDF file

Garrett Lancaster Mon, 31 Jan 2011 12:22:11 -0800

PDFBox is the library I'm using on a current project: http://pdfbox.apache.org/ There is a link to "Extract Text" under Command Line Utilities. There is also a section called "Text Extraction" under Tutorials.
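
For anyone who wants to drive that "Extract Text" utility from Ruby, here is a minimal sketch. It assumes Java is on the PATH and that the PDFBox "app" jar lives at the path in PDFBOX_JAR (a made-up location, not something from the PDFBox docs). The helper names are hypothetical, and the "ExtractText <input> <output>" form is the PDFBox 1.x/2.x command line; later releases may use a different syntax.

```ruby
require "open3"

# Hypothetical location of the PDFBox "app" jar; adjust to your install.
PDFBOX_JAR = ENV.fetch("PDFBOX_JAR", "/opt/pdfbox/pdfbox-app.jar")

# Builds the argument list for PDFBox's ExtractText command-line utility.
# Passing an array (not a single string) avoids shell quoting problems.
def pdfbox_extract_command(pdf_path, txt_path, jar: PDFBOX_JAR)
  ["java", "-jar", jar, "ExtractText",
   File.expand_path(pdf_path), File.expand_path(txt_path)]
end

# Runs the command and returns the extracted text, raising on failure.
def pdfbox_extract_text(pdf_path, txt_path, jar: PDFBOX_JAR)
  command = pdfbox_extract_command(pdf_path, txt_path, jar: jar)
  _out, err, status = Open3.capture3(*command)
  raise "PDFBox failed: #{err}" unless status.success?
  File.read(txt_path)
end
```

Something like pdfbox_extract_text("upload.pdf", "upload.txt") would then return the text as a string, provided the jar path is right for your machine.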

There is a Ruby command line utility that wraps PDFBox called Docsplit: http://documentcloud.github.com/docsplit/ that might be worth looking into.


For pdftk: http://pdf-toolkit.rubyforge.org/classes/PDF/Toolkit.html#M000003


Hope this helps,
Garrett Lancaster

------------------------------------------------------------------------

Walter Lee Davis
January 31, 2011 1:58 PM
I don't see how these relate to the question -- they are apparently designed to generate PDFs rather than to extract text from existing PDF documents. Can you point to an example where these libraries can be used in that fashion? I'd love to use something more professionally developed than my own system.
Walter

------------------------------------------------------------------------

Garrett Lancaster
January 31, 2011 11:36 AM


pdftk, pdfbox (java), pdfkit

Garrett Lancaster

------------------------------------------------------------------------

Walter Lee Davis
January 31, 2011 11:32 AM



On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:
I did this using Paperclip and defining a processor for Paperclip as follows:
#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

#app/models/document.rb
  has_attached_file :pdf,
                    :styles => { :text => { :fake => 'variable' } },
                    :processors => [:text]
  after_post_process :extract_text

  private
  def extract_text
    file = File.open(pdf.queued_for_write[:text].path, "r")
    plain_text = ""
    while (line = file.gets)
      plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
    end
    file.close # release the handle once the text has been read
    self.plain_text = plain_text # text column to hold the extracted text for searching
  end
I had to find and install the creaky-old pdftotext library on my server (happily, there was an apt-get bundle for it) and configure the path correctly. When Paperclip accepts a PDF upload, it creates a text extraction of that file and saves it in system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf extension, the file itself is actually just the plain text extracted from the original pdf. After quite a lot of googling and begging my local Ruby group, I got the recipe for ripping open that text file and reading it into a variable to store on the record. The text you get out of pdftotext will vary wildly in quality and comprehensiveness, but since all I needed was a way to get a simple search system fed, it works fine for my needs. I never show this text to anyone, just use it as the "keywords" for search. You may want/need to present an editing field for the administrator to clean up these extracted texts.
Walter
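
The same pdftotext step can also be tried outside Paperclip, which may help when checking what the tool returns for a given PDF. The following is only a sketch under some assumptions: pdftotext is on the PATH, the helper names are made up for illustration, and String#encode stands in for the Iconv call above (Iconv is no longer bundled with current Ruby versions).

```ruby
require "open3"

# Builds the pdftotext argument list. "-nopgbrk" drops page-break
# characters, "-enc UTF-8" requests UTF-8 output, and the trailing "-"
# asks pdftotext to write the text to stdout instead of a file.
def pdftotext_command(pdf_path, binary: "pdftotext")
  [binary, "-nopgbrk", "-enc", "UTF-8", File.expand_path(pdf_path), "-"]
end

# Reduces text to plain ASCII by dropping anything that cannot be
# converted, roughly what Iconv's 'ASCII//IGNORE' did in the model code.
def to_search_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Runs pdftotext and returns ASCII-only text suitable for a keyword column.
def extract_plain_text(pdf_path)
  out, err, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{err}" unless status.success?
  to_search_ascii(out.force_encoding("UTF-8"))
end
```

Dropping non-ASCII characters is lossy, so this only suits the search-keywords use described above. If the text is going to be shown to users, keeping it as UTF-8 is probably the better choice.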

------------------------------------------------------------------------

Tushar Gandhi
January 31, 2011 11:12 AM


Hi,
In my upcoming application, users will upload PDF files.
After a PDF is uploaded, I have to extract its text and
display it to the user.
Can anyone tell me how to extract text from a PDF file?
Is there a plugin or gem for this?
Thanks,
Tushar

