Re: [Rails] Extract text from PDF file

Walter McGinnis Mon, 31 Jan 2011 10:19:39 -0800

I wrote a plugin for this several years back. It requires attachment_fu and
relies on some Unix utilities behind the scenes:


https://github.com/kete/convert_attachment_to

It works reliably in Rails 2.x apps. I haven't tried it with Rails 3 yet.
You could fork it and update it (make it work with Paperclip or Rails 3) if
you like, or just have a gander for example code.
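
For reference, the core of the "shell out to a Unix utility" approach can be
sketched in a few lines of plain Ruby. This is an illustration only, not code
taken from the plugin: the method names pdftotext_argv and extract_pdf_text
are made up for this example, and it assumes the pdftotext binary is
installed and on the PATH.

```ruby
require "open3"

# Build the argument list for pdftotext. Passing "-" as the output file
# asks pdftotext to write the extracted text to stdout.
# (pdftotext_argv is a hypothetical helper name.)
def pdftotext_argv(pdf_path, binary = "pdftotext")
  [binary, "-nopgbrk", "-enc", "UTF-8", pdf_path, "-"]
end

# Run pdftotext and return the extracted text, or nil if the command
# fails or the binary cannot be found.
def extract_pdf_text(pdf_path, binary = "pdftotext")
  out, err, status = Open3.capture3(*pdftotext_argv(pdf_path, binary))
  return out if status.success?

  warn "pdftotext failed for #{pdf_path}: #{err.strip}"
  nil
rescue Errno::ENOENT
  warn "#{binary} not found; is it installed?"
  nil
end
```

Passing the arguments as an array rather than one interpolated string keeps
file names with spaces or shell metacharacters from being interpreted by a
shell.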

Cheers,
Walter



On Tue, Feb 1, 2011 at 6:36 AM, Garrett Lancaster <
[email protected]> wrote:

>  pdftk, pdfbox (java), pdfkit
>
> Garrett Lancaster
>
>  ------------------------------
>
>    Walter Lee Davis <[email protected]>
> January 31, 2011 11:32 AM
>
>
> On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:
>
>
> I did this using Paperclip and defining a processor for Paperclip as
> follows:
>
> #lib/paperclip_processors/text.rb
> module Paperclip
>   # Handles extracting plain text from PDF file attachments
>   class Text < Processor
>
>     attr_accessor :whiny
>
>     # Creates a Text extract from PDF
>     def make
>       src = @file
>       dst = Tempfile.new([@basename, 'txt'].compact.join("."))
>       command = <<-end_command
>         "#{ File.expand_path(src.path) }"
>         "#{ File.expand_path(dst.path) }"
>       end_command
>
>       begin
>         success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
>         Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
>       rescue PaperclipCommandLineError
>         raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
>       end
>       dst
>     end
>   end
> end
>
> #app/models/document.rb
>   has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } },
>                     :processors => [:text]
>   after_post_process :extract_text
>
>   private
>   def extract_text
>     plain_text = ""
>     # File.foreach closes the file when it finishes, unlike a bare File.open
>     File.foreach(pdf.queued_for_write[:text].path) do |line|
>       plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
>     end
>     # text column to hold the extracted text for searching
>     self.plain_text = plain_text
>   end
>
> I had to find and install the creaky-old pdftotext library on my server
> (happily, there was an apt-get bundle for it) and configure the path
> correctly. When Paperclip accepts a PDF upload, it creates a text extraction
> of that file and saves it in system/pdfs/:id/text/filename.pdf. Note that
> while it has a .pdf extension, the file itself is actually just the plain
> text extracted from the original pdf. After quite a lot of googling and
> begging my local Ruby group, I got the recipe for ripping open that text
> file and reading it into a variable to store on the record. The text you get
> out of pdftotext will vary wildly in quality and comprehensiveness, but
> since all I needed was a way to get a simple search system fed, it works
> fine for my needs. I never show this text to anyone, just use it as the
> "keywords" for search. You may want/need to present an editing field for the
> administrator to clean up these extracted texts.
>
> Walter
>
> ------------------------------
>
>    Tushar Gandhi <[email protected]>
> January 31, 2011 11:12 AM
>
> Hi,
> In my upcoming application we will be uploading PDF files.
> After a PDF file is uploaded, I have to extract its text and
> display it to the user.
> Can anyone tell me how to extract text from a PDF file?
> Is there any plugin or gem available for this?
> Thanks,
> Tushar
>
>    --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected]<rubyonrails-talk%[email protected]>
> .
> For more options, visit this group at
> http://groups.google.com/group/rubyonrails-talk?hl=en.
>
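
A note on the Iconv step in the quoted extract_text, for anyone reading this
on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0.
String#encode can do a similar job. The following is a rough sketch, not a
drop-in replacement that has been tested against the quoted Paperclip setup;
the method names to_ascii and read_extracted_text are invented for this
example.

```ruby
# Drop every character that cannot be represented in ASCII, which is
# roughly what Iconv.conv('ASCII//IGNORE', 'UTF8', line) does in the
# quoted code. Invalid byte sequences are dropped as well.
def to_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Hypothetical helper: read an extracted text file as UTF-8 and return
# an ASCII-only string suitable for a simple search column.
def read_extracted_text(path)
  to_ascii(File.read(path, encoding: "UTF-8"))
end
```

In the model callback this would be called with the path of the queued text
style, e.g. read_extracted_text(pdf.queued_for_write[:text].path). If the
search backend handles UTF-8, skipping the ASCII conversion altogether keeps
accented characters searchable.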
