[sage-trac] Re: [Sage] #4825: extract worksheets embedded in pdf files

Sage Wed, 09 Dec 2009 06:42:33 -0800

#4825: extract worksheets embedded in pdf files
---------------------------+------------------------------------------------
   Reporter:  jason        |       Owner:  boothby 
       Type:  enhancement  |      Status:  new     
   Priority:  major        |   Milestone:  sage-4.3
  Component:  notebook     |    Keywords:          
Work_issues:               |      Author:          
   Upstream:  N/A          |    Reviewer:          
     Merged:               |  
---------------------------+------------------------------------------------
Changes (by jason):


  * upstream:  => N/A


Old description:

> This is an ongoing discussion on sage-devel right now.
>
> Basically, we'd like to embed an sws file in a pdf and then be able to
> upload the pdf file to the notebook and have the notebook automatically
> extract the sws file and create the worksheet.
>
> We can use pdfminer to extract the data.  Here's a sample program which
> extracts the first embedded file in a pdf named 'foo.pdf'.
>
> {{{
> from pdflib.pdfparser import PDFDocument, PDFParser
> import sys
> stdout = sys.stdout
>
> doc = PDFDocument()
> fp = file('foo.pdf', 'rb')
> parser = PDFParser(doc, fp)
> doc.initialize()
>
> for xref in doc.xrefs:
>     for objid in xref.objids():
>         try:
>             obj = doc.getobj(objid)
>         except:
>             continue
>         if isinstance(obj,dict) and 'Type' in obj and obj['Type'].name == "Annot":
>             if 'Subtype' in obj and obj['Subtype'].name == "FileAttachment":
>                 # We have an attached file!
>                 filespec = obj['FS']
>                 # Look for embedded file; we could try to extract the
>                 # filename too (and make sure it's an sws file). but that
>                 # is platform dependent.  See page 182 (Section 3.10.2) of
>                 # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
>                 if 'EF' in filespec:
>                     fileobj = filespec['EF']['F']
>                     embeddedspec = filespec['EF']
>                     stdout.write(fileobj.resolve().get_data())
>                     # Just output the first file found.
>                     exit()
> }}}

New description:

 This is an ongoing discussion on sage-devel right now:
 http://groups.google.com/group/sage-devel/browse_frm/thread/65a932ea328b1afb/91ced495a0a1c27a

 Basically, we'd like to embed an sws file in a pdf and then be able to
 upload the pdf file to the notebook and have the notebook automatically
 extract the sws file and create the worksheet.

 We can use pdfminer to extract the data.  Here's a sample program which
 extracts the first embedded file in a pdf named 'foo.pdf'.

 {{{
 from pdflib.pdfparser import PDFDocument, PDFParser
 import sys
 stdout = sys.stdout

 doc = PDFDocument()
 fp = open('foo.pdf', 'rb')
 parser = PDFParser(doc, fp)
 doc.initialize()

 for xref in doc.xrefs:
     for objid in xref.objids():
         try:
             obj = doc.getobj(objid)
         except:
             continue
         if isinstance(obj,dict) and 'Type' in obj and obj['Type'].name == "Annot":
             if 'Subtype' in obj and obj['Subtype'].name == "FileAttachment":
                 # We have an attached file!
                 filespec = obj['FS']
                 # Look for embedded file; we could try to extract the
                 # filename too (and make sure it's an sws file), but that
                 # is platform dependent.  See page 182 (Section 3.10.2) of
                 # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
                 if 'EF' in filespec:
                     fileobj = filespec['EF']['F']
                     embeddedspec = filespec['EF']
                     stdout.write(fileobj.resolve().get_data())
                     # Just output the first file found.
                     sys.exit()
 }}}
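
 The comment in the sample above mentions making sure the attachment really
 is an sws file. Since the embedded filename is platform dependent, one
 option is to inspect the extracted bytes instead. The following is only a
 sketch: it assumes an sws file is a bzip2-compressed tar archive, and the
 helper name looks_like_sws is hypothetical, not part of the notebook or of
 pdfminer.

```python
import io
import tarfile


def looks_like_sws(data):
    """Return True if `data` (bytes) opens as a non-empty bzip2-compressed tar.

    Assumption: an sws worksheet is a tar.bz2 archive.  This only checks the
    container format, not that the contents form a valid worksheet.
    """
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as archive:
            # Listing the members forces the archive to actually be parsed.
            return len(archive.getnames()) > 0
    except (tarfile.TarError, OSError, EOFError, ValueError):
        # Not bzip2 data, not a tar archive, or truncated input.
        return False
```

 The notebook could call such a check on the bytes returned by get_data()
 before trying to create a worksheet, and reject the upload otherwise.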

-- 
Ticket URL: <http://trac.sagemath.org/sage_trac/ticket/4825#comment:3>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica, 
and MATLAB
