Re: [iText-questions] Converting PDF to XML format

Avay Singh Mon, 02 Dec 2013 16:34:59 -0800

Hi Neha,

I have working C# code for this; see if it helps. It shows the XML document 
model of a PDF that contains an XFA form, and you can process that XML 
further after extraction:


using System;
using System.Text;
using System.Windows.Forms;
using System.Xml;
using iTextSharp.text.pdf;


namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Do nothing if the user cancels the dialog; otherwise the path
            // would be null and PdfReader would throw.
            if (ofdSelectPdf.ShowDialog() != DialogResult.OK)
            {
                return;
            }

            // Take the path of the selected PDF file.
            string pdfTemplate = ofdSelectPdf.FileName;
            textBox1.Text = ListFieldNames(pdfTemplate);
            textBox1.Select(0, 0);
        }
        
        // Reports "XFA form" when the PDF carries XFA data, "AcroForm" otherwise.
        public string ReadFieldnames(PdfReader reader)
        {
            AcroFields form = reader.AcroFields;
            XfaForm xfa = form.Xfa;
            return xfa.XfaPresent ? "XFA form" : "AcroForm";
        }

        // Returns the XFA XML of the PDF as an indented string for the text box.
        private string ListFieldNames(string pdfTemplate)
        {
            this.Text += " - " + pdfTemplate;

            // Create a new PDF reader based on the PDF template document.
            PdfReader pdfReader = new PdfReader(pdfTemplate);
            try
            {
                // Check whether the form is an XFA form or not.
                string str = ReadFieldnames(pdfReader);
                MessageBox.Show(str);

                XfaForm xfa = new XfaForm(pdfReader);
                System.Xml.XmlDocument doc = xfa.DomDocument;

                // DomDocument is only populated when the PDF carries XFA data.
                if (!xfa.XfaPresent || doc == null || doc.DocumentElement == null)
                {
                    return "No XFA data found in this PDF.";
                }

                if (!string.IsNullOrEmpty(doc.DocumentElement.NamespaceURI))
                {
                    doc.DocumentElement.SetAttribute("xmlns", "");
                    System.Xml.XmlDocument new_doc = new System.Xml.XmlDocument();
                    new_doc.LoadXml(doc.OuterXml);
                    doc = new_doc;
                }

                // Pretty-print the XML into a string.
                StringBuilder sb = new StringBuilder();
                var Xsettings = new System.Xml.XmlWriterSettings() { Indent = true };
                using (var writer = System.Xml.XmlWriter.Create(sb, Xsettings))
                {
                    doc.WriteTo(writer);
                }

                return sb.ToString();
            }
            finally
            {
                // Release the file handle held by the reader.
                pdfReader.Close();
            }
        }
     }

}
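
In case it helps to try the pretty-printing step on its own, without a PDF 
or iTextSharp, here is a minimal sketch that uses only System.Xml. The class 
name PrettyPrintDemo, the method name Indent and the sample XML string are 
made up for illustration; the sample is only a stand-in for whatever XML the 
XFA form actually contains.

```csharp
using System;
using System.Text;
using System.Xml;

public static class PrettyPrintDemo
{
    // Returns an indented copy of the given XML text, using the same
    // XmlWriterSettings { Indent = true } approach as the form code.
    public static string Indent(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { Indent = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            doc.WriteTo(writer);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample; not taken from a real XFA form.
        string sample =
            "<form><field name=\"a\">1</field><field name=\"b\">2</field></form>";
        Console.WriteLine(Indent(sample));
    }
}
```

Each field element should come out on its own indented line, which is the 
same layout the text box shows for the XFA XML.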

Thanks,
Avay


On Tuesday, 3 December 2013 2:47 AM, Larry Evans <cppljev...@suddenlink.net> 
wrote:
 
On 12/02/13 11:56, Larry Evans wrote:
> On 12/02/13 04:30, Neha Jain wrote:
>> Hi Team,
>>
>> I have a requirement to convert a PDF to XML, i.e. the contents of the
>> PDF to XML.
>>
>> I have tried using TaggedPdfReaderTool, but I get the following exception:
>>
>> Exception in thread "main" java.io.IOException: No StructTreeRoot
>> found, this probably isn't a tagged PDF document!
>>
>> I understand that the PDF is unstructured (no tags to identify headings,
>> title, table, image etc.) and so it cannot convert the document to XML.
>
> A pdf file can either be tagged or not; however, tags in this context
> are not the same as tags in an html or xml context.
> Chapter 13 of the itext book:
>
> http://itextpdf.com/book/chapter.php?id=13
>
> on page 423 explains what a tagged pdf file is.

Page 514 of the book says the TaggedPdfReaderTool:

   won't work for PDF documents that don't have any structure...
   but it will work for most tagged PDF files.

So I guess you're out of luck with an untagged PDF document.

>
>>
>> Please confirm my understanding.
>>
>> I have tried using the PdfReader class, which gives me the entire content
>> of the pdf, but I am not able to find out which parts of that content are
>> the heading, title or table. My requirement is to create an XML doc with
>> the headings in the pdf as tags and the content in the pdf as the
>> tag-element contents.
>>
>> Please let me know how this can be achieved using iText. It's urgent.
>
> I don't know how to do this without a tagged pdf.  With a tagged pdf,
> TaggedPdfReaderTool works;
[snip]
There is another tool:

http://www.mobipocket.com/dev/pdf2xml/

However, it doesn't handle fields, or it doesn't show
any fields when run on:

  http://www.irs.gov/pub/irs-pdf/f1040.pdf

Instead, it just puts the text in xml elements.


HTH.

-regards,
Larry



------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference 
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: 
http://itextpdf.com/themes/keywords.php
