Hi Neha,
I have working C# code for this; see if it helps. It lets you view the XML
document model of a PDF form, and you can then process that XML further
after extraction:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
using System.IO;
using Word = Microsoft.Office.Interop.Word;
using System.Xml;
namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Do nothing if the user cancels the file dialog.
            if (ofdSelectPdf.ShowDialog() != DialogResult.OK)
            {
                return;
            }
            // Take the path of the selected PDF file.
            string pdfTemplate = ofdSelectPdf.FileName;
            textBox1.Text = ListFieldNames(pdfTemplate);
            textBox1.Select(0, 0);
        }

        // Reports whether the PDF contains an XFA form; anything else is
        // reported as an AcroForm.
        public string ReadFieldnames(PdfReader reader)
        {
            AcroFields form = reader.AcroFields;
            XfaForm xfa = form.Xfa;
            return xfa.XfaPresent ? "XFA form" : "AcroForm";
        }

        // Returns the XFA XML of the PDF as an indented string for the text box.
        private string ListFieldNames(string pdfTemplate)
        {
            this.Text += " - " + pdfTemplate;
            // Create a new PDF reader based on the PDF template document.
            PdfReader pdfReader = new PdfReader(pdfTemplate);
            try
            {
                // Check whether the form is an XFA form or not.
                string str = ReadFieldnames(pdfReader);
                MessageBox.Show(str);
                XfaForm xfa = new XfaForm(pdfReader);
                System.Xml.XmlDocument doc = xfa.DomDocument;
                if (doc == null || doc.DocumentElement == null)
                {
                    // The PDF has no XFA data, so there is no XML to show.
                    return string.Empty;
                }
                if (!string.IsNullOrEmpty(doc.DocumentElement.NamespaceURI))
                {
                    // Reset the default namespace and reload the document.
                    doc.DocumentElement.SetAttribute("xmlns", "");
                    System.Xml.XmlDocument new_doc = new System.Xml.XmlDocument();
                    new_doc.LoadXml(doc.OuterXml);
                    doc = new_doc;
                }
                StringBuilder sb = new StringBuilder();
                var Xsettings = new System.Xml.XmlWriterSettings() { Indent = true };
                using (var writer = System.Xml.XmlWriter.Create(sb, Xsettings))
                {
                    doc.WriteTo(writer);
                }
                return sb.ToString();
            }
            finally
            {
                // Release the file handle held by the reader.
                pdfReader.Close();
            }
        }
    }
}
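As one example of processing the XML after extraction, here is a minimal sketch that is separate from the form code above. It takes an XML string, such as the one returned by ListFieldNames, and lists every leaf element with its path and text. It uses only System.Xml. The class name XmlLeafLister, the method ListLeaves, and the sample XML are made up for illustration; a real XFA document will have a different and much larger structure.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlLeafLister
{
    // Returns "path = text" for every element that has no child elements.
    public static List<string> ListLeaves(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var result = new List<string>();
        Walk(doc.DocumentElement, "", result);
        return result;
    }

    // Depth-first walk that records only the elements without child elements.
    private static void Walk(XmlElement element, string parentPath, List<string> result)
    {
        string path = parentPath + "/" + element.Name;
        bool hasChildElement = false;
        foreach (XmlNode child in element.ChildNodes)
        {
            XmlElement childElement = child as XmlElement;
            if (childElement != null)
            {
                hasChildElement = true;
                Walk(childElement, path, result);
            }
        }
        if (!hasChildElement)
        {
            result.Add(path + " = " + element.InnerText.Trim());
        }
    }

    public static void Main()
    {
        // Made-up sample input; pass the string returned by ListFieldNames instead.
        string sample = "<form><name>Jane</name><address><city>Oslo</city></address></form>";
        foreach (string line in ListLeaves(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the sample input this prints "/form/name = Jane" and "/form/address/city = Oslo". From there you could filter the paths or write the values to another format, depending on what you need.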
Thanks,
Avay
On Tuesday, 3 December 2013 2:47 AM, Larry Evans <cppljev...@suddenlink.net>
wrote:
On 12/02/13 11:56, Larry Evans wrote:
> On 12/02/13 04:30, Neha Jain wrote:
>> Hi Team,
>>
>> I have a requirement to convert a PDF to XML, i.e., the contents of the PDF to XML.
>>
>> I have tried using TaggedPdfReaderTool, but I get the following exception:
>>
>> Exception in thread "main" java.io.IOException: No StructTreeRoot
>> found, this probably isn't a tagged PDF document!
>>
>> I understand that the PDF is unstructured (no tags to identify headings,
>> titles, tables, images, etc.) and so the tool cannot convert the document to XML.
>
> A PDF file can either be tagged or not; however, tags in this context
> are not the tags in an HTML or XML context.
> Chapter 13 of the itext book:
>
> http://itextpdf.com/book/chapter.php?id=13
>
> on page 423 explains what a tagged pdf file is.
Page 514 of the book says the TaggedPdfReaderTool:
won't work for PDF documents that don't have any structure...
but it will work for most tagged PDF files.
So I guess you're out of luck with an untagged PDF document.
>
>>
>> Please confirm my understanding.
>>
>> I have tried using the PdfReader class, which helps me get the entire content of
>> the PDF, but I am not able to find out which parts are headings, titles, or tables in
>> the PDF content. My requirement is to create an XML doc with the headings in
>> the PDF as tags and the content in the PDF as tag-element contents.
>>
>> Please let me know how this can be achieved using iText. It's urgent.
>
> I don't know how to do this without a tagged pdf. With a tagged pdf,
> TaggedPdfReaderTool works;
[snip]
There is another tool:
http://www.mobipocket.com/dev/pdf2xml/
However, it doesn't handle fields, or it doesn't show
any fields when run on:
http://www.irs.gov/pub/irs-pdf/f1040.pdf
Instead, it just puts the text in xml elements.
HTH.
-regards,
Larry
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples:
http://itextpdf.com/themes/keywords.php