Hi folks,

I would appreciate any advice on the following issue. I have tried all
kinds of encodings (UTF-8, ISO-8859-1, etc.) and have also tried setting
some BaseFonts (very much trial and error), but no luck so far. I admit
openly that I'm an iTextSharp newb, and a C# newb too, so thanks in advance
for your understanding.

I am reading a small portion of a PDF page, defined by a rectangular area,
using ITextExtractionStrategy, and then writing the extracted text out to a
CSV file.

Everything seems to work, except that every time there is a dash ("-") in
the PDF, it gets written to my CSV as a null character (shown as "NUL" in
Notepad++).

The code is pretty straightforward: it reads through all the PDFs in a
folder, pulls data off the first page of each, and writes that out to a
text file:

using System.Configuration;
using System.Collections.Specialized;
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string PDFPath = ConfigurationManager.AppSettings["PDFPath"];
            string CSVPath = ConfigurationManager.AppSettings["CSVPath"];
            int fileCounter = 0;

            System.IO.StreamWriter fileout = new System.IO.StreamWriter(CSVPath);

            foreach (string fileName in Directory.GetFiles(PDFPath))
            {
                try
                {
                    PdfReader reader = new PdfReader(fileName);

                    //define rectangular area of page where the microtext is...and the itext "strategy"
                    System.util.RectangleJ rect = new System.util.RectangleJ(0, 0, 3000, 30);
                    RenderFilter[] renderFilter = new RenderFilter[1];
                    renderFilter[0] = new RegionTextRenderFilter(rect);
                    ITextExtractionStrategy textExtractionStrategy =
                        new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);

                    string tempString =
                        PdfTextExtractor.GetTextFromPage(reader, 1, textExtractionStrategy);   // .Replace("\n", "")
                    string numberOfPages = reader.NumberOfPages.ToString();

                    //write line to file
                    fileout.WriteLine(tempString + "|" + Path.GetFileName(fileName) + "|" + numberOfPages);
                }
                catch (IOException)
                {
                    Console.WriteLine("Bad PDF!");
                }
                catch (Exception e)
                {
                    Console.WriteLine("The MicroTextReader process failed: {0}", e.ToString());
                }
            }
            fileout.Close();
            // Suspend the screen. (keep console window open and in suspension - allows to view console output....)
            //Console.ReadLine();
        }
    }
}

Any advice here? Thanks for your consideration!

/trev

 
