[Nutch-dev] RE: Nutch Parsing PDFs, and general PDF extraction

Richard Braman Thu, 02 Mar 2006 01:15:06 -0800

I am not a PDF guru, but I have amassed quite a bit of information on
the topic.  I have been asking the PDF mavens of the world what the
issues are with parsing PDF, and reading up on the subject to get a
better understanding.  I have contributed all of this to the mailing
lists, but coding it is not something I would feel comfortable doing at
this point.  Maybe a coordinated lucene-nutch-pdfbox development effort
would be the best way to produce some good code for this.  I am trying
to get some dialog going.


Here is some code that another interested developer asked me to debug.
It uses PDFBox to extract tabular data from PDFs.  It seems to have some
bugs in it that I am trying to figure out.


        try
        {
                int i =1;
                String wordsep = null;
                String str = null;
                boolean flag = false;
                Writer output = null;
                PDDocument document = null;
                document = PDDocument.load( "53 Nostro Ofc Cofc Daily Position_AUS.pdf" );
                
                PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
                PDOutlineItem item = root.getFirstChild();
                PDOutlineItem item1 = item.getNextSibling();
                
                while( item1 != null )
                {       
                        System.out.println( "Item:" + item.getTitle() );
                        System.out.println( "Item1:" + item1.getTitle() );
                        output = new OutputStreamWriter(new FileOutputStream( "simple"+i+".txt" ) );
                        PDFTextStripperByArea stripper= null;
                        stripper=new PDFTextStripperByArea(); 
                        List reg = stripper.getRegions();
                        System.out.println(reg.size());
          
                //      PDFTextStripper stripper = null;
                        //stripper = new PDFTextStripper();
                        wordsep = stripper.getWordSeparator();
                        //stripper.setSortByPosition(true);
        
                        stripper.setStartBookmark(item);
                        
                        
                        stripper.setLineSeparator("\n");
                        stripper.setPageSeparator("\n\n\n\n");
                        // The separator was set twice; only the last value (three spaces) took effect.
                        stripper.setWordSeparator("   ");
                        stripper.setEndBookmark(item1);
                        //str = stripper.getText(document);
                        //output.write( str, 0, str.length()); 
                         
                        stripper.writeText( document, output );
                        // Close each section's writer so buffered text is flushed to its file.
                        output.close();
                        i++;
                        item = item.getNextSibling();
                        item1 = item1.getNextSibling();
                        
                }
                        PDOutlineItem child = item.getFirstChild();
                        PDOutlineItem child1 = new PDOutlineItem();
                        while( child != null )
                        {
                                child1 = child; 
                                child = child.getNextSibling();
                                
                        }
                        System.out.println( "Item:" + item.getTitle() );
                        System.out.println( "Item1:" + child1.getTitle() );
                        output = new OutputStreamWriter(new FileOutputStream( "simple"+i+".txt" ) );
                        PDFTextStripperByArea stripper= null;
                        stripper=new PDFTextStripperByArea(); 
                        
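                        // Print the separator itself; the original printed the unused boolean 'flag'.
                        System.out.println( "The word separator is " + stripper.getWordSeparator() );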
                        
                        //stripper.setSortByPosition(true);
                        
          
                        stripper.setLineSeparator("\n");
                        
                        stripper.setPageSeparator("\n\n\n\n");
                        stripper.setWordSeparator("  ");
                        stripper.setStartBookmark(item);
                        stripper.setEndBookmark(child1);
                        //str = stripper.getText(document);
 
                        stripper.setShouldSeparateByBeads(stripper.shouldSeparateByBeads());
                        stripper.writeText( document, output );
                
                output.close();  
                document.close();
        }
         catch(Exception ex)
        {
                System.out.println(ex);
        }
    }    
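
The loop above pairs each top-level bookmark with the sibling that
follows it and uses the pair as start and end bookmarks.  It throws a
NullPointerException if the document has no outline, or if the outline
has no first child.  Below is a minimal sketch of the same pairing with
null-safe traversal.  Item is a hypothetical stand-in for PDOutlineItem
so the example runs without PDFBox, and sectionRanges is an illustrative
name, not a PDFBox method.  Unlike the code above, which ends the last
section at the last child of the final bookmark, the sketch leaves the
last range open.  Whether an unset end bookmark makes the stripper
extract to the end of the document is an assumption to check against
the PDFBox version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the outline walk above. Item is hypothetical, not PDFBox's PDOutlineItem.
class OutlineRanges {

    /** Minimal bookmark node: a title plus a link to the next sibling. */
    static class Item {
        final String title;
        Item next;

        Item(String title) { this.title = title; }

        Item getNextSibling() { return next; }
    }

    /**
     * Pairs each top-level bookmark with the sibling that follows it.
     * The last bookmark has no following sibling, so its range is left open.
     */
    static List<String> sectionRanges(Item first) {
        List<String> ranges = new ArrayList<String>();
        // Null-safe walk: an empty outline simply yields no ranges.
        for (Item start = first; start != null; start = start.getNextSibling()) {
            Item end = start.getNextSibling();
            ranges.add(start.title + " -> " + (end != null ? end.title : "end of document"));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Sample titles are made up for illustration.
        Item a = new Item("Summary");
        Item b = new Item("Positions");
        Item c = new Item("Totals");
        a.next = b;
        b.next = c;
        for (String range : sectionRanges(a)) {
            System.out.println(range);
        }
    }
}
```

With real PDFBox objects, each pair would go to setStartBookmark and
setEndBookmark as in the code above, after first checking that
getDocumentOutline() did not return null.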

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 02, 2006 3:42 AM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction


> This is something google does very well, and something nutch must 
> match to compete.

Richard, it seems you are a real pdf guru, so any code contribution to
nutch is welcome.
;-)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


