Re: [iText-questions] How to extract title / heading from document contents

Michael O'Donovan Sun, 26 Jun 2011 11:30:08 -0700

Hi Mark,

Thanks for the response. Yes, in my case there are no tags, and I will have to
make some assumptions to determine the correct title. I can see how to extract
the plain text and the fonts used on a page (through the resources dictionary),
but I haven't seen the part of the API that lets me extract pieces of text
along with the font and size being used for each piece. Any ideas how I go
about doing that?

Michael

Date: Fri, 24 Jun 2011 10:38:05 -0700
From: [email protected]
To: [email protected]
Subject: Re: [iText-questions] How to extract title / heading from document contents

iText can give you the font name, size, and location of all the text on the
page.  Without PDF Structure, it is up to you to interpret that information.

For a title, you might consider all the text in the largest font on the first
page to be the title.  Such heuristics will be brittle.  You could refine or
relax it in various ways to work better with your particular PDFs, but at the
end of the day, it will always be possible to find (or create) a PDF that will
break your heuristic.
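
As a minimal sketch of that heuristic in Java: this is not iText code. The
`Chunk` class is a hypothetical holder for text runs that an earlier extraction
step is assumed to have produced, with the page number and font size already
known. The rule itself is just "keep everything on page 1 that is set in the
largest size found there".

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "largest font on the first page" title heuristic. */
class LargestFontTitle {

    /** Hypothetical holder for one run of text; NOT an iText type. */
    static final class Chunk {
        final int page;        // 1-based page number (assumed known)
        final String text;     // extracted text of the run
        final float fontSize;  // font size in points (assumed known)

        Chunk(int page, String text, float fontSize) {
            this.page = page;
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Joins, in input order, every page-1 chunk set in the largest size found on page 1. */
    static String guessTitle(List<Chunk> chunks) {
        // Pass 1: find the largest font size used on the first page.
        float largest = 0f;
        for (Chunk c : chunks) {
            if (c.page == 1 && c.fontSize > largest) {
                largest = c.fontSize;
            }
        }
        // Pass 2: collect the chunks that use exactly that size.
        // Exact comparison keeps the sketch simple; sizes extracted from a
        // real PDF would probably need a small tolerance.
        List<String> parts = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.page == 1 && c.fontSize == largest) {
                parts.add(c.text.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(1, "Annual Report", 24f));
        chunks.add(new Chunk(1, "Prepared by the finance team", 11f));
        chunks.add(new Chunk(1, "2011", 24f));
        chunks.add(new Chunk(2, "Appendix", 30f)); // ignored: not on page 1
        System.out.println(guessTitle(chunks));    // prints "Annual Report 2011"
    }
}
```

The sample chunks are made up. The point is only that the rule needs nothing
beyond text, page, and size, which is also why it is so easy to fool.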
 
A biography of Theodore Roosevelt might be entitled:

Speak softly and
Carry a Big Stick.

A reasonable heuristic could determine this title to be "Carry a Big Stick".
And it would be wrong.
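
To make the failure concrete, here is a small Java sketch that runs a
largest-font rule over that title page. Everything specific in it is an
assumption for illustration: the font sizes (18, 36, and 14 points), the extra
subtitle line, and the `Chunk` and `guessTitle` names. None of it comes from
iText or from a real PDF. The `ratio` parameter is one possible way to "relax"
the rule: a chunk is kept if its size is at least `ratio` times the largest
size on the page.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a largest-font title rule with an adjustable tolerance. */
class RelaxedTitleHeuristic {

    /** Hypothetical holder for one first-page text run; NOT an iText type. */
    static final class Chunk {
        final String text;
        final float fontSize; // points; assumed values, see below

        Chunk(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Keeps chunks whose size is at least ratio * (largest size on the page).
     * ratio = 1.0 is the strict "largest font only" rule.
     */
    static String guessTitle(List<Chunk> firstPage, float ratio) {
        float largest = 0f;
        for (Chunk c : firstPage) {
            largest = Math.max(largest, c.fontSize);
        }
        List<String> parts = new ArrayList<>();
        for (Chunk c : firstPage) {
            if (c.fontSize >= ratio * largest) {
                parts.add(c.text.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Assumed layout: small first line, large second line, smaller subtitle.
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Speak softly and", 18f));
        page.add(new Chunk("Carry a Big Stick", 36f));
        page.add(new Chunk("A biography of Theodore Roosevelt", 14f));

        // Strict rule: only the 36pt line survives, so the title is cut in half.
        System.out.println(guessTitle(page, 1.0f)); // prints "Carry a Big Stick"

        // Relaxed rule: 18pt now qualifies (18 >= 0.5 * 36), 14pt does not.
        System.out.println(guessTitle(page, 0.5f)); // prints "Speak softly and Carry a Big Stick"
    }
}
```

A ratio of 0.5 happens to work for these assumed sizes, but lowering it a
little further would pull the subtitle into the title as well, so the relaxed
rule is only as good as the threshold chosen for a particular set of PDFs.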
 
--Mark Storer
  Senior Software Engineer
  Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

From: Balder [mailto:[email protected]]
Sent: Friday, June 24, 2011 8:31 AM
To: [email protected]
Subject: Re: [iText-questions] How to extract title / heading from document contents


This depends on the PDF. Is the PDF tagged? If so, you might be able to find
out what the title and headings are. If it's not tagged, good luck guessing
the title and heading from the text found in the document.

On 24/06/2011 14:10, modie wrote: 
Hi,

Sorry, I am new to iTextSharp and cannot find documentation for it anywhere
other than this forum. I am looking to extract content from a PDF document,
but I need to be able to understand the structure / markup in the document.

I want to extract the heading / title for the document, which would generally
be found on the first page. Any ideas how I would do this? In HTML I would look
for the h1 or h2 tag.

PS - no, I don't want the title property of the document.
 

--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/How-to-extract-title-heading-from-document-contents-tp3622357p3622357.html
Sent from the iText - General mailing list archive at Nabble.com.

------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a 
definitive record of customers, application performance, security 
threats, fraudulent activity and more. Splunk takes this data and makes 
sense of it. Business sense. IT sense. Common sense.. 
http://p.sf.net/sfu/splunk-d2d-c1
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference 
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: 
http://itextpdf.com/themes/keywords.php


  -- 

  @redlabbe
redlab-log 



