Hi Mark,

Thanks for the response. Yes, in my case there are no tags, and I will have
to make some assumptions to determine the correct title. I can see how to
extract the plain text and the fonts used on a page (through the resources
dictionary), but I haven't seen the part of the API that lets me extract
pieces of text along with the font and size being used for each piece.
Any ideas how I go about doing that?

Michael
Date: Fri, 24 Jun 2011 10:38:05 -0700
From: [email protected]
To: [email protected]
Subject: Re: [iText-questions] How to extract title / heading from document
contents
iText can give you the font name, size, and location of all the text on the
page. Without PDF structure, it is up to you to interpret that information.

For a title, you might consider all the text in the largest font on the first
page to be the title. Such a heuristic will be brittle. You could refine or
relax it in various ways to work better with your particular PDFs, but at the
end of the day, it will always be possible to find (or create) a PDF that
will break your heuristic.
A biography of Theodore Roosevelt might be entitled:

    Speak softly
    and
    Carry a Big
    Stick.

A reasonable heuristic could determine this title to be "Carry a Big Stick".
And it would be wrong.
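
The largest-font heuristic described above can be sketched in a few lines of
Java. This is only an illustration, not iText API: the TextChunk class, the
guessTitle method, the tolerance parameter, and the font sizes in main are
all hypothetical, and the sketch assumes the text runs and their sizes have
already been pulled from the first page by whatever extraction code you use.
The sample data reproduces the Roosevelt case, assuming the last two lines
are set in the largest font.

```java
import java.util.ArrayList;
import java.util.List;

class TitleHeuristic {

    /** Hypothetical holder: one run of text and the font size it was drawn in. */
    static final class TextChunk {
        final String text;
        final float fontSize;

        TextChunk(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Joins, in the order given, every non-blank chunk whose font size is
     * within `tolerance` of the largest size found on the page.
     */
    static String guessTitle(List<TextChunk> firstPageChunks, float tolerance) {
        // Pass 1: find the largest font size on the page.
        float largest = 0f;
        for (TextChunk c : firstPageChunks) {
            largest = Math.max(largest, c.fontSize);
        }
        // Pass 2: keep only the chunks drawn at (roughly) that size.
        List<String> parts = new ArrayList<>();
        for (TextChunk c : firstPageChunks) {
            String trimmed = c.text.trim();
            if (!trimmed.isEmpty() && largest - c.fontSize <= tolerance) {
                parts.add(trimmed);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Font sizes below are assumed purely for illustration.
        List<TextChunk> page = new ArrayList<>();
        page.add(new TextChunk("Speak softly", 18f));
        page.add(new TextChunk("and", 12f));
        page.add(new TextChunk("Carry a Big", 24f));
        page.add(new TextChunk("Stick", 24f));
        System.out.println(guessTitle(page, 0.5f));
    }
}
```

With these assumed sizes the sketch returns only the two largest lines and
drops "Speak softly" and "and", which is exactly the failure described above.
Raising the tolerance, or also accepting the second-largest size, are the
kinds of refinements that help with one set of PDFs and break on another.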
--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
From: Balder [mailto:[email protected]]
Sent: Friday, June 24, 2011 8:31 AM
To: [email protected]
Subject: Re: [iText-questions] How to extract title / heading from document
contents
This depends on the PDF. Is the PDF tagged? If so, you might be able to find
out what the title and headings are. If it's not tagged, good luck guessing
the title and headings from the text found in the document.
On 24/06/2011 14:10, modie wrote:
Hi,
Sorry, I am new to iTextSharp and cannot find documentation for it anywhere
other than this forum. I am looking to extract content from a PDF document,
but I need to be able to understand the structure / markup in the document.
I want to extract the heading / title of the document, which would generally
be found on the first page. Any ideas how I would do this? In HTML I would
look for the h1 or h2 tag.
PS - no, I don't want the title property of the document
--
View this message in context:
http://itext-general.2136553.n4.nabble.com/How-to-extract-title-heading-from-document-contents-tp3622357p3622357.html
Sent from the iText - General mailing list archive at Nabble.com.
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense..
http://p.sf.net/sfu/splunk-d2d-c1
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference
to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples:
http://itextpdf.com/themes/keywords.php
--
@redlabbe
redlab-log