----------------------------------------
> Date: Mon, 21 Jun 2010 09:49:44 +0100
> From: b...@benshort.co.uk
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] NPE while Extracting text
>
> Thanks very much for this information.
>
> Maybe you could offer me some direction on how to solve my problem?
>
> I need to parse PDF mobile phone bills. The information I require is
> the itemized data that is in a table format. Is this possible with
> itextpdf?


I know this won't help you, but let's be clear: PDF is NOT the format
of choice for DATA or INFORMATION. It is generally about
human readability, and while a document often has a describable
structure, everyone here tells me it is too complicated to include
that in the PDF file. If you have a choice, and have a cooperative
relationship with the source of the documents, you want an INFORMATION
format, not a bunch of pixels. "Scraping" HTML or PDF is
often done by people trying to extract information from artwork,
but you always need to make assumptions about the document
structure. If you want a robust means to do this,
at least work out some conventions with the document authors.

The great leap in information representation in going from
pictures to an alphabet is that fonts don't matter. You
probably want to extract the text and scrap the font
stuff. If the text cannot be extracted easily from the PDF itself,
you need to reduce it to pixels and then extract it with
OCR software. Or get the document author to include only
the important stuff to begin with.
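
For what it's worth, once you do have plain text out of the PDF (by
whatever means), the "scraping" step might look roughly like the
sketch below. It only illustrates the point about assumptions: the
row layout (date, time, number, duration, cost), the class name
BillLineParser and the sample data are all made up by me, not taken
from your bills or from iText, so expect to rewrite the pattern for
the real documents.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns already-extracted text lines into itemized rows.
public class BillLineParser {

    /** One itemized row. The field set is an ASSUMPTION about the bill layout. */
    public static final class Item {
        public final String date;
        public final String time;
        public final String number;
        public final String duration;
        public final BigDecimal cost;

        Item(String date, String time, String number, String duration, BigDecimal cost) {
            this.date = date;
            this.time = time;
            this.number = number;
            this.duration = duration;
            this.cost = cost;
        }
    }

    // Assumed row layout: "21/06/2010 09:49 07700900123 00:03:12 0.45"
    private static final Pattern ROW = Pattern.compile(
            "^(\\d{2}/\\d{2}/\\d{4})\\s+(\\d{2}:\\d{2})\\s+(\\+?\\d+)"
            + "\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+\\.\\d{2})$");

    /** Keeps lines that match the assumed layout and skips everything else. */
    public static List<Item> parse(List<String> lines) {
        List<Item> items = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                items.add(new Item(m.group(1), m.group(2), m.group(3),
                        m.group(4), new BigDecimal(m.group(5))));
            }
            // Headers, totals and page footers do not match and are dropped.
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for whatever the extractor returns.
        List<String> sample = Arrays.asList(
                "Date Time Number Duration Cost",
                "21/06/2010 09:49 07700900123 00:03:12 0.45",
                "Total 0.45");
        for (Item item : parse(sample)) {
            System.out.println(item.date + " " + item.number + " " + item.cost);
        }
    }
}
```

Note how brittle this is: if the carrier reorders a column or the
extractor returns the cells in a different order, the pattern quietly
matches nothing. That is exactly why agreed conventions with the
document authors beat guessing.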

>
> On 19 June 2010 08:44, 1T3XT info  wrote:
>> Ben Short wrote:
>>> subType is /Type3
>>>
>>> Does this help identify the problem?
>>
>> Yes, but it doesn't bring us closer to a solution.
>>
>> Type 3 fonts are "user defined fonts".
>>
>> See for instance:
>> http://itextpdf.com/examples/index.php?page=example&id=200
>> In that example, a 'delta' and 'sigma' shaped glyph was defined,
>> corresponding with the characters 'D' and 'S'. However, the example
>> would also have worked if we'd used any other character.
>>
>> Another example: we could define a glyph that looks like the symbol for
>> 'The Artist Formerly Known As Prince' to correspond with the character
>> 'P'. That's what Type 3 fonts are about: they can be used when a user
>> needs a glyph that isn't provided in any other font.
>> Therefore it's very hard to extract that content: how are you going to
>> know that the glyph corresponding with 'P' needs to be 'translated' to
>> 'The Artist Formerly Known As Prince'? I don't think there's a UNICODE
>> code point for that glyph.
>>
>> I think you've hit a limitation regarding text extraction in general.
>> --
>> This answer is provided by 1T3XT BVBA
>> http://www.1t3xt.com/ - http://www.1t3xt.info
>>
>> ------------------------------------------------------------------------------
>> ThinkGeek and WIRED's GeekDad team up for the Ultimate
>> GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
>> lucky parental unit.  See the prize list and enter to win:
>> http://p.sf.net/sfu/thinkgeek-promo
>> _______________________________________________
>> iText-questions mailing list
>> iText-questions@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/itext-questions
>>
>> Buy the iText book: http://www.itextpdf.com/book/
>> Check the site with examples before you ask questions: 
>> http://www.1t3xt.info/examples/
>> You can also search the keywords list: http://1t3xt.info/tutorials/keywords/
>>
>
                                          
_________________________________________________________________
The New Busy is not the old busy. Search, chat and e-mail from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_3