I hadn't noticed that you were using SimpleTextExtractionStrategy. You should
definitely try the default text extraction strategy
(LocationTextExtractionStrategy) - it's a lot better at pulling meaningful
text from PDFs.

As for doing text matching while you are doing the extraction, you certainly
can do that by writing your own text extraction strategy, but I doubt very
much that it would be worth doing it that way - the time cost of parsing the
PDF is *way* higher than any post-processing step you might be performing.

That said, it looks like you are doing a bunch of regex substitutions, and
running them one after another could be a performance bottleneck. I'd suggest
that you take the text from the extraction strategy, then make a single pass
over it that performs all of your substitutions. That's not really an iText
question, just a general text processing question.
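
To make the single-pass idea concrete, here is a minimal sketch using only
java.util.regex. It is not iText code, and the class name, method names and
sample replacement table are all made up for illustration. It also assumes
your search terms are literal strings; if they are real regex patterns you
would need to adapt how the combined pattern is built.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: applies many literal substitutions in one scan of the text.
public class SinglePassReplace {

    // Combine all search terms into one alternation, quoting each term so it
    // is matched literally. Earlier entries win when two terms could match at
    // the same position, so list longer terms before their prefixes.
    static Pattern compile(Map<String, String> replacements) {
        String alternation = replacements.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Scan the text once, swapping each matched term for its replacement.
    static String replaceAll(String text, Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            return text;
        }
        Matcher m = compile(replacements).matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = replacements.get(m.group());
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample table (assumed): typical cleanup of extracted PDF text.
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("\u00a0", " ");  // non-breaking space to plain space
        replacements.put("\ufb01", "fi"); // "fi" ligature to two letters
        replacements.put("--", "-");
        System.out.println(replaceAll("of\ufb01ce--hours", replacements));
    }
}
```

In real use you would compile the pattern once and reuse it for every page
rather than rebuilding it on each call. Because the text is scanned only once,
the output of one substitution is never fed into another, which can differ
from what a chain of separate replaceAll calls would give you.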



