That sounds like a great heuristic to me.  I don't think it would be all
that bad to iterate through the page structure once, building your own
representation of each line as you go.
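
To make the single-pass idea concrete, here is a rough sketch in Python rather than Java, just to keep it short. It is not PDFBox code: the (x, y, text) tuples stand in for whatever you read off each TextPosition, and Line, build_lines and y_tolerance are names I made up for illustration. It also assumes the glyphs arrive in reading order.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """Hypothetical per-line record built while walking the page once."""
    y: float                                   # baseline of the first glyph on the line
    parts: list = field(default_factory=list)  # text fragments in arrival order

    @property
    def text(self):
        return "".join(self.parts)


def build_lines(glyphs, y_tolerance=1.0):
    """Group (x, y, text) tuples into lines in a single pass.

    A glyph joins the current line when its baseline is within
    y_tolerance of that line's baseline; otherwise it starts a new line.
    """
    lines = []
    for _x, y, text in glyphs:
        if lines and abs(y - lines[-1].y) <= y_tolerance:
            lines[-1].parts.append(text)
        else:
            lines.append(Line(y=y, parts=[text]))
    return lines
```

Once you have these line records, everything else (spacing statistics, header detection) can work from them without touching the glyphs a second time.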

Then, if you look at the histogram of line spacings, I think you can find an
optimal break-point very easily.  Font, size and color may be important cues
as well for detecting headers and footers.
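
One possible way to pick the break-point, sketched in Python under my own assumptions: sort the observed line spacings and put the threshold in the middle of the widest jump between neighbouring values. That "widest jump" rule, the min_ratio guard, and the function names are all illustrative choices, not anything PDFBox provides. Very large gaps (say, around headers and footers) would dominate the widest jump, so you would probably want to drop those first.

```python
def find_break_point(spacings, min_ratio=1.2):
    """Return a spacing threshold separating normal gaps from paragraph gaps.

    Sorts the spacings and places the threshold midway across the widest
    jump between neighbouring values. Returns None when the spacings are
    too uniform to split (no jump reaches min_ratio).
    """
    values = sorted(s for s in spacings if s > 0)
    if len(values) < 2:
        return None
    best_jump, threshold = 0.0, None
    for lo, hi in zip(values, values[1:]):
        jump = hi - lo
        if jump > best_jump and hi >= min_ratio * lo:
            best_jump, threshold = jump, (lo + hi) / 2
    return threshold


def paragraph_starts(line_ys):
    """Return indices of lines that begin a paragraph, given line baselines."""
    spacings = [abs(b - a) for a, b in zip(line_ys, line_ys[1:])]
    threshold = find_break_point(spacings)
    starts = [0] if line_ys else []
    if threshold is None:
        return starts
    for i, gap in enumerate(spacings, start=1):
        if gap > threshold:
            starts.append(i)
    return starts
```

For example, baselines spaced 12 units apart with an occasional 24-unit gap give a threshold of 18, and each line following a 24-unit gap is marked as a paragraph start.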

On Sun, Sep 6, 2009 at 4:44 AM, Jason Harrop <jhar...@gmail.com> wrote:

> I've been playing a little with adding paragraph markers in
> PDFTextStripper.  I'm using a crude algorithm which estimates normal
> line spacing, and inserts a paragraph marker when a greater spacing is
> detected.
>
> How best to do this?
>
> A first question is how important it is to avoid iterating over the
> TextPosition objects in a page a second time?
>



-- 
Ted Dunning, CTO
DeepDyve
