That sounds like a great heuristic to me. I don't think that it would be all that bad to iterate through the page structure and create your own representations of lines as you pass over the page once.
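To make that concrete, here is a rough sketch of a single pass that builds line objects and then picks a spacing threshold. It is only an illustration under some assumptions: `Glyph` and `Line` are hypothetical stand-ins (not PDFBox's TextPosition), the fragments are assumed to arrive top to bottom in reading order, and the "break-point" is simply the midpoint of the widest jump between sorted line spacings, which is one simple choice among many.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphBreaks {

    /** Hypothetical positioned text fragment; NOT the PDFBox TextPosition class. */
    static final class Glyph {
        final double y;     // baseline position, assumed to increase down the page
        final String text;
        Glyph(double y, String text) { this.y = y; this.text = text; }
    }

    /** Our own representation of a line, built while passing over the page once. */
    static final class Line {
        final double y;
        final StringBuilder text = new StringBuilder();
        Line(double y) { this.y = y; }
    }

    /** Single pass: start a new line whenever the baseline moves by more than the tolerance. */
    static List<Line> groupIntoLines(List<Glyph> glyphs, double tolerance) {
        List<Line> lines = new ArrayList<>();
        Line current = null;
        for (Glyph g : glyphs) {
            if (current == null || Math.abs(g.y - current.y) > tolerance) {
                current = new Line(g.y);
                lines.add(current);
            }
            current.text.append(g.text);
        }
        return lines;
    }

    /**
     * Sort the line spacings and return the midpoint of the widest jump between
     * neighbouring values. Returns +infinity when the spacings are all equal
     * (or there are fewer than two), so that no break is reported.
     */
    static double breakPoint(List<Line> lines) {
        double[] gaps = new double[Math.max(0, lines.size() - 1)];
        for (int i = 1; i < lines.size(); i++) {
            gaps[i - 1] = lines.get(i).y - lines.get(i - 1).y;
        }
        Arrays.sort(gaps);
        double threshold = Double.POSITIVE_INFINITY;
        double widest = 0;
        for (int i = 1; i < gaps.length; i++) {
            double jump = gaps[i] - gaps[i - 1];
            if (jump > widest) {
                widest = jump;
                threshold = (gaps[i] + gaps[i - 1]) / 2;
            }
        }
        return threshold;
    }

    /** Indices of lines that begin a paragraph: the first line, plus any line after a large gap. */
    static List<Integer> paragraphStarts(List<Line> lines) {
        double threshold = breakPoint(lines);
        List<Integer> starts = new ArrayList<>();
        if (!lines.isEmpty()) {
            starts.add(0);
        }
        for (int i = 1; i < lines.size(); i++) {
            if (lines.get(i).y - lines.get(i - 1).y > threshold) {
                starts.add(i);
            }
        }
        return starts;
    }
}
```

Since the line objects are kept around, the spacing analysis runs over the lines rather than over the individual text positions again. A real page would also need handling for multiple columns and for content that is not stored in reading order, which this sketch ignores.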
Then, if you look at the histogram of line spacings, I think you can find an optimal break-point very easily. Font, size and color may be important cues as well for detecting headers and footers.

On Sun, Sep 6, 2009 at 4:44 AM, Jason Harrop <jhar...@gmail.com> wrote:

> I've been playing a little with adding paragraph markers in
> PDFTextStripper. I'm using a crude algorithm which estimates normal
> line spacing, and inserts a paragraph marker when a greater spacing is
> detected.
>
> How best to do this.
>
> A first question is how important it is to avoid iterating over the
> TextPosition objects in a page a second time?
>

--
Ted Dunning, CTO
DeepDyve