Re: [poppler] Extract title from pdf file.

Albert Astals Cid Thu, 10 Nov 2011 07:07:23 -0800

On Thursday, 10 November 2011, Peter A. Kerzum wrote:
> Alec,
> 
> I'd like to opensource this feature, but I need to manage it first.
> The code is not mine


Remember, if you link to poppler, you are obliged by the license to make your 
code GPL.

Albert

> 
> On Thursday 10 November 2011 00:12:09 Alec Taylor wrote:
> > What are we looking for?
> > 
> > Size? - Font? - Position?
> 
> Yes, Size, face, Bold / Italic, UPPER | Title, number of alpha / non-alpha
> symbols
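
The properties listed above (size, face, bold/italic, upper or title case,
alpha and non-alpha counts) could be collected per text span roughly as
follows. This is only a minimal sketch: the `Span` type and `span_features`
function are hypothetical names, and it assumes some text-extraction layer
already supplies the font information for each span.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical container for one sentence or line of extracted text."""
    text: str
    font_size: float
    font_face: str
    bold: bool
    italic: bool


def span_features(span: Span) -> dict:
    """Turn one span into the numeric/boolean properties named in the thread."""
    text = span.text.strip()
    # Count letters, and everything else that is not whitespace.
    alpha = sum(ch.isalpha() for ch in text)
    non_alpha = sum((not ch.isalpha()) and (not ch.isspace()) for ch in text)
    return {
        "font_size": span.font_size,
        "font_face": span.font_face,
        "bold": span.bold,
        "italic": span.italic,
        "is_upper": text.isupper(),        # UPPER CASE
        "is_title_case": text.istitle(),   # Title Case
        "alpha_count": alpha,
        "non_alpha_count": non_alpha,
    }
```

Each dictionary can then be fed to whatever scoring step is used downstream.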
> 
> > - Previous/next page is copyright?
> > 
> > 2011/11/10 Josh Richardson <[email protected]>:
> > > The machine-learning approach seems like a good idea for finding
> > > section headings, and maybe the title too.  For finding the
> > > document title, you might want to look at only the first, or maybe
> > > first few pages, rather than every sentence in the document?
> > > 
> > > --josh
> > > 
> > > On 11/9/11 9:18 AM, "Alec Taylor" <[email protected]> wrote:
> > >>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum
> > >><[email protected]> wrote:
> > >>> Hi
> > >>> 
> > >>>> Describe your method!
> > >>> 
> > >>> - for every sentence of text, get some numeric or boolean
> > >>>   properties, like font, layout and character distribution.
> > >>> - use a machine learning algorithm to build a formula that maps
> > >>>   those properties to a score
> > >>> - for every document, select the sentence with the greatest
> > >>>   score. Filter out some sentences, based on a dictionary
> > >>>   (like URLs, etc.)
> > >>> 
> > >>> machine with 15 properties works reasonably well
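
The three steps quoted above can be sketched as a score-and-select loop. This
is not the code discussed in the thread (which was not posted): the feature
names, the hand-set `WEIGHTS`, and the URL filter are assumptions chosen for
illustration. In the described method the weights would come from a machine
learning algorithm rather than being written by hand.

```python
import re
from typing import Dict, List, Optional, Tuple

# Crude filter for sentences that are obviously not titles (URLs here).
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)

# Hypothetical weights; a trained model would supply these instead.
WEIGHTS: Dict[str, float] = {
    "font_size": 1.0,
    "bold": 3.0,
    "is_title_case": 2.0,
    "non_alpha_ratio": -10.0,
}


def score(features: dict) -> float:
    """Linear formula mapping a sentence's properties to a score."""
    return sum(w * float(features.get(name, 0.0)) for name, w in WEIGHTS.items())


def pick_title(candidates: List[Tuple[str, dict]]) -> Optional[str]:
    """Return the text of the highest-scoring candidate, or None if none remain."""
    kept = [(text, feats) for text, feats in candidates if not URL_RE.search(text)]
    if not kept:
        return None
    return max(kept, key=lambda item: score(item[1]))[0]
```

Restricting `candidates` to sentences from the first page or two before
calling `pick_title` would keep the search small.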
> > >>
> > >>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
> > >>are your initial estimations?
> > >>
> > >>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> > >>>> Hi Peter,
> > >>>> 
> > >>>> 
> > >>>> Cheers,
> > >>>> 
> > >>>> Alec Taylor
> > >>>> 
> > >>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum
> > >>>> <[email protected]> wrote:
> > >>>> > Hi!
> > >>>> > 
> > >>>> > We use an approach based on character properties to
> > >>>> > extract a meaningful title from the document text. Metadata
> > >>>> > usually stores the filename in the title field.
> > >>>> > 
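
One way to act on the observation that the metadata title often holds a
filename is to test for that before trusting it. A rough sketch, assuming the
metadata title has already been read into a string; `looks_like_filename` and
its extension list are hypothetical and far from exhaustive.

```python
import re

# Titles ending in a document-file extension are treated as filenames.
FILENAME_RE = re.compile(r"\.(pdf|docx?|rtf|tex|ps|dvi)$", re.IGNORECASE)


def looks_like_filename(title: str) -> bool:
    """Guess whether a metadata title is really just a leftover filename."""
    stripped = title.strip()
    if not stripped:
        return False
    return bool(FILENAME_RE.search(stripped))
```

A caller could fall back to text-based extraction whenever this returns True
or the metadata title is empty.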
> > >>>> > --
> > >>>> > Peter
> > >>>> > 
> > >>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> > >>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid
> > >>>> >> <[email protected]> wrote:
> > >>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> > >>>> >> >> Incorrect, all getDocInfo tells you is what the
> > >>>> >> >> meta info says, it doesn't analyse the actual
> > >>>> >> >> document, whereas my pdftopdf will update the
> > >>>> >> >> metadata with the appropriate info after PDF
> > >>>> >> >> analysis
> > >>>> >> > 
> > >>>> >> > Please do not top-post; it makes reading e-mail
> > >>>> >> > incredibly hard.
> > >>>> >> > 
> > >>>> >> > And no it is not incorrect, if the metadata does not
> > >>>> >> > have a
> > >>>> >> > title, then the document does not have a title as
> > >>>> >> > defined per
> > >>>> >> > the spec.
> > >>>> >> > 
> > >>>> >> > Albert
> > >>>> >> 
> > >>>> >> But maybe the document doesn't have a title, because it
> > >>>> >> was grabbed from scanning the book, then OCRing it. So
> > >>>> >> what I will facilitate is the generation of proper
> > >>>> >> metadata (+ more) from a current PDF lacking such.
> > >>>> >> 
> > >>>> >> So if the document does have a title, my pdftopdf tool
> > >>>> >> will find
> > >>>> >> it, and add it to the metadata.
> > >>>> >> 
> > >>>> >> I will contribute pdftopdf to poppler.
> > >>>> >> _______________________________________________
> > >>>> >> poppler mailing list
> > >>>> >> [email protected]
> > >>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> > >>> 
> > >>> --
> > >>> Peter Kerzum
> > >>> Search platform development group
> > >>> St. Petersburg, tel. 8508
> > >>
> 
> --
> Peter Kerzum
> Search platform development group
> St. Petersburg, tel. 8508