What are we looking for?

- Size?
- Font?
- Position?
- Previous/next page is copyright?

2011/11/10 Josh Richardson <[email protected]>:
> The machine-learning approach seems like a good idea for finding section
> headings, and maybe the title too. For finding the document title, you
> might want to look at only the first, or maybe first few pages, rather
> than every sentence in the document?
>
> --josh
>
> On 11/9/11 9:18 AM, "Alec Taylor" <[email protected]> wrote:
>
>>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <[email protected]> wrote:
>>> Hi
>>>
>>>> Describe your method!
>>>
>>> - for every sentence of text get some numeric or boolean properties,
>>>   like font, layout and character distribution.
>>> - use machine learning algorithm to build formula that maps those
>>>   properties to score
>>> - for every document select the sentence with the greatest score.
>>>   Filter out some sentences, based on dictionary (like URLs, etc)
>>>
>>> machine with 15 properties works reasonably well
>>
>>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
>>are your initial estimations?
>>
>>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
>>>> Hi Peter,
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Alec Taylor
>>>>
>>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <[email protected]> wrote:
>>>> > Hi!
>>>> >
>>>> > We use some approach based on character properties to extract
>>>> > meaningful title from document text. Metadata usually stores
>>>> > filename in title field.
>>>> >
>>>> > --
>>>> > Peter
>>>> >
>>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
>>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <[email protected]> wrote:
>>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
>>>> >> >> Incorrect, all getDocInfo tells you is what the meta info says,
>>>> >> >> it doesn't analyse the actual document, whereas my pdftopdf will
>>>> >> >> update the metadata with the appropriate info after PDF analysis
>>>> >> >
>>>> >> > Please do not top post, makes reading e-mail incredibly hard.
>>>> >> >
>>>> >> > And no it is not incorrect, if the metadata does not have a title,
>>>> >> > then the document does not have a title as defined per the spec.
>>>> >> >
>>>> >> > Albert
>>>> >>
>>>> >> But maybe the document doesn't have a title, because it was grabbed
>>>> >> from scanning the book, then OCRing it. So what I will facilitate is
>>>> >> the generation of proper metadata (+ more) from a current PDF
>>>> >> lacking such.
>>>> >>
>>>> >> So if the document does have a title, my pdftopdf tool will find it,
>>>> >> and add it to the metadata.
>>>> >>
>>>> >> I will contribute pdftopdf to poppler.
>>>> >> _______________________________________________
>>>> >> poppler mailing list
>>>> >> [email protected]
>>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>> --
>>> Peter Kerzum
>>> Search platform development group
>>> St. Petersburg, tel. 8508
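
For reference, the method Peter outlines in the quoted thread (per-sentence
properties, a learned formula that maps them to a score, pick the
best-scoring sentence, filter out URLs) could be sketched roughly as below.
This is only an illustration under stated assumptions, not Peter's actual
code: the Sentence fields, the six properties, the plain logistic
regression, the toy training data and the max_page cut-off (Josh's "first
few pages" idea) are all made up for the example. A real extractor would
take the layout data from the PDF itself and use far more training
documents.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical "dictionary" filter: only URLs are rejected in this sketch.
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)


@dataclass
class Sentence:
    """Assumed input record; a real tool would fill it from PDF layout data."""
    text: str
    font_size: float  # in points
    page: int         # 0-based page index
    y_frac: float     # vertical position: 0.0 = top of page, 1.0 = bottom
    bold: bool


def features(s):
    """Map a sentence to numeric/boolean properties (six here, chosen ad hoc)."""
    letters = [c for c in s.text if c.isalpha()]
    upper = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return [
        s.font_size / 10.0,                   # size
        1.0 if s.page == 0 else 0.0,          # on the first page?
        1.0 - s.y_frac,                       # closeness to the top of the page
        1.0 if s.bold else 0.0,               # font weight
        upper,                                # share of upper-case letters
        min(len(s.text.split()), 20) / 20.0,  # word count, capped at 20
    ]


def score(weights, bias, x):
    """Logistic score in (0, 1); z is clamped so math.exp cannot overflow."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    examples is a list of (Sentence, label) pairs, label 1 for a title.
    """
    weights = [0.0] * len(features(examples[0][0]))
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            x = features(sentence)
            err = score(weights, bias, x) - label
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def is_filtered(s):
    """True if the sentence should never be considered a title."""
    return bool(URL_RE.search(s.text))


def pick_title(sentences, weights, bias, max_page=2):
    """Return the best-scoring unfiltered sentence on the first max_page pages."""
    candidates = [s for s in sentences
                  if s.page < max_page and not is_filtered(s)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: score(weights, bias, features(s)))


# Toy labelled data, invented for this example only.
TOY_TRAINING = [
    (Sentence("Poppler Rendering Internals", 24.0, 0, 0.10, True), 1),
    (Sentence("A Study of PDF Metadata", 20.0, 0, 0.15, True), 1),
    (Sentence("This chapter describes the parser in detail and more.",
              10.0, 3, 0.50, False), 0),
    (Sentence("All rights reserved by the publisher.", 9.0, 1, 0.90, False), 0),
    (Sentence("Introduction", 14.0, 2, 0.10, True), 0),
    (Sentence("The text continues on the first page here.",
              10.0, 0, 0.60, False), 0),
]
```

Usage would be along the lines of weights, bias = train(TOY_TRAINING)
followed by pick_title(doc_sentences, weights, bias). The properties asked
about at the top of this mail (size, font, position, a neighbouring
copyright page) would simply be further entries in features().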
