On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <[email protected]> wrote:
> Hi
>
>> Describe your method!
>
> - for every sentence of text, get some numeric or boolean properties, like
> font, layout and character distribution.
> - use a machine learning algorithm to build a formula that maps those
> properties to a score
> - for every document, select the sentence with the greatest score. Filter out
> some sentences based on a dictionary (like URLs, etc.)
>
> machine with 15 properties works reasonably well

Doesn't sound very accurate... I can get ~98.5% accuracy. What are your initial estimates?

> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
>> Hi Peter,
>>
>>
>> Cheers,
>>
>> Alec Taylor
>>
>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <[email protected]> wrote:
>> > Hi!
>> >
>> > We use an approach based on character properties to extract a meaningful
>> > title from the document text. Metadata usually stores the filename in the
>> > title field.
>> >
>> > --
>> > Peter
>> >
>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <[email protected]> wrote:
>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
>> >> >> Incorrect, all getDocInfo tells you is what the meta info says, it
>> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
>> >> >> the metadata with the appropriate info after PDF analysis
>> >> >
>> >> > Please do not top post, makes reading e-mail incredibly hard.
>> >> >
>> >> > And no it is not incorrect, if the metadata does not have a title,
>> >> > then the document does not have a title as defined per the spec.
>> >> >
>> >> > Albert
>> >>
>> >> But maybe the document doesn't have a title, because it was grabbed
>> >> from scanning the book, then OCRing it. So what I will facilitate is
>> >> the generation of proper metadata (+ more) from a current PDF lacking
>> >> such.
>> >>
>> >> So if the document does have a title, my pdftopdf tool will find it,
>> >> and add it to the metadata.
>> >>
>> >> I will contribute pdftopdf to poppler.
>> >> _______________________________________________
>> >> poppler mailing list
>> >> [email protected]
>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
>
> --
> Peter Kerzum
> Search platform development group
> St. Petersburg, tel. 8508
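
For reference, the method Peter outlines in the quoted message (per-sentence properties, a learned formula that maps them to a score, dictionary-based filtering, then picking the top-scoring sentence) could look roughly like the Python sketch below. This is only an illustration under stated assumptions, not Peter's implementation: the `Sentence` record, the five example properties, the hand-set `WEIGHTS` (standing in for whatever a machine learning algorithm would produce from labelled documents), and the URL pattern are all hypothetical. The thread mentions 15 properties and does not name them or the learning algorithm.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical record for one sentence extracted from a document."""
    text: str
    font_size: float   # font property, in points
    y_position: float  # layout property: 0.0 = top of page, 1.0 = bottom
    page: int = 0      # zero-based page index


# Assumed filter: drop sentences that look like URLs.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def features(s):
    """Map a sentence to a few numeric properties (font, layout, characters)."""
    letters = [c for c in s.text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    digit_ratio = sum(c.isdigit() for c in s.text) / max(len(s.text), 1)
    return [
        s.font_size,                   # larger fonts suggest a heading
        1.0 - s.y_position,            # closer to the top of the page
        upper_ratio,                   # share of capital letters
        digit_ratio,                   # share of digits
        1.0 if s.page == 0 else 0.0,   # appears on the first page
    ]


# Placeholder weights, one per property. In the described method these
# would be learned from labelled examples rather than set by hand.
WEIGHTS = [0.1, 1.0, 0.5, -2.0, 1.0]


def score(s):
    """Linear formula mapping the properties to a single score."""
    return sum(w * x for w, x in zip(WEIGHTS, features(s)))


def pick_title(sentences):
    """Return the text of the best-scoring unfiltered sentence, or None."""
    candidates = [s for s in sentences if not URL_RE.search(s.text)]
    if not candidates:
        return None
    return max(candidates, key=score).text


doc = [
    Sentence("http://example.org/report.pdf", 24.0, 0.02),
    Sentence("Extracting Titles from Scanned Documents", 18.0, 0.10),
    Sentence("This report was produced in 2011 by the team.", 10.0, 0.40),
]
title = pick_title(doc)
```

With these placeholder weights, the URL line is filtered out despite its large font, and the large-font sentence near the top of the first page outscores the body text. How well any such formula does on real documents depends entirely on the properties chosen and the training data, which the thread does not describe.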
