Alec, I'd like to open-source this feature, but I need to arrange that first. The code is not mine.
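Until then, here is a rough sketch of the idea described in the quoted thread below: compute a few properties per sentence, map them to a score, drop sentences that look like URLs, and pick the best one. This is not the actual code. The Sentence type, the feature names, the hand-set weights (standing in for a learned formula) and the max_page cut-off (Josh's "first few pages" suggestion) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical container for one sentence plus its layout properties."""
    text: str
    font_size: float
    bold: bool
    italic: bool
    page: int  # 1-based page number


# Crude stand-in for the dictionary-based filter (URLs etc.).
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)

# Hand-set weights for illustration only. In the approach described in the
# thread, a machine-learning algorithm would fit this mapping from data.
WEIGHTS = {
    "font_size": 1.0,
    "bold": 2.0,
    "italic": 0.5,
    "upper": 1.0,
    "title_case": 1.5,
    "alpha_ratio": 3.0,
}


def features(s):
    """Turn a sentence into numeric / boolean-as-float properties."""
    alpha = sum(c.isalpha() for c in s.text)
    non_space = sum(not c.isspace() for c in s.text)
    return {
        "font_size": s.font_size,
        "bold": float(s.bold),
        "italic": float(s.italic),
        "upper": float(s.text.isupper()),
        "title_case": float(s.text.istitle()),
        "alpha_ratio": alpha / non_space if non_space else 0.0,
    }


def score(s):
    """Linear score: weighted sum of the properties."""
    f = features(s)
    return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)


def guess_title(sentences, max_page=2):
    """Return the text of the best-scoring sentence, or None if none qualify."""
    candidates = [
        s for s in sentences
        if s.page <= max_page and not URL_RE.search(s.text)
    ]
    if not candidates:
        return None
    return max(candidates, key=score).text
```

Calling guess_title() on the sentences of one document returns the top-scoring candidate from the first pages. A real model would need labelled documents to fit the weights, and more properties than the six shown here.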
On Thursday 10 November 2011 00:12:09 Alec Taylor wrote:
> What are we looking for?
>
> - Size?
> - Font?
> - Position?

Yes: size, face, bold / italic, UPPER | Title case, number of alpha / non-alpha symbols.

> - Previous/next page is copyright?
>
> 2011/11/10 Josh Richardson <[email protected]>:
> > The machine-learning approach seems like a good idea for finding section
> > headings, and maybe the title too. For finding the document title, you
> > might want to look at only the first, or maybe first few pages, rather
> > than every sentence in the document?
> >
> > --josh
> >
> > On 11/9/11 9:18 AM, "Alec Taylor" <[email protected]> wrote:
> >>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <[email protected]> wrote:
> >>> Hi
> >>>
> >>>> Describe your method!
> >>>
> >>> - for every sentence of text, get some numeric or boolean properties,
> >>>   like font, layout and character distribution.
> >>> - use a machine-learning algorithm to build a formula that maps those
> >>>   properties to a score
> >>> - for every document, select the sentence with the greatest score.
> >>>   Filter out some sentences based on a dictionary (like urls, etc)
> >>>
> >>> A machine with 15 properties works reasonably well
> >>
> >>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
> >>are your initial estimations?
> >>
> >>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> >>>> Hi Peter,
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Alec Taylor
> >>>>
> >>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <[email protected]> wrote:
> >>>> > Hi!
> >>>> >
> >>>> > We use an approach based on character properties to extract a
> >>>> > meaningful title from the document text. Metadata usually stores
> >>>> > the filename in the title field.
> >>>> >
> >>>> > --
> >>>> > Peter
> >>>> >
> >>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> >>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <[email protected]> wrote:
> >>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> >>>> >> >> Incorrect, all getDocInfo tells you is what the meta info says, it
> >>>> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
> >>>> >> >> the metadata with the appropriate info after PDF analysis
> >>>> >> >
> >>>> >> > Please do not top post, makes reading e-mail incredibly hard.
> >>>> >> >
> >>>> >> > And no it is not incorrect, if the metadata does not have a
> >>>> >> > title, then the document does not have a title as defined per
> >>>> >> > the spec.
> >>>> >> >
> >>>> >> > Albert
> >>>> >>
> >>>> >> But maybe the document doesn't have a title, because it was grabbed
> >>>> >> from scanning the book, then OCRing it. So what I will facilitate
> >>>> >> is the generation of proper metadata (+ more) from a current PDF
> >>>> >> lacking such.
> >>>> >>
> >>>> >> So if the document does have a title, my pdftopdf tool will find
> >>>> >> it, and add it to the metadata.
> >>>> >>
> >>>> >> I will contribute pdftopdf to poppler.
> >>>> >> _______________________________________________
> >>>> >> poppler mailing list
> >>>> >> [email protected]
> >>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> >>>
> >>> --
> >>> Peter Kerzum
> >>> Search Platform Development Group
> >>> St. Petersburg, tel. 8508

--
Peter Kerzum
Search Platform Development Group
St. Petersburg, tel. 8508
