On Thursday, 10 November 2011, Peter A. Kerzum wrote:
> Alec,
>
> I'd like to opensource this feature, but I need to manage it first.
> The code is not mine
Remember, if you link to poppler, you are obliged by the license to make your
code GPL.

Albert

>
> On Thursday 10 November 2011 00:12:09 Alec Taylor wrote:
> > What are we looking for?
> >
> > Size? - Font? - Position?
>
> Yes, Size, face, Bold / Italic, UPPER | Title, number of alpha / non-alpha
> symbols
>
> > - Previous/next page is copyright?
> >
> > 2011/11/10 Josh Richardson <[email protected]>:
> > > The machine-learning approach seems like a good idea for finding
> > > section headings, and maybe the title too. For finding the
> > > document title, you might want to look at only the first, or maybe
> > > first few pages, rather than every sentence in the document?
> > >
> > > --josh
> > >
> > > On 11/9/11 9:18 AM, "Alec Taylor" <[email protected]> wrote:
> > >>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum
> > >><[email protected]> wrote:
> > >>> Hi
> > >>>
> > >>>> Describe your method!
> > >>>
> > >>> - for every sentence of text get some numeric or boolean
> > >>>   properties, like font, layout and character distribution.
> > >>> - use machine learning algorithm to build formula that maps
> > >>>   those properties to score
> > >>> - for every document select the sentence with the greatest
> > >>>   score. Filter out some sentences, based on dictionary
> > >>>   (like urls, etc)
> > >>>
> > >>> machine with 15 properties works reasonably well
> > >>
> > >>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
> > >>are your initial estimations?
> > >>
> > >>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> > >>>> Hi Peter,
> > >>>>
> > >>>>
> > >>>> Cheers,
> > >>>>
> > >>>> Alec Taylor
> > >>>>
> > >>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum
> > >>>> <[email protected]> wrote:
> > >>>> > Hi!
> > >>>> >
> > >>>> > We use some approach based on character properties to
> > >>>> > extract meaningful title from document text. Metadata
> > >>>> > usually stores filename in title field.
> > >>>> >
> > >>>> > --
> > >>>> > Peter
> > >>>> >
> > >>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> > >>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid
> > >>>> >> <[email protected]> wrote:
> > >>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> > >>>> >> >> Incorrect, all getDocInfo tells you is what the
> > >>>> >> >> meta info says, it doesn't analyse the actual
> > >>>> >> >> document, whereas my pdftopdf will update the
> > >>>> >> >> metadata with the appropriate info after PDF
> > >>>> >> >> analysis
> > >>>> >> >
> > >>>> >> > Please do not top post, makes reading e-mail
> > >>>> >> > incredibly hard.
> > >>>> >> >
> > >>>> >> > And no it is not incorrect, if the metadata does not
> > >>>> >> > have a title, then the document does not have a
> > >>>> >> > title as defined per the spec.
> > >>>> >> >
> > >>>> >> > Albert
> > >>>> >>
> > >>>> >> But maybe the document doesn't have a title, because it
> > >>>> >> was grabbed from scanning the book, then OCRing it. So
> > >>>> >> what I will facilitate is the generation of proper
> > >>>> >> metadata (+ more) from a current PDF lacking such.
> > >>>> >>
> > >>>> >> So if the document does have a title, my pdftopdf tool
> > >>>> >> will find it, and add it to the metadata.
> > >>>> >>
> > >>>> >> I will contribute pdftopdf to poppler.
> > >>>> >> _______________________________________________
> > >>>> >> poppler mailing list
> > >>>> >> [email protected]
> > >>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> > >>>
> > >>> --
> > >>> Пётр Керзум
> > >>> Search platform development group
> > >>> St. Petersburg, tel. 8508
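
For anyone following the thread, the approach Peter outlines in the quoted messages (per-sentence properties, a formula that maps them to a score, pick the highest-scoring sentence, filter out things like URLs) could be sketched roughly as below. This is not Peter's code, which has not been published. The `Sentence` type, the six properties, and the hand-picked `WEIGHTS` are all hypothetical stand-ins; a real system would learn the weights from labelled documents and would get font information from a PDF library rather than from hand-built objects.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentence:
    """Hypothetical container for one sentence plus layout information."""
    text: str
    font_size: float
    bold: bool
    page: int  # zero-based page index


# Crude dictionary-style filter: drop anything that looks like a URL.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def features(s: Sentence) -> dict:
    """Map a sentence to numeric/boolean properties (assumed feature set)."""
    alpha = sum(c.isalpha() for c in s.text)
    non_alpha = sum((not c.isalpha()) and (not c.isspace()) for c in s.text)
    return {
        "font_size": s.font_size,
        "bold": float(s.bold),
        "upper": float(s.text.isupper()),
        "title": float(s.text.istitle()),
        "alpha_ratio": alpha / max(1, alpha + non_alpha),
        "first_page": float(s.page == 0),
    }


# Assumed, hand-set weights standing in for a trained model.
WEIGHTS = {
    "font_size": 0.1,
    "bold": 1.0,
    "upper": 0.5,
    "title": 0.5,
    "alpha_ratio": 1.0,
    "first_page": 2.0,
}


def score(s: Sentence) -> float:
    """Linear score: weighted sum of the sentence's properties."""
    return sum(WEIGHTS[name] * value for name, value in features(s).items())


def guess_title(sentences: List[Sentence]) -> Optional[str]:
    """Return the text of the highest-scoring non-URL sentence, if any."""
    candidates = [s for s in sentences if not URL_RE.search(s.text)]
    if not candidates:
        return None
    return max(candidates, key=score).text
```

The `first_page` property reflects Josh's suggestion to favour the first page or first few pages when looking for the document title; restricting the candidate list to those pages outright would be an equally simple variant.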
