IMHO a task for GSoC should be non-critical, localized, and not a user interface. A "non-critical" project is one where PDFBOX development can continue without relying on the project's result. A "localized" project is one that can be incorporated into the code base with few changes to the existing code. This limits the effort required to learn about the system into which the work will fit. A "user interface" project implements an interactive window or an API. I have low expectations of students' ability to produce good designs in these areas.
So I looked through JIRA for open issues meeting the above criteria. Since I am not all that familiar with PDFBOX, some of my suggestions may be laughable, and surely I have missed some. Nonetheless, here's what I found:

PDFBOX-553 writing pdf file in Japanese, garbled
PDFBOX-570 Windings font recognition + spacing issue
PDFBOX-605 Better support for Type0 fonts
PDFBOX-678 Support missing Text Rendering Modes when rendering a PDF
PDFBOX-870 PDF-To-IMAGE output is not anti-aliased
PDFBOX-1094 Pattern colorspace support
PDFBOX-1594 Add support for AES256 Encryption (see also PDFBOX-1450 document how to encrypt with AES 256)
PDFBOX-1734 ImageIoUtil.WriteImage doesn't work with tiff images
PDFBOX-1843 Find a way to test PDFToImage

>________________________________
> From: John Hewson <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Wednesday, January 29, 2014 6:38 PM
>Subject: Re: [DISCUSS] GSoC Participation
>
>
>> - an idea which came up some years ago, was to implement a gui-interface to
>> bundle some/all/future tools/features of pdfbox, like printing, rendering,
>> preflight, split, merge etc.
>
>The AWT/Swing PDF viewer could do with rewriting. But does anyone want that?
>Maybe support for JavaFX?
>
>> - a high-level api to create pdfs
>
>I've been thinking about this recently and have come to the conclusion that
>it's really hard to do well.
>
>> - an advanced text extractor with table/column support
>
>The table stuff sounds a lot like Tabula? Do we really not have column
>support? We need that!
>
>I'll throw in some ideas too:
>
>- an interface for OCR engines to plug into the text extraction API. It could
>provide access to extracted images or allow badly encoded fonts to be passed
>to OCR one character or text run at a time.
>
>-
>
>-- John
>
>
>> On 29 Jan 2014, at 03:20, Andreas Lehmkühler <[email protected]> wrote:
>>
>> Hi,
>>
>>> Maruan Sahyoun <[email protected]> wrote on 29 January 2014 at 10:44:
>>>
>>>
>>> Hi
>>>
>>> shall we try to participate in GSoC? Needs a mentor though.
>> That idea already came up from time to time and it didn't work for different
>> reasons.
>>
>> So, to participate we need a mentor and of course at least one good idea to
>> be proposed.
>>
>> I won't act as mentor for different reasons but I'll try to help in the
>> normal manner.
>>
>> IMO an appropriate idea shall not deal with pdf-specific low-level features,
>> like linearization support, as I doubt that any possible student is familiar
>> with the pdf-spec.
>>
>> So possible ideas could be:
>>
>> - an idea which came up some years ago, was to implement a gui-interface to
>> bundle some/all/future tools/features of pdfbox, like printing, rendering,
>> preflight, split, merge etc.
>> - a high-level api to create pdfs
>> - an advanced text extractor with table/column support
>>
>>
>>> BR
>>>
>>> Maruan Sahyoun
>>
>> BR
>> Andreas Lehmkühler
