+1 I think this sounds like a great idea. It would speed up getting those sorts of changes into PDFBox proper.
-----Original Message-----
From: Jukka Zitting [mailto:[email protected]]
Sent: Monday, December 19, 2011 10:12 AM
To: PDFBox Development
Subject: Tika parser in PDFBox

Hi,

As you may have noticed in PDFBOX-1132 [1], I wanted to try pushing the PDF parser in Tika to PDFBox for easier and faster deployment of latest fixes and improvements. It seems to work pretty well, so I'm thinking of making this move permanent. See below for the message I sent to dev@tika about this approach.

The discussion on dev@tika brought up some concerns about how to best maintain consistency across Tika parsers if they're located in upstream parser libraries. The solution I had in mind was granting Tika committers write access to the relevant parts of PDFBox.

My idea is that Tika committers working on the PDFBox-based Tika parser for PDF could commit those changes directly into PDFBox, from where they'd be released as a part of the normal PDFBox releases under the oversight of the PDFBox PMC. Active committers like Michael McCandless who focus more on PDF parsing could even be invited as normal PDFBox committers.

I think such a solution would make it easier to improve the PDF parsing code directly in PDFBox instead of introducing workarounds and other extra code like what's currently been happening in Tika. For example in TIKA-738 Michael extended the PDFTextStripper class with support for annotation handling. Such improvements should ideally have gone into the PDFTextStripper class itself instead of just to the downstream code in Tika.

WDYT? I'm planning to call a vote on extending PDFBox commit access also to Tika committers for this. Please share any concerns or questions so we can discuss and hopefully address them before the vote.

[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting

---------- Forwarded message ----------
From: Jukka Zitting <[email protected]>
Date: Tue, Dec 13, 2011 at 10:42 AM
Subject: Pushing parsers upstream
To: Tika Development <[email protected]>

Hi,

As you know, we see a lot of questions about version mismatches (which POI or PDFBox version should go with this Tika version) and there's a long queue of patches that are waiting for new official releases of our upstream dependencies to become available.

To avoid this issue I propose that we start moving some of our parser implementations to upstream projects. Now with Tika 1.0 out we have stable Parser and Detector interfaces and related APIs that upstream libraries could implement directly without us having to worry about changing Tika code whenever a new version of a parser library becomes available.

This would allow our users to, for example, upgrade directly to a new POI version without waiting for a related Tika release first. Similarly, a new PDF parsing option or improvement could be implemented directly in PDFBox and be usable without any code changes in Tika.

The classloading and OSGi service mechanisms we've added should make such upstream Parser implementations trivially easy to use, and we could still keep the dependencies in tika-parsers as a way to pull in the libraries even if the relevant implementation classes would no longer reside in org.apache.tika.parsers.*.

In addition to some of the GPL libraries for which we've already done this, I recently took the liberty of trying this out also with PDFBox. See PDFBOX-1132 [1] for the issue where I copied the org.apache.tika.pdf implementation to org.apache.pdfbox.tika. It works without problems, so now I'd like to propose that we copy any more recent PDF parser changes to PDFBox and prepare to drop the parser implementation in tika-parsers. Any further PDF parser work should then be done directly in PDFBox.

I haven't yet talked about this with the PDFBox PMC (of which I'm a member), but I suppose we should be able to come up with an arrangement where Tika committers can commit directly to the Tika parser implementation in PDFBox.

It would be cool if we could do the same thing also with POI.

WDYT?

[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting
