cgivre edited a comment on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-997322720
@paul-rogers Thanks for your thorough review. I addressed all your comments (I think) and did the following:

* Added additional unit tests
* Refactored the table list so that tables are not all read into memory unless requested
* Added iterator classes to avoid counters in the batch reader
* Moved metadata collection to a separate class
* Refactored to allow a PDF with no tables to return metadata if requested (with a unit test)
* Added a config option for different extraction algorithms
* Removed extraneous test PDF files
* General code cleanup

I removed all but one of the `System.env` calls, and I'm a little stuck on the remaining one. The reason I added this line is that when querying a PDF with Drill in embedded mode, it opens an additional Java window. This does not occur when running unit tests, which makes debugging difficult. I'm going to keep digging into this, but I was wondering if you could take a look at the rest of the revisions in the meantime? The issue seems to be in either Tabula or PdfBox, which are the underlying libraries that read the PDF file.

Thanks!
