[GitHub] [drill] cgivre edited a comment on pull request #2359: DRILL-8028: Add PDF Format Plugin

GitBox Sat, 18 Dec 2021 19:24:04 -0800


cgivre edited a comment on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-997322720



   @paul-rogers 
   Thanks for your thorough review. I addressed all your comments (I think) and 
made the following changes:
   * Added additional unit tests
   * Refactored the table list so that tables are not read into memory unless 
requested
   * Added iterator classes to avoid counters in the batch reader
   * Moved metadata collection to separate class
   * Refactored to allow a PDF with no tables to return metadata if requested 
(and added a unit test)
   * Added a config option for different extraction algorithms
   * Removed extraneous test PDF files
   * General code cleanup
   
   I removed all but one of the `System.env` calls, and I'm a little stuck on 
the last one. I added that line because querying a PDF with Drill in embedded 
mode opens an additional Java window. This does not happen when running the 
unit tests, which makes it difficult to debug. I'm going to keep digging into 
this, but could you take a look at the rest of the revisions in the meantime? 
The issue seems to be in either Tabula or PDFBox, the underlying libraries 
that read the PDF file. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
