[ https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462067#comment-17462067 ]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
cgivre edited a comment on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-997322720
@paul-rogers
Thanks for the thorough review. I addressed all your comments (I think) and did
the following:
* Added additional unit tests
* Refactored the table list so that tables are not read into memory unless
requested
* Added iterator classes to avoid counters in the batch reader
* Moved metadata collection to a separate class
* Refactored to allow a PDF with no tables to return metadata if requested
(with a unit test)
* Added a config option for different extraction algorithms
* Removed extraneous test PDF files
* General code cleanup
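To illustrate the lazy table loading and the iterator-instead-of-counter idea from the list above, here is a minimal sketch. It is not the code from the PR: `LazyTableIterator`, `pageCount`, and `pageLoader` are hypothetical names, and the page loader stands in for whatever call actually extracts the tables of a single page.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical sketch, not the PR's implementation: walks the tables of a
// document page by page, loading a page's tables only when they are needed.
final class LazyTableIterator<T> implements Iterator<T> {
  private final int pageCount;
  // Assumed hook: returns the tables found on the given zero-based page.
  private final IntFunction<List<T>> pageLoader;
  private int nextPage = 0;
  private Iterator<T> current = Collections.emptyIterator();

  LazyTableIterator(int pageCount, IntFunction<List<T>> pageLoader) {
    this.pageCount = pageCount;
    this.pageLoader = pageLoader;
  }

  @Override
  public boolean hasNext() {
    // Skip pages without tables; stop as soon as one table is available.
    while (!current.hasNext() && nextPage < pageCount) {
      current = pageLoader.apply(nextPage++).iterator();
    }
    return current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException("no more tables");
    }
    return current.next();
  }
}
```

With this shape the reader only asks `hasNext()` and `next()`, so it keeps no page or table index of its own, and pages past the last requested table are never loaded.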
I removed all but one of the `System.env` calls, and I'm a little stuck on the
remaining one. I added that line because querying a PDF with Drill in
embedded mode opens an additional Java window. This does not occur when
running the unit tests, which makes debugging difficult. I'm going to keep
digging into this, but could you take a look at the rest of the revisions in
the meantime? The issue seems to be in either Tabula or PDFBox, the
underlying libraries that read the PDF file. Thanks!
--
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.