[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438893#comment-17438893
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
paul-rogers commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-961327918
@cgivre, @dzamo raise good points. So, what is Drill today? Is it still
primarily used for distributed queries at scale? Or, as a handy desktop tool
for data scientists? Probably both. The problem is, a tool that tries to be
both a dessert topping and a floor wax (let's see how old the readers are with
this one) ends up being good at neither.
One approach, if we had resources, would be to create a Drill Desktop that
is optimized for that case and encourages all kinds of specialized data
connectors. Create an easy way to define those connectors (YAML files created
by specialized web apps?). Ensure Drill has good integration with Jupyter and
the other usual suspects.
Another approach, if we had resources, is the oft-discussed idea of
separating the less-common plugins from the Drill core. Work started on this:
the goal was an extension mechanism that would make it possible. (Today, most
plugins need quite a bit of Drill internal code.)
So, no harm in adding the PDF reader, but I expect usage will be pretty
limited just because, for the folks who need it, configuration will be too
hard. Better would be a Python or Spark job that extracts the data into a CSV
file, which is then queried with Drill. Each step could be debugged easily. I
can't imagine anyone will want to debug their PDF extraction using Drill's
overly generous Java stack traces...
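To make the two-step idea concrete, here is a rough Python sketch of the
extract-to-CSV half. Everything named here is an assumption for illustration:
extract_table() is a hypothetical stub standing in for whatever PDF
table-extraction library a user would pick, the file names are made up, and the
Drill query in the trailing comment assumes the dfs plugin's csvh format
(CSV with a header row) is enabled.

```python
import csv
import tempfile
from pathlib import Path


def extract_table(pdf_path):
    """Hypothetical stand-in for the real PDF extraction step.

    A real job would call a PDF table-extraction library here; this stub
    returns fixed rows (header first) so the sketch runs on its own.
    """
    return [
        ["item", "quantity"],
        ["widget", "3"],
        ["gadget", "5"],
    ]


def write_table_csv(rows, out_path):
    """Write rows (header first) to out_path as CSV and return the path."""
    out_path = Path(out_path)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return out_path


if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    csv_path = write_table_csv(extract_table("report.pdf"), out_dir / "report.csvh")
    # Inspect the intermediate file before involving Drill at all.
    print(csv_path.read_text(encoding="utf-8"))
    # Assumed Drill query, using the dfs plugin's csvh format (header row):
    #   SELECT item, quantity FROM dfs.`<out_dir>/report.csvh`;
```

The point of the split is that the CSV file is a checkpoint: if the numbers
look wrong, open the file and see whether extraction or the query is at fault.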
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.