[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440435#comment-17440435
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
cgivre commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-963117787
> @dzamo, the reference was to an early Saturday Night Live skit.
>
> Just to refocus the discussion, my question is really around
> configuration. When querying CSV, Parquet and the like, it is very clear where
> the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less
> clear: there is some amount of data mining needed to say that it is THIS table
> and not THAT one, that the table spans three pages, etc.
>
> The question is, how does the user specify this? If it were me, I would
> not want to be tinkering with a JSON storage plugin config, watching my query
> fail with a Drill stack trace, or wondering why I got no data. Instead, I'd
> want a tool better suited for the task. Once I had that, if I then wanted to
> run at scale, I'd want to say, "Drill, just use X and consume the data."
>
> So, the question here is: is the JSON storage plugin config an effective
> way for a data scientist to mine PDF, Excel, HTML and other messy sources? I
> don't have an answer, I'm just asking the question.
>
> Again, if we had the ability to have true external plugins, then this
> could easily be an add-on project. Those who know PDF could go hog wild
> creating a good solution. But, we don't, so every specialized plugin has to be
> part of core Drill. Is that good? That's also a question for debate.
@dzamo @paul-rogers, these are all interesting points. I'd like to focus
on the config variables for a moment. The PDF reader has two config variables:
`mergeAllTables` and `tableNumber`. While a user could theoretically
set these globally, they are really intended to be set via the `table()`
function at query time. For reference, the Excel reader has similar
configs which allow a user to select different sheets within an Excel file or
define regions in a file where their data lives.
IMHO, this flexibility is actually very good for the user, because it allows
an administrator to configure global default values that make sense while also
allowing a user to make query-time changes so they can access their data.
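
To make the "global defaults plus query-time overrides" idea concrete, here is a minimal Python sketch that builds a Drill query using the `table()` function. The option names `mergeAllTables` and `tableNumber` come from this comment and may not match the merged plugin. The helper names, the `dfs` storage plugin, the file path, and the default values are all hypothetical, chosen only for illustration.

```python
def format_option(value):
    """Render a Python value as a SQL literal for a table() argument."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Treat everything else as a string and escape embedded single quotes.
    return "'" + str(value).replace("'", "''") + "'"


def pdf_table_query(path, defaults=None, **overrides):
    """Build a SELECT that passes PDF reader options through table().

    `defaults` stands in for values an administrator set globally;
    `overrides` are the query-time changes, which win on conflict.
    """
    options = {**(defaults or {}), **overrides}
    args = ["type => 'pdf'"]
    args += [f"{name} => {format_option(val)}" for name, val in options.items()]
    return f"SELECT * FROM table(dfs.`{path}` ({', '.join(args)}))"


# Hypothetical administrator defaults and a hypothetical file path.
admin_defaults = {"mergeAllTables": False, "tableNumber": 0}
query = pdf_table_query("reports/q3.pdf", admin_defaults, tableNumber=2)
print(query)
# SELECT * FROM table(dfs.`reports/q3.pdf` (type => 'pdf', mergeAllTables => false, tableNumber => 2))
```

The point of the sketch is only the precedence rule: the user changes `tableNumber` for one query without touching the stored plugin config.
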
With respect to the fixed-width plugin
(https://github.com/apache/drill/pull/2282) I actually have a different vision
of how this can be used, and will post comments there.
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.