[jira] [Commented] (DRILL-8028) Add PDF Format Plugin

ASF GitHub Bot (Jira) Mon, 08 Nov 2021 04:50:04 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440435#comment-17440435
 ]


ASF GitHub Bot commented on DRILL-8028:
---------------------------------------

cgivre commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-963117787


   > @dzamo, the reference was to an early Saturday Night Live skit.
   > 
   > Just to refocus the discussion, my question is really around 
configuration. When querying CSV, Parquet and the like, it is very clear where 
the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less 
clear: there is some amount of data mining needed to say that it is THIS table 
and not THAT one, that the table spans three pages, etc.
   > 
   > The question is, how does the user specify this? If it were me, I would 
not want to be tinkering with a JSON storage plugin config, watching my query 
fail with a Drill stack trace, or wondering why I got no data. Instead, I'd 
want a tool better suited for the task. Once I had that, if I then wanted to 
run at scale, I'd want to say, "Drill, just use X and consume the data."
   > 
   > So, the question here is: is the JSON storage plugin config an effective 
way for a data scientist to mine PDF, Excel, HTML and other messy sources? I 
don't have an answer, I'm just asking the question.
   > 
   > Again, if we had the ability to have true external plugins, then this 
could easily be an add-on project. Those who know PDF could go hog wild 
creating a good solution. But, we don't, so every specialized plugin has to be 
part of core Drill. Is that good? That's also a question for debate.
   
   @dzamo @paul-rogers, these are all interesting points. I'd like to focus 
on the config variables for a moment. The PDF reader has two config variables: 
`mergeAllTables` and `tableNumber`. While a user could theoretically set these 
globally, they are really intended to be set via the `table()` function at 
query time. For reference, the Excel reader has similar configs that allow a 
user to select different sheets within an Excel file, or to define the regions 
in a file where their data lives.
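   
   As a rough sketch of what such a query-time override could look like, the 
snippet below renders a Drill query that passes format options through 
`table()`. The helper `drill_table_query`, the `dfs` schema, the file path, 
and the format name `pdf` are assumptions made for illustration; only the two 
option names come from the discussion above.

```python
def drill_table_query(path, fmt, **options):
    """Render a SELECT that sets format options via Drill's table() function.

    Hypothetical helper for illustration; it only builds the SQL text.
    """
    def literal(value):
        # Booleans and numbers are emitted bare; anything else is single-quoted.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"

    params = [f"type => {literal(fmt)}"]
    params += [f"{name} => {literal(val)}" for name, val in options.items()]
    return f"SELECT * FROM table(dfs.`{path}` ({', '.join(params)}))"


# Assumed path and format name; pick the table identified as 2 instead of merging all.
query = drill_table_query(
    "/data/report.pdf", "pdf", mergeAllTables=False, tableNumber=2
)
print(query)
```

   The point of the sketch is that the per-file choice (which table, merged or 
not) travels with the query rather than living in the stored plugin config.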
   
   IMHO, this flexibility is actually very good for the user: an administrator 
can configure sensible global defaults, while a user can still override them 
at query time to get at their data.
   
   With respect to the fixed-width plugin 
(https://github.com/apache/drill/pull/2282), I actually have a different vision 
of how it can be used, and will post comments there.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> Add PDF Format Plugin
> ---------------------
>
>                 Key: DRILL-8028
>                 URL: https://issues.apache.org/jira/browse/DRILL-8028
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Other
>    Affects Versions: 1.19.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.20.0
>
>
> See PR for documentation.  This PR adds the ability to read tables contained 
> in PDF files. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
