paul-rogers commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-962673076


   @dzamo, the reference was to an early Saturday Night Live skit.
   
   Just to refocus the discussion, my question is really about configuration. 
When querying CSV, Parquet, and the like, it is very clear where the data lies. 
When querying Excel, PDF, HTML, Word, or "the web", it is much less clear: some 
amount of data mining is needed to say that it is THIS table and not THAT one, 
that the table spans three pages, and so on.
   
   The question is, how does the user specify this? If it were me, I would not 
want to be tinkering with a JSON storage plugin config, watching my query fail 
with a Drill stack trace, or wondering why I got no data. Instead, I'd want a 
tool better suited for the task. Once I had that, if I then wanted to run at 
scale, I'd want to say, "Drill, just use X and consume the data."
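   For concreteness, this is roughly the kind of thing a user has to hand-edit 
today: an `excel` entry under the `formats` section of a dfs-style storage 
plugin config. A minimal sketch only; the option names below are from memory 
and purely illustrative, not a definitive reference to the plugin's schema.

```json
{
  "type": "file",
  "connection": "file:///",
  "formats": {
    "excel": {
      "type": "excel",
      "extensions": ["xlsx"],
      "sheetName": "Q3 Report",
      "headerRow": 4,
      "lastRow": 250
    }
  }
}
```

   If I recall correctly, the same options can also be passed per query through 
the table() function, which avoids editing the shared config but still means 
embedding extraction details in every query.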
   
   So, the question here is: is a JSON storage plugin config an effective way 
for a data scientist to mine PDF, Excel, HTML, and other messy sources? I don't 
have an answer; I'm just asking the question.
   
   Again, if we had true external plugins, this could easily be an add-on 
project: those who know PDF could go hog wild creating a good solution. But we 
don't, so every specialized plugin has to be part of core Drill. Is that good? 
That's also a question for debate.

