paul-rogers commented on pull request #2359: URL: https://github.com/apache/drill/pull/2359#issuecomment-962673076
@dzamo, the reference was to an early Saturday Night Live skit.

Just to refocus the discussion, my question is really about configuration. When querying CSV, Parquet and the like, it is very clear where the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less clear: there is some amount of data mining needed to say that it is THIS table and not THAT one, that the table spans three pages, and so on. The question is, how does the user specify this?

If it were me, I would not want to be tinkering with a JSON storage plugin config, watching my query fail with a Drill stack trace, or wondering why I got no data. Instead, I'd want a tool better suited to the task. Once I had that, if I then wanted to run at scale, I'd want to say, "Drill, just use X and consume the data." So the question here is: is the JSON storage plugin config an effective way for a data scientist to mine PDF, Excel, HTML and other messy sources? I don't have an answer; I'm just asking the question.

Again, if we had true external plugins, this could easily be an add-on project: those who know PDF could go hog wild creating a good solution. But we don't, so every specialized plugin has to be part of core Drill. Is that good? That's also a question for debate.
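For concreteness, this is roughly the kind of format-plugin JSON a user ends up editing today. The shape follows the Excel format plugin's documented options, but treat the exact option names, values, and defaults below as illustrative rather than authoritative:

```json
{
  "formats": {
    "excel": {
      "type": "excel",
      "extensions": ["xlsx"],
      "sheetName": "Quarterly Results",
      "headerRow": 3,
      "lastRow": 120,
      "firstColumn": 1,
      "lastColumn": 8,
      "allTextMode": false
    }
  }
}
```

Every one of those values encodes a guess about where the table starts and ends, and a wrong guess tends to surface as an empty result or a stack trace rather than useful feedback, which is exactly the workflow concern raised above.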