[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439932#comment-17439932
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-962569638
> The problem is, a tool that tries to be both a dessert topping and a floor
> wax (let's see how old the readers are with this one), ends up being good at
> neither.
@paul-rogers you got me with this idiom, but I like it! The broader topic
is super interesting. If SQLite started adding the features needed to compete
with Oracle Database 21c it would quickly fail at being SQLite. If Linux tried
to be an OS kernel for both TVs and supercomputers it would... continue to
dominate both extremes! There are some twists here!
Pigeonholing formats into small-scale and large-scale is also a tricky
business. For example, we naturally want to declare PDF a desktop format, but
I can easily imagine a conversation like the following.
"Hey Bob, remember that we sent decades of paper archives from the basement
out to that big scanning centre for digitisation? They've come back as
millions of pages of PDFs. Someone just asked me if we can help them find all
invoices containing a particular SKU, and pull out the price on that line. The
ERP system only has the last 10 years loaded into it and they want to go back
further".
"Chuck 'em in HDFS, we'll run a Drill query."
"But PDF is a desktop publishing format, not a big data format! Surely our
big data cluster will want nothing to do with it!?"
"Drill's got a plugin architecture, which has led to people adding support for
all sorts of weird and wonderful formats. Querying PDFs is a dubious business
but we'll know after ~10 lines of SQL if we can do this with Drill or not. If
not, miserable days or weeks of programming with a PDF library await one of our
interns."
--
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.