[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438447#comment-17438447
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
paul-rogers commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-959821275
Cool contribution. I'm not entirely convinced this is something that Drill
should handle. There are too many variables for the very limited controls which
Drill provides. It is likely that this will work for one or two limited use
cases, but not the vast majority of PDF files. Using JSON plugin config files
to specify the mapping is awkward. Probably each file will need its own config,
which is not scalable.
Drill's fundamental design is to run at scale. It is highly unlikely that
someone will use PDF files to store GB of data. If they do, they have problems
bigger than Drill can help them solve. Thus, this kind of plugin works only at
the small scale: one or two files in, say, an embedded Drillbit with JDBC or
SQLine.
A better choice would be to wrap this thing in a script: tinker with the PDF
extraction, using whatever tools are available, to get the right mapping. Then,
wrap that in a script that produces, say, a CSV format to stdout. Drill can
then read that input.
Such an approach enables all manner of ad hoc, small-scale data extraction.
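As a rough sketch of that script idea (not part of the PR): the snippet below takes rows that some PDF extraction tool has already produced, tidies them, and writes CSV to stdout for Drill to read. The function names `clean_rows` and `write_csv` and the sample rows are illustrative assumptions; the actual extraction step is left out because it depends on the tool and on the file.

```python
import csv
import sys


def clean_rows(raw_rows):
    """Strip cell whitespace, drop blank rows, and fit every row to the header width."""
    rows = [[(cell or "").strip() for cell in row] for row in raw_rows]
    rows = [row for row in rows if any(row)]
    if not rows:
        return []
    # Assumption: the first non-blank row is the header and sets the column count.
    width = len(rows[0])
    return [(row + [""] * width)[:width] for row in rows]


def write_csv(raw_rows, out=None):
    """Write the cleaned rows as CSV to `out` (stdout when not given)."""
    writer = csv.writer(out or sys.stdout, lineterminator="\n")
    writer.writerows(clean_rows(raw_rows))


if __name__ == "__main__":
    # Hypothetical sample standing in for the output of a PDF table extractor.
    sample = [[" name ", "qty"], [None, ""], ["bolt", " 4 "], ["nut"]]
    write_csv(sample)
```

The per-file tinkering then lives in the script rather than in a Drill plugin config, and Drill only ever sees ordinary CSV.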
Or, maybe Drill should offer a "desktop edition" that is designed for small,
ad-hoc projects based on local files, with some way to handle all the tinkering
needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds,
Slack posts, and other formats popular with data scientists. Such features would
not normally be part of the massive-scale deployments for which Drill is
designed.
Thoughts?
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.