[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440256#comment-17440256
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-962895343
@paul-rogers, right, okay it's an expressiveness thing here rather than a
scale thing. The expressiveness of Drill SQL ∪ Drill format config JSON falls
well short of that of a general purpose scripting language and for reading
fiddly unstructured data that shortfall might rapidly become uncomfortable.
The format config for this particular plugin looks quite succinct, like the
plugin will either automagically get your data out, or it won't and then you
need to pack up and go and open the interpreter of your favourite scripting
language. Making your resulting script scale to millions of pages, if
that's needed, is left to the student. I quite like the Ray project for Python
myself.
This thread has triggered some thoughts. If we find ourselves starting to
write long essays of JSON in format configs then we should probably be
concerned. If we find ourselves trying to embed a miniature data processing
DSL into format config JSON then we need to stop moving immediately and pray to
the ancestors that we might be shown a path that will return us from the
wilderness. I want to revisit the draft fixed width format plugin with these
ideas in mind. Its config allows setting names and types for columns, but for
other formats we must do this in SQL. I think we should only ever do this in
SQL.
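To make the "names and types belong in SQL" idea concrete, here is a minimal sketch. It assumes a reader that exposes raw fields through a `columns` array, as Drill's headerless text reader does; the helper `typed_select`, the file path and the schema are all hypothetical, and Python is used only to generate the query text.

```python
def typed_select(table, schema):
    """Build a Drill SELECT that names and casts raw fields in SQL.

    schema is a list of (column_name, sql_type) pairs in field order.
    Assumes the reader exposes raw fields as a `columns` array.
    """
    projections = [
        f"CAST(columns[{i}] AS {sql_type}) AS `{name}`"
        for i, (name, sql_type) in enumerate(schema)
    ]
    return "SELECT " + ", ".join(projections) + f" FROM {table}"


# Hypothetical file and schema, for illustration only.
query = typed_select(
    "dfs.`/tmp/readings.csv`",
    [("station_id", "INT"), ("reading", "DOUBLE"), ("taken_on", "DATE")],
)
print(query)
```

With this approach the format config carries no column names or types at all; the query (or a view built from it) is the single place where they are declared.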
I think we can do something on the packaging front. These format plugins
live under contrib/ in the source tree and are compiled to their own jar files.
If we simply change the final tarball-building stage of our Maven build to
give us something like the following on our download page, would we not be in
reasonable shape?
Package|Size|Description
--|--|--
drill-core|300MB|Drill with core storage layer libs only. Use this in a focussed big data environment to query standard formats like Parquet, CSV and JSON in HDFS or object storage with predictable results and performance. Supplement this with individual plugins listed below as needed.
drill-kitchen-sink|1.5GB|Drill core plus all 100+ storage and format plugins. Use this for maximum compatibility. Results and performance may vary across plugins.
drill-storage-jdbc|130KB|Plugin to query systems that provide a JDBC driver using a generic SQL dialect.
drill-format-pdf|90KB|Plugin to query tables scraped from PDF files.
...
P.S. We'd be persisting with a monolithic Git repo containing multiple
"projects" here, but I personally don't mind mono repos.
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.