[
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441538#comment-17441538
]
ASF GitHub Bot commented on DRILL-8028:
---------------------------------------
dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-964848311
> The bigger implication of having a `columns` array vs `field_n` is when a
user starts with `SELECT *` queries. It makes it harder for BI tools to gather
schema metadata and it also is non-standard SQL. Now... querying PDF is also
non-standard SQL... so maybe that's less important. But, it makes the discovery
a little harder IMHO.
@cgivre Oh right, that makes sense. So should we put in support for both
`columns[n]` and `field_n` as widely as possible, with a standardised option
which lets users switch between the two modes? Maybe standardising the naming on
`columns[n]` and `column_n` would be a small saving on cognitive load for users here?
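To make the two modes concrete, here is a rough sketch of how one headerless row might be exposed either way. It is illustrative only and not Drill code: the function names, the `mode` values and the zero-based numbering of `field_n` are all assumptions made for the example.

```python
from typing import Any, Dict, List


def as_columns_array(row: List[Any]) -> Dict[str, Any]:
    """Expose the row as one array-valued field named `columns`."""
    return {"columns": list(row)}


def as_named_fields(row: List[Any], prefix: str = "field") -> Dict[str, Any]:
    """Expose the row as one top-level field per cell: field_0, field_1, ...

    Zero-based numbering is an assumption chosen to line up with columns[n].
    """
    return {f"{prefix}_{i}": value for i, value in enumerate(row)}


def project_row(row: List[Any], mode: str = "columns_array") -> Dict[str, Any]:
    """Hypothetical switch between the two modes (option name is made up)."""
    if mode == "columns_array":
        return as_columns_array(row)
    if mode == "named_fields":
        return as_named_fields(row)
    raise ValueError(f"unknown mode: {mode}")
```

With named fields a `SELECT *` yields a flat, discoverable schema (one column per cell), whereas the array mode yields a single `columns` field that a BI tool has to unpack.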
@paul-rogers with apologies to this PR for saddling it with so much broader
design chat, I wanted to share with you a last set of findings from talking
with others, and finally to ask whether we might go over a couple of questions
with you, away from this PR.
1. I've polled some Drill devs and going after the "long tail" of formats
and storage systems is mostly of interest to them. @vvysotskyi even has an
intriguing idea of a marketplace for these plugins, I guess something like the
Eclipse plugin marketplace.
2. I have developed a conviction that to go after the "long tail" and not
produce a sprawling mess that neither developers nor users want to touch, we
need to try to get strict (to the extent possible) about consistency in how
plugins behave and how they are configured. Today we already are not all that
consistent (e.g. see remarks on `columns[n]` vs `field_n` above, on column
`name` and `type` in fixed width format).
3. Those I've spoken with do also like the idea of splitting our distributed
packages into "core" and "kitchen sink", or something like that, to put us in a
better position to go after the "long tail". It sounds like we're okay with
our existing mono repo containing many plugins but end users should not have to
download the kitchen sink to query e.g. just JSON or Parquet. Drill startup
times will probably be slow for the kitchen sink because the Java class loader
will have a huge amount to scan. And developer testing could get onerous if we
cannot compile only a subset.
4. By chance I saw that BigQuery, which for some reason I've designated in
my mind as kind of the Rolls-Royce of Dremel-family engines even though I
know little about it, can query Google Sheets. So even they entertain some
"small data" formats, although nothing like what we're imagining. Just an
anecdote.
I would love to consult with you on 2 and 3 in a sort of "Very well, if you
_must_ do a distribution of Drill with this long tail of formats, storage
systems and UDFs in it then at least equip yourselves with the following
practices" chat. Perhaps in the upcoming community meetup, otherwise outside
(if it's of any interest on your end of course).
Thanks
James
> Add PDF Format Plugin
> ---------------------
>
> Key: DRILL-8028
> URL: https://issues.apache.org/jira/browse/DRILL-8028
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.19.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.20.0
>
>
> See PR for documentation. This PR adds the ability to read tables contained
> in PDF files.