[jira] [Commented] (DRILL-8028) Add PDF Format Plugin

ASF GitHub Bot (Jira) Sun, 07 Nov 2021 23:55:10 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440256#comment-17440256
 ]


ASF GitHub Bot commented on DRILL-8028:
---------------------------------------

dzamo commented on pull request #2359:
URL: https://github.com/apache/drill/pull/2359#issuecomment-962895343


   @paul-rogers, right, okay it's an expressiveness thing here rather than a 
scale thing.  The expressiveness of Drill SQL ∪ Drill format config JSON falls 
well short of that of a general purpose scripting language and for reading 
fiddly unstructured data that shortfall might rapidly become uncomfortable.  
The format config for this particular plugin looks quite succinct, like the 
plugin will either automagically get your data out, or it won't and then you 
need to pack up and go and open the interpreter of your favourite scripting 
language.  Making your resulting script scale to millions of pages, if 
that's needed, is left to the student.  I quite like the Ray project for Python 
myself.
   
   This thread has triggered some thoughts.  If we find ourselves starting to 
write long essays of JSON in format configs then we should probably be 
concerned.  If we find ourselves trying to embed a miniature data processing 
DSL into format config JSON then we need to stop moving immediately and pray to 
the ancestors that we might be shown a path that will return us from the 
wilderness.  I want to revisit the draft fixed width format plugin with these 
ideas in mind.  Its config allows setting names and types for columns, but for 
other formats we must do this in SQL.  I think we should only ever do this in 
SQL.
   
   I think we can do something on the packaging front.  These format plugins 
live under contrib/ in the source tree and are compiled to their own jar files. 
 If we simply change the final tarball-building stage of our Maven build to 
give us something like the following on our download page, would we not be in 
reasonable shape? 
   
   Package|Size|Description
   --|--|--
   drill-core|300MB|Drill with core storage layer libs only.  Use this in a focussed big data environment to query standard formats like Parquet, CSV and JSON in HDFS or object storage with predictable results and performance.  Supplement this with individual plugins listed below as needed.
   drill-kitchen-sink|1.5GB|Drill core plus all 100+ storage and format plugins.  Use this for maximum compatibility.  Results and performance may vary across plugins.
   drill-storage-jdbc|130KB|Plugin to query systems that provide a JDBC driver using a generic SQL dialect.
   drill-format-pdf|90KB|Plugin to query tables scraped from PDF files.
   ...
   
   P.S. We'd be persisting with a monolithic Git repo containing multiple 
"projects" here, but I personally don't mind mono repos.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> Add PDF Format Plugin
> ---------------------
>
>                 Key: DRILL-8028
>                 URL: https://issues.apache.org/jira/browse/DRILL-8028
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Other
>    Affects Versions: 1.19.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.20.0
>
>
> See PR for documentation.  This PR adds the ability to read tables contained 
> in PDF files. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
