[GitHub] [drill] paul-rogers commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

GitBox Sun, 20 Mar 2022 21:48:53 -0700


paul-rogers commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1073482655



   Another thought on the CSV (and other) readers: since they were originally 
designed for an HDFS-like file system, they may not be ideal when Drill is used 
as a data science tool against "classic" desktop-style files. For one thing, 
there is no need for distribution when reading a 10MB (or 100MB) CSV file. For 
another thing, the compromises made in the days of old with big data don't 
apply to desktop use. (In fact, much of Drill's machinery is overkill when 
Drill is run as a single process.)
   
   Drill 2.0 is an opportunity to reorient Drill away from the fading big data 
space and toward the data science use cases that most PRs now seem to support. 
(It's not that big data itself is gone, it's just that most folks who need that 
kind of scale now run in the cloud where Drill is not common.) As one of many 
examples, REST APIs make no sense at scale, but do make sense for a "small 
data" tool.
   
   Or, have two Drill editions: the old-school "distributed systems" edition 
and the newer "data science" edition. Those who still need Drill to work 
distributed can keep that edition going (along with the big data CSV quirks), 
while the data science folks can fork the data science edition, chuck the 
distributed systems stuff that gets in the way, and focus on things that data 
scientists do (such as reading Excel and PDF files).
   
   A step in that direction would be to create a "data science edition" for 
things like the CSV reader: have it require headers (to avoid the need to use 
the odd `columns` column). Go ahead and sample the first 20 or 100 rows, like 
Pandas does, to infer types. Don't worry about distributed scans. And so on. 
Much of the distributed-systems weirdness can be ignored when running as a 
single process on small files. (Drill was created for the opposite reason: the 
prior small data tools don't work at scale.)
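
   To make the sampling idea concrete, here is a minimal sketch in Python, not 
Drill code. It assumes a header row is mandatory and guesses each column's type 
from the first `sample_rows` data rows. The names `infer_type`, `infer_schema`, 
and `sample_rows` are hypothetical, and the int/float/str ladder is a 
simplification of what a real reader would need.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest of int, float, str that parses every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return str  # nothing to go on, so fall back to text
    for candidate in (int, float):
        try:
            for v in non_empty:
                candidate(v)
            return candidate
        except ValueError:
            continue  # at least one value did not parse; try the next wider type
    return str


def infer_schema(text_stream, sample_rows=100):
    """Read the required header, then infer column types from a bounded sample."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        raise ValueError("CSV file must have a header row")
    sample = list(islice(reader, sample_rows))
    schema = {}
    for i, name in enumerate(header):
        # Short rows simply contribute no value for the missing columns.
        column = [row[i] for row in sample if i < len(row)]
        schema[name] = infer_type(column)
    return schema


if __name__ == "__main__":
    data = "id,price,name\n1,2.5,apple\n2,3,pear\n"
    schema = infer_schema(io.StringIO(data))
    print({name: t.__name__ for name, t in schema.items()})
    # prints {'id': 'int', 'price': 'float', 'name': 'str'}
```

   The catch with any sampling approach is that a value past the sample can 
contradict the inferred type. That is tolerable for a single process reading a 
small file, but much harder to handle in a distributed scan.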
   
   Users at scale will use the "distributed" versions of the readers; those who 
run Drill embedded or on a single server can enable the "desktop" versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@drill.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
