paul-rogers commented on pull request #2485: URL: https://github.com/apache/drill/pull/2485#issuecomment-1073482655
Another thought on the CSV (and other) readers: since they were originally designed for an HDFS-like file system, they may not be ideal when Drill is used as a data science tool against "classic" desktop-style files. For one thing, there is no need for distribution when reading a 10 MB (or 100 MB) CSV file. For another, the compromises made in the early days of big data don't apply to desktop use. (In fact, much of Drill's machinery is overkill when Drill is run as a single process.)

Drill 2.0 is an opportunity to reorient Drill away from the fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone; it's just that most folks who need that kind of scale now run in the cloud, where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.

Or, have two Drill editions: the old-school "distributed systems" edition and the newer "data science" edition. Those who still need Drill to work distributed can keep that edition going (along with the big data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files).

A step in that direction would be to create "data science edition" versions of things like the CSV reader: have it require headers (to avoid the need to use the odd `columns` column). Go ahead and sample the first 20 or 100 rows, like Pandas does, to infer types. Don't worry about distributed scans. And so on. Much of the distributed-systems weirdness can be ignored when running single-process on small files. (Drill was created for the opposite reason: the prior small data tools don't work at scale.) Users at scale will use the "distributed" versions of the readers; those who run Drill embedded or single-server can enable the "desktop" versions.
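To make the "require headers, sample a few rows, infer types" idea concrete, here is a rough sketch in plain Python. It is not Drill code and not how Pandas does it internally; the function names (`infer_type`, `infer_schema`), the `sample_rows` parameter, and the three-type lattice (int, float, str) are all made up for illustration.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'str' that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "str"  # nothing to go on; fall back to text
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue  # this type does not fit; try the next, wider one
    return "str"


def infer_schema(text_stream, sample_rows=100):
    """Read the header plus up to sample_rows rows and guess a type per column."""
    reader = csv.reader(text_stream)
    header = next(reader)  # headers are required in this hypothetical "desktop" mode
    sample = list(islice(reader, sample_rows))
    schema = {}
    for i, name in enumerate(header):
        # Tolerate short rows by skipping missing cells.
        column = [row[i] for row in sample if i < len(row)]
        schema[name] = infer_type(column)
    return schema


# Small in-memory example; a real reader would open a file instead.
data = "id,price,name\n1,2.50,apple\n2,3,pear\n3,,fig\n"
schema = infer_schema(io.StringIO(data), sample_rows=20)
print(schema)
```

The obvious trade-off is that a value past the sampled rows can contradict the guessed type, which is tolerable for a single process reading a small file but is exactly the kind of thing a distributed scan cannot paper over.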