[GitHub] [arrow-datafusion] yjshen commented on pull request #1010: Reorganize table providers by table format

GitBox Thu, 23 Sep 2021 05:50:41 -0700


yjshen commented on pull request #1010:
URL: https://github.com/apache/arrow-datafusion/pull/1010#issuecomment-925786194



   > Frankly speaking, I have never met a case of mixed file formats, so I 
wouldn't really know what is important to take into consideration. Can you 
describe your usecase precisely? Do you have an example of API that supports it?
   
   I may have gotten this wrong, if the original doc was not implying that a 
table can have its data in different file formats. 
   
   I thought data lake implementations may store regular data in a columnar 
format such as Parquet, and deltas (additions or removals) in a row-based 
format like Avro or JSON, which reminds me of SAP HANA 
http://www.vldb.org/pvldb/vol5/p061_jenskrueger_vldb2012.pdf and HyPer 
http://db.in.tum.de/downloads/publications/datablocks.pdf
   If that is not the case for existing implementations and not the intention 
of the PR, please just ignore this one.
   
   > By "reorder", you mean moving them to a separate folder? The physical_plan 
folder is pretty huge, so I thought restructuring it a bit wouldn't harm. 
   
   Yes, I find it hard to tell what has been removed and what has been added 
in the physical_plan files.
   
   I diffed the files manually and found that the `try_new_from_reader` 
functions have been removed from JSON/Avro/CSV. I think the removal deserves an 
explanation, so that nothing slips away silently (I'm not familiar with the use 
cases of these existing readers). 
   
   The removal of `TableDescriptor` is fine, I think, since `ListingTable`'s 
scan method now takes care of listing and partitioning files.
   
   But what I really want to argue is: why should we have to diff manually 
instead of relying on git diff? When I am reading code and the history points 
to a reorganization PR with little information in it, where can I find the 
original PR and the related issue? 
   Do I have to browse through all the discussions in this PR to find another, 
separate PR that contains the removal of the stale code, and then dig again 
into the history of that removed file to finally find the background I need?
   
   If we are following the rule that each PR should target one problem, why 
should we split the reorganization PR into one addition and one stale-code 
removal? And why do we include restructuring the physical_plan module for "nice 
symmetry" while accepting that "git will not recognize it as the previous 
files"?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

