yjshen commented on pull request #1010: URL: https://github.com/apache/arrow-datafusion/pull/1010#issuecomment-925786194
> Frankly speaking, I have never met a case of mixed file formats, so I wouldn't really know what is important to take into consideration. Can you describe your usecase precisely? Do you have an example of API that supports it?

I may have misread the original doc if it does not imply that a table can have its data in different file formats. I thought data lake implementations might store regular data in a columnar format such as Parquet, and deltas (adds or removes) in a row-based format like Avro or JSON, which reminds me of SAP HANA http://www.vldb.org/pvldb/vol5/p061_jenskrueger_vldb2012.pdf and HyPer http://db.in.tum.de/downloads/publications/datablocks.pdf If that is not the case for existing implementations and not the intention of the PR, please just ignore this one.

> By "reorder", you mean moving them to a separate folder? The physical_plan folder is pretty huge, so I thought restructuring it a bit wouldn't harm.

Yes. I find it hard to tell what has been removed and what has been added in the physical_plan files. I diffed the files manually and found that the `try_new_from_reader` functions have been removed from JSON/Avro/CSV. I think the removal deserves an explanation so that nothing slips away silently (I'm not familiar with the use cases of these existing readers). The removal of `TableDescriptor` is fine, I think, since `ListingTable`'s scan method now takes care of listing and partitioning files.

But what I really want to argue is: why should we have to diff manually instead of relying on git diff? When I'm reading the code, how do I find the original PR and related issue if the history points to a reorganization PR with little information about them? Do I have to browse through all the discussions in this PR to find another, separate PR that contains the removal of the stale code, and then dig again into the history of that removed file to finally find the background I need?

If we are following the rule that each PR should target one problem, why should the reorganization be split into one addition and one stale-code removal? And why do we restructure the physical_plan module for "nice symmetry" while accepting that "git will not recognize it as the previous files"?
