I'd like to bounce an idea off everyone. I've noticed that certain bugs show up
when using parquet-cli that don't show up when using ParquetMR as an SDK in a
Java program, even when reading the same files.
There appears to be some code duplicated between the CLI and the rest of the
SDK. One example: changes were made to detect UUIDs as a special case of a
fixed-length byte array, and while all the necessary changes were made in the
SDK, some are missing from the duplicated code in the CLI. We should stop
relying on the duplicated code in the CLI and have the CLI exist ONLY as a thin
wrapper around the SDK. One way to force ourselves to do that might be to
maintain the CLI as a separate project. Of course, I haven't yet identified all
of these inconsistencies, so it may turn out to be easy to just fix the CLI as
it is, but the point is to adopt policies that make it harder to break some
parts of ParquetMR when adding enhancements.
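
To make the thin-wrapper idea concrete, here is a minimal sketch of the shape I
have in mind. Every name in it (SharedValueFormatter, LogicalKind, render,
cliRender) is hypothetical and made up for illustration; none of them are actual
ParquetMR or parquet-cli classes. The sketch assumes a UUID is stored as a
16-byte fixed-length array in big-endian order. The point is only that the CLI
path delegates to the one shared routine, so a special case added there cannot
be missed by the CLI.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical sketch, not real ParquetMR code.
class SharedValueFormatter {

    // Hypothetical stand-in for a column's logical type annotation.
    enum LogicalKind { NONE, UUID_TYPE }

    // The single shared routine (plays the role of the SDK in this sketch).
    // A UUID is treated as a special case of a fixed-length byte array.
    static String render(byte[] fixed, LogicalKind kind) {
        if (kind == LogicalKind.UUID_TYPE && fixed.length == 16) {
            // ByteBuffer reads big-endian by default: first 8 bytes are the
            // most significant half, last 8 bytes the least significant half.
            ByteBuffer buf = ByteBuffer.wrap(fixed);
            long mostSignificant = buf.getLong();
            long leastSignificant = buf.getLong();
            return new UUID(mostSignificant, leastSignificant).toString();
        }
        // Fallback for any other fixed-length byte array: plain hex.
        StringBuilder hex = new StringBuilder();
        for (byte b : fixed) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // The CLI side (hypothetical): no formatting logic of its own, it only
    // delegates to the shared routine above.
    static String cliRender(byte[] fixed, LogicalKind kind) {
        return render(fixed, kind);
    }

    public static void main(String[] args) {
        byte[] value = new byte[16];
        for (int i = 0; i < value.length; i++) {
            value[i] = (byte) i;
        }
        System.out.println(cliRender(value, LogicalKind.UUID_TYPE));
        System.out.println(cliRender(value, LogicalKind.NONE));
    }
}
```

With this layout there is no second copy of the formatting logic for the CLI to
fall behind on, which is the property a separate CLI project would be meant to
enforce.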

Thanks. 
