I am using Spark batch jobs to process and ingest extracts of several RDBMS
tables and file-based systems, arriving at regular intervals, into a data
lake as ORC-backed Hive tables. Since the input data file size, file count,
row count, and feature count vary quite a lot, I am unable to come up with a
single optimal number for coalesce.
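One common workaround (my own sketch, not something from the thread) is to derive the coalesce target from the estimated input size rather than fixing a constant. The function name and the ~128 MB per-file target below are illustrative assumptions:

```python
def target_partitions(input_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Heuristic: pick a partition count aiming for ~target_file_bytes
    per output file (ceiling division, never below 1)."""
    return max(1, -(-input_bytes // target_file_bytes))

# e.g. a 1 GiB extract would be coalesced to 8 files of ~128 MiB each:
print(target_partitions(1024 ** 3))  # prints 8
```

In a Spark job the input size could come from the FileSystem API or the source table's statistics; the drawback, and the reason concatenate is still attractive, is that any such estimate is stale by the time skewed or late-arriving data lands.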
I felt that "alter table ... concatenate" would be an easy way to work
around the small-files issue we are facing on the NameNode.
Sorry about the long story - I ran into this issue earlier today: alter
table concatenate does not work as expected (SPARK-20592
<https://issues.apache.org/jira/browse/SPARK-20592>). After some analysis
of the sql module, I found that the concatenate operation is deliberately
listed under unsupportedHiveNativeCommands in the ANTLR grammar.
Please let me know if you have strong reservations against enabling this.
If not, I can take a stab at it and put up a PR for review.