https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks
It would be great if something like hive.exec.reducers.bytes.per.reducer
could be implemented. One idea: get the total size of all target blocks,
then set the number of partitions from that.
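Something along those lines can be approximated in user code today. Below is a
minimal Scala sketch, assuming the input lives in one HDFS directory; the helper
name and the 256 MB default are illustrative, not an existing Spark setting.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Hypothetical helper: derive a partition count from the total input size,
// analogous to what hive.exec.reducers.bytes.per.reducer does for reducers.
def partitionsForInput(sc: SparkContext, inputDir: String,
                       bytesPerPartition: Long = 256L * 1024 * 1024): Int = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
  math.max(1, math.ceil(totalBytes.toDouble / bytesPerPartition).toInt)
}

// e.g. repartition before writing so each output file is roughly bytesPerPartition:
// val n = partitionsForInput(sc, "/data/input")
// rdd.repartition(n).saveAsTextFile("/data/output")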
Yep, that worked out. Does this solution have any performance implications
beyond all the work (probably) being done on one node?
It depends on the RDD in question where exactly the work will be done. I
believe that if you do a repartition(1) instead of a coalesce, it will force
a shuffle, so the upstream work will still be done distributed and then a
single node will read that shuffled data and write it out.
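To make the difference concrete, here is a minimal sketch of the two calls,
assuming a SchemaRDD named results and an existing Hive table named my_table
(both names are illustrative):

// coalesce(1) avoids a shuffle, so the whole preceding pipeline may
// collapse into a single task on one node.
results.coalesce(1).insertInto("my_table")

// repartition(1) forces a shuffle: the upstream stages still run in
// parallel across the cluster, and only the final write is one task.
results.repartition(1).insertInto("my_table")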
If you want to write to a