RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-08 Thread chutium
See https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks for background. It would be great if something like hive.exec.reducers.bytes.per.reducer could be implemented. One idea: get the total size of all target blocks, then set the number of partitions accordingly.
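
A rough sketch of that idea in Spark 1.1-era Scala (the source path, the table name "target_table", and the 256 MB target are illustrative assumptions, not anything from the thread):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    // Analogous to hive.exec.reducers.bytes.per.reducer: pick a target size
    // per output partition, then derive the partition count from total input size.
    val bytesPerPartition = 256L * 1024 * 1024
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(new Path("/data/source")).getLength
    val numPartitions = math.max(1L, totalBytes / bytesPerPartition).toInt

    // One output file is written per partition, so this bounds the file count.
    val data = sqlContext.parquetFile("/data/source")
    data.coalesce(numPartitions).insertInto("target_table")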

Re: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread DanteSama
Yep, that worked out. Does this solution have any performance implications beyond all the work being done on (probably) one node?
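
A quick way to sanity-check the effect before paying for the insert (the path is an illustrative assumption; when writing, each partition produces one output file):

    val data = sqlContext.parquetFile("/data/source")
    println(data.partitions.length)               // many small partitions -> many files
    println(data.coalesce(1).partitions.length)   // 1 -> a single output file, single write task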

Re: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Michael Armbrust
It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce(1), it will force a shuffle, so the work will be done distributed and then a single node will read the shuffled data and write it out. If you want to write to a ...
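
A minimal sketch of the distinction being described (schemaRdd and "target_table" are placeholder names):

    // coalesce(1) narrows the lineage without a shuffle, so the whole upstream
    // computation can collapse into one task on one node.
    schemaRdd.coalesce(1).insertInto("target_table")

    // repartition(1) forces a shuffle: the upstream work stays distributed, and a
    // single post-shuffle task reads the shuffled data and writes the one file.
    schemaRdd.repartition(1).insertInto("target_table")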

RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Cheng, Hao
> It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce(1), it will force a shuffle, so the work will be done distributed and then a single node will read the shuffled data and write ...