The context you use for calling Spark SQL is available only on the driver.
Also, collect() works because it brings the whole RDD into the driver's
local memory, but it should be used almost exclusively for debugging; if
all your data fits into a single machine's memory you shouldn't be using
Spark at all, but a regular database.
For your problem, if you still want to use Spark SQL, just use threads in
the driver; if instead you want to use parallelize() or foreach, you have
to avoid calling anything (like the SparkSession) that only exists on the
driver.
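
For example, here is a minimal sketch of the threaded approach, assuming
Spark 2.x with a SparkSession named spark; the table names and output
paths below are made up:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical list of Hive tables to convert.
val tables = Seq("table1", "table2", "table3")

// Each Future runs on a driver-side thread, so spark.sql() is safe to
// call here; Spark then schedules the resulting jobs concurrently.
val jobs = tables.map { t =>
  Future {
    spark.sql(s"SELECT * FROM $t")
      .write
      .mode("overwrite")
      .parquet(s"/archive/$t")  // hypothetical output path
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)

You may also want to set spark.scheduler.mode=FAIR so the concurrent jobs
share the cluster instead of queuing up FIFO.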

On 17 July 2017 at 17:46, Fretz Nuson <nuson.fr...@gmail.com> wrote:

> I was getting a NullPointerException when trying to call Spark SQL from
> foreach. After debugging, I found that the Spark session is not available
> on the executors and I could not successfully pass it to them.
>
> What I am doing instead is tablesRDD.collect().foreach(...), and that
> works, but it runs sequentially.
>
> On Mon, Jul 17, 2017 at 5:58 PM, Simon Kitching <
> simon.kitch...@unbelievable-machine.com> wrote:
>
>> Have you tried simply making a list of your tables and then using
>> SparkContext.makeRDD(Seq)? I.e.:
>>
>> val tablenames = List("table1", "table2", "table3", ...)
>> val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
>> tablesRDD.foreach(....)
>>
>> > Am 17.07.2017 um 14:12 schrieb FN <nuson.fr...@gmail.com>:
>> >
>> > Hi,
>> > I am currently trying to parallelize reading multiple tables from Hive.
>> > As part of an archival framework, I need to convert a few hundred tables
>> > that are in text format to Parquet. For now I am calling Spark SQL inside
>> > a for loop for the conversion, but this executes sequentially and the
>> > entire process takes a long time to finish.
>> >
>> > I tried submitting 4 different Spark jobs (each with a set of tables to
>> > be converted), and that did give me some parallelism, but I would like
>> > to do this in a single Spark job due to a few limitations of our cluster
>> > and process.
>> >
>> > Any help will be greatly appreciated.
>> >
>>
>>
>
