In your use case, your deDF need not be a DataFrame. You could use sc.textFile(...).collect(). Even better, since your file is very small, you can just read it off a local file, unless you are planning to use YARN cluster mode. Something like the sketch below.
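For instance (a rough sketch only; dataElementsFile, sc, and hiveContext are the names from your code quoted below, with your query folded into the loop):

    import scala.io.Source

    // The file is small, so read it on the driver; no DataFrame needed.
    // (Assumes a path on the driver's local filesystem; for an HDFS path
    // use sc.textFile(dataElementsFile).collect() instead.)
    val dataElements = Source.fromFile(dataElementsFile)
      .getLines().map(_.trim).filter(_.nonEmpty).toSeq.distinct

    dataElements.foreach { dataElement =>
      val df1 = hiveContext.sql(
        s"SELECT cyc_dt, supplier_proc_i, '$dataElement' as data_elm, " +
        s"$dataElement as data_elm_val FROM TEST_DB.TEST_TABLE1")
      df1.write.insertInto("TEST_DB.TEST_TABLE1")
    }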
On 26 Oct 2016 16:43, "Ajay Chander" <itsche...@gmail.com> wrote:

> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
>> This usage is fine, because you are only using the HiveContext locally
>> on the driver. It's applied in a function that's used on a Scala
>> collection.
>>
>> You can't use the HiveContext or SparkContext in a distributed
>> operation. It has nothing to do with for loops.
>>
>> The fact that they're serializable is misleading. I believe it's there
>> because these objects may be inadvertently referenced in the closure of
>> a function that executes remotely, yet doesn't use the context, and the
>> closure cleaner can't always remove that reference; the task would then
>> fail to serialize even though it doesn't use the context. You will find
>> these objects serialize but then don't work if used remotely.
>>
>> The NPE you see is an unrelated cosmetic problem that was fixed in
>> 2.0.1, IIRC.
>>
>> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I was wondering whether I can use hiveContext inside foreach, like below:
>>>
>>> object Test {
>>>   def main(args: Array[String]): Unit = {
>>>
>>>     val conf = new SparkConf()
>>>     val sc = new SparkContext(conf)
>>>     val hiveContext = new HiveContext(sc)
>>>
>>>     val dataElementsFile = args(0)
>>>     val deDF = hiveContext.read.text(dataElementsFile)
>>>       .toDF("DataElement").coalesce(1).distinct().cache()
>>>
>>>     def calculate(de: Row): Unit = {
>>>       val dataElement = de.getAs[String]("DataElement").trim
>>>       val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" +
>>>         dataElement + "' as data_elm, " + dataElement +
>>>         " as data_elm_val FROM TEST_DB.TEST_TABLE1")
>>>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>>     }
>>>
>>>     deDF.collect().foreach(calculate)
>>>   }
>>> }
>>>
>>> I looked at
>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>> and I see it extends SQLContext, which extends Logging with
>>> Serializable.
>>>
>>> Can anyone tell me if this is the right way to use it? Thanks for your
>>> time.
>>>
>>> Regards,
>>> Ajay
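To make Sean's point above concrete (a contrived sketch, reusing the names from the quoted code; only the first form is safe):

    // Fine: collect() brings the rows back to the driver, so the foreach
    // below is a plain Scala loop and hiveContext is only used locally.
    deDF.collect().foreach(calculate)

    // Not fine: DataFrame.foreach is a distributed operation that runs on
    // the executors; hiveContext may serialize, but it won't work remotely.
    deDF.foreach(calculate)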