Yes Ted, spark.executor.extraClassPath will work if the HBase client jars are
present on all Spark Worker / NodeManager machines.

spark.yarn.dist.files is the easier way, as the HBase client jars are copied
automatically from the driver machine or HDFS into the container / Spark
executor classpath. There is no need to manually copy the HBase client jars
onto every Worker / NodeManager node and point spark.executor.extraClassPath
at them.
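
For reference, here is a rough sketch of how the two approaches look when
launching the shell. The jar paths and names below are only examples; point
them at your actual HBase lib directory:

# Option 1: HBase client jars already installed at the same path on every
# Worker / NodeManager machine (path is illustrative)
MASTER=yarn-client ./spark-shell \
  --conf spark.executor.extraClassPath=/opt/hbase/lib/*

# Option 2: let YARN copy the listed jars from the driver machine (or HDFS)
# into each container's working directory (jar list is illustrative)
MASTER=yarn-client ./spark-shell \
  --conf spark.yarn.dist.files=/opt/hbase/lib/hbase-client.jar,/opt/hbase/lib/hbase-common.jar,/opt/hbase/lib/hbase-protocol.jar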

spark.yarn.dist.files ships the jars from the driver machine or HDFS into the
container / Spark executor working directory, but in hadoop-2.5.1
launch_container.sh does not add the container's $PWD/* to the classpath, so
spark.yarn.dist.files does not work there. It works fine on hadoop-2.7.0,
where $PWD/* is included in the container classpath through a later fix; I am
still searching for the JIRA.
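
On hadoop-2.5.1, a possible workaround (an untested sketch; the jar list is
illustrative) is to combine the two settings: let spark.yarn.dist.files
localize the jars into the container's working directory, and point
spark.executor.extraClassPath at that directory with a relative wildcard so
the jars are picked up even though launch_container.sh omits $PWD/*:

# Untested sketch: distribute the jars and reference them relative to the
# container's working directory ('./*' is quoted so the local shell does not
# expand it)
MASTER=yarn-client ./spark-shell \
  --conf spark.yarn.dist.files=/opt/hbase/lib/hbase-client.jar,/opt/hbase/lib/hbase-common.jar,/opt/hbase/lib/hbase-protocol.jar \
  --conf spark.executor.extraClassPath='./*'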

Thanks,
Prabhu Joseph



On Wed, Feb 10, 2016 at 4:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Have you tried adding the HBase client jars to spark.executor.extraClassPath?
>
> Cheers
>
> On Wed, Feb 10, 2016 at 12:17 AM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> + Spark-Dev
>>
>> For a Spark job on YARN accessing an HBase table, I added all the HBase client
>> jars to spark.yarn.dist.files. The NodeManager, when launching the container
>> (i.e. the executor), does the localization and brings all the HBase client
>> jars into the executor's CWD, but the executor tasks still fail with
>> ClassNotFoundException for the HBase client classes. When I checked
>> launch_container.sh, the classpath does not contain $PWD/*, so all the HBase
>> client jars are ignored.
>>
>> Is spark.yarn.dist.files not meant for adding jars to the executor classpath?
>>
>> Thanks,
>> Prabhu Joseph
>>
>> On Tue, Feb 9, 2016 at 1:42 PM, Prabhu Joseph <prabhujose.ga...@gmail.com
>> > wrote:
>>
>>> Hi All,
>>>
>>> When I do a count on an HBase table from the Spark shell running in
>>> yarn-client mode, the job fails at count().
>>>
>>> MASTER=yarn-client ./spark-shell
>>>
>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>
>>> val conf = HBaseConfiguration.create()
>>> conf.set(TableInputFormat.INPUT_TABLE, "spark")
>>>
>>> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>>>   classOf[org.apache.hadoop.hbase.client.Result])
>>> hBaseRDD.count()
>>>
>>>
>>> The tasks throw the exception below; the actual exception is swallowed
>>> because of a JDK bug (JDK-7172206). After installing the HBase client on all
>>> NodeManager machines, the Spark job ran fine, so I confirmed that the issue
>>> is with the executor classpath.
>>>
>>> But I am looking for some other way of including the HBase jars in the Spark
>>> executor classpath instead of installing the HBase client on all NM machines.
>>> I tried adding all the HBase jars to spark.yarn.dist.files; the NM logs show
>>> that it localized all the HBase jars, but the job still fails. I also tried
>>> spark.executor.extraClassPath, and the job still fails.
>>>
>>> Is there any way we can access HBase from the executors without installing
>>> the hbase-client on all machines?
>>>
>>>
>>> 16/02/09 02:34:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prabhuFS1): *java.lang.IllegalStateException: unread block data*
>>>         at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2428)
>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>>>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>
>>
>
