Re: Guava dependency issue

2018-05-08 Thread Koert Kuipers
we shade guava in our fat jar/assembly jar/application jar
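
for example, with the maven-shade-plugin a relocation roughly like this
(just a sketch -- the plugin version and the relocated package prefix are
illustrative, not taken from our actual build):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- rewrite Guava classes into a private package so the app's
                 Guava cannot collide with the old one Hadoop needs -->
            <pattern>com.google.common</pattern>
            <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>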

On Tue, May 8, 2018 at 12:31 PM, Marcelo Vanzin  wrote:

> Using a custom Guava version with Spark is not that simple. Spark
> shades Guava, but a lot of libraries Spark uses do not - the main one
> being all of the Hadoop ones, and they need a quite old Guava.
>
> So you have two options: shade/relocate Guava in your application, or
> use spark.{driver|executor}.userClassPathFirst.
>
> There really isn't anything easier until we get shaded Hadoop client
> libraries...
>
> On Tue, May 8, 2018 at 8:44 AM, Stephen Boesch  wrote:
> >
> > I downgraded to spark 2.0.1 and it fixed that particular runtime
> exception:
> > but then a similar one appears when saving to parquet:
> >
> > An  SOF question on this was created a month ago and today further details
> > plus an open bounty were added to it:
> >
> > https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c
> >
> > The new but similar exception is shown below:
> >
> > The hack to downgrade to 2.0.1 does help - i.e. execution proceeds further :
> > but then when writing out to parquet the above error does happen.
> >
> > 8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0 (TID
> > 2618)
> > java.lang.NoSuchMethodError:
> > com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
> > at
> > org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
> > at org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
> > at
> > org.apache.parquet.hadoop.CodecFactory$BytesCompressor.<init>(CodecFactory.java:92)
> > at
> > org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
> > at
> > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
> > at
> > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> > at
> > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
> > at
> > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> > at
> > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> > at
> > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> > at
> > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> > at
> > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> > at org.apache.spark.scheduler.Task.run(Task.scala:86)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
> >
> >
> >
> > 2018-05-07 10:30 GMT-07:00 Stephen Boesch :
> >>
> >> I am intermittently running into guava dependency issues across multiple
> >> spark projects.  I have tried maven shade / relocate but it does not
> >> resolve the issues.
> >>
> >> The current project is extremely simple: *no* additional dependencies
> >> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
> >> clean was re-applied).
> >>
> >> Is there a reliable approach to handling the versioning for guava within
> >> spark dependency projects?
> >>
> >>
> >> [INFO]
> >> 
> >> [INFO] Building ccapps_final 1.0-SNAPSHOT
> >> [INFO]
> >> 
> >> [INFO]
> >> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
> >> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
> >> library for your platform... using builtin-java classes where applicable
> >> [WARNING]
> >> java.lang.NoSuchMethodError:
> >> com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
> >> at org.apache.hadoop.security.Groups.(Groups.java:96)
> >> at org.apache.hadoop.security.Groups.(Groups.java:73)
> >> at
> >> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
> >> at
> >> 

Re: Guava dependency issue

2018-05-08 Thread Marcelo Vanzin
Using a custom Guava version with Spark is not that simple. Spark
shades Guava, but a lot of libraries Spark uses do not - the main one
being all of the Hadoop ones, and they need a quite old Guava.
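
(You can confirm which Guava the Hadoop artifacts drag in with, for
example, "mvn dependency:tree -Dincludes=com.google.guava" run against
your project's pom.)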

So you have two options: shade/relocate Guava in your application, or
use spark.{driver|executor}.userClassPathFirst.

There really isn't anything easier until we get shaded Hadoop client
libraries...
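
If you go the userClassPathFirst route it's just two configs at submit
time, something like this (a sketch -- the Guava jar and application names
are placeholders; note both settings are experimental, and the driver-side
one only applies in cluster deploy mode):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars /path/to/guava-23.0.jar \
  --class com.example.MyApp \
  myapp.jar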

On Tue, May 8, 2018 at 8:44 AM, Stephen Boesch  wrote:
>
> I downgraded to spark 2.0.1 and it fixed that particular runtime exception:
> but then a similar one appears when saving to parquet:
>
> An  SOF question on this was created a month ago and today further details
> plus an open bounty were added to it:
>
> https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c
>
> The new but similar exception is shown below:
>
> The hack to downgrade to 2.0.1 does help - i.e. execution proceeds further :
> but then when writing out to parquet the above error does happen.
>
> 8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0 (TID
> 2618)
> java.lang.NoSuchMethodError:
> com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
> at
> org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
> at org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
> at
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.(CodecFactory.java:92)
> at
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
>
>
>
> 2018-05-07 10:30 GMT-07:00 Stephen Boesch :
>>
>> I am intermittently running into guava dependency issues across multiple
>> spark projects.  I have tried maven shade / relocate but it does not resolve
>> the issues.
>>
>> The current project is extremely simple: *no* additional dependencies
>> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
>> clean was re-applied).
>>
>> Is there a reliable approach to handling the versioning for guava within
>> spark dependency projects?
>>
>>
>> [INFO]
>> 
>> [INFO] Building ccapps_final 1.0-SNAPSHOT
>> [INFO]
>> 
>> [INFO]
>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
>> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> [WARNING]
>> java.lang.NoSuchMethodError:
>> com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
>> at org.apache.hadoop.security.Groups.(Groups.java:96)
>> at org.apache.hadoop.security.Groups.(Groups.java:73)
>> at
>> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
>> at
>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
>> at
>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
>> at
>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
>> at
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
>> at
>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
>> at
>> 

Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
I downgraded to spark 2.0.1 and it fixed that *particular* runtime
exception: but then a similar one appears when saving to parquet:

An  SOF question on this was created a month ago and today further details plus
an open bounty were added to it:

https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c

The new but similar exception is shown below:

The hack to downgrade to 2.0.1 does help - i.e. execution proceeds *further* :
but then when writing out to *parquet* the above error does happen.

8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0
(TID 2618)
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.(CodecFactory.java:92)
at 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6



2018-05-07 10:30 GMT-07:00 Stephen Boesch :

> I am intermittently running into guava dependency issues across multiple
> spark projects.  I have tried maven shade / relocate but it does not
> resolve the issues.
>
> The current project is extremely simple: *no* additional dependencies
> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
> clean was re-applied).
>
> Is there a reliable approach to handling the versioning for guava within
> spark dependency projects?
>
>
> [INFO] 
> 
> [INFO] Building ccapps_final 1.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> [WARNING]
> java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
> at org.apache.hadoop.security.Groups.(Groups.java:96)
> at org.apache.hadoop.security.Groups.(Groups.java:73)
> at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
> at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
> at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
> at org.apache.spark.SparkContext.(SparkContext.scala:295)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
> at