unit testing in spark

2016-12-08 Thread pseudo oduesp
Can someone tell me how I can write unit tests for PySpark?
(a book, a tutorial, ...)
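A minimal sketch of a PySpark unit test with the standard library's unittest, assuming a local pyspark installation on the PYTHONPATH (the test name is hypothetical); run it with plain python or pytest:

import unittest
from pyspark import SparkContext

class RddSumTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # one small local context shared by every test in the class
        cls.sc = SparkContext("local[2]", "unit-tests")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_sum(self):
        rdd = self.sc.parallelize([1, 2, 3, 4])
        self.assertEqual(rdd.sum(), 10)

if __name__ == "__main__":
    unittest.main()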


create new spark context from ipython or jupyter

2016-12-07 Thread pseudo oduesp
Hi,
How can we create a new SparkContext from an IPython or Jupyter session? I mean, if I take the current SparkContext and run sc.stop(), how can I launch a new one from IPython without restarting the IPython session by refreshing the browser?

The reason is that I sometimes write a function, realize I forgot something inside it, and I add the module in one of two ways:

   I simply add the module with
  sys.path.append(path_of_module)
 or with sc.addPyFile (if I need it on each worker).

Can someone also explain how I can write unit tests in PySpark? Should I run them locally or on the cluster, and how do I do that in PySpark?

Thank you in advance.
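A minimal sketch of restarting the context inside the same IPython/Jupyter kernel; in Spark 1.x only one SparkContext may be active at a time, so stop the old one first (the module path is a placeholder):

from pyspark import SparkConf, SparkContext

sc.stop()                                   # stop the context the shell created
conf = SparkConf().setAppName("fresh-context")
sc = SparkContext(conf=conf)                # new context in the same kernel
sc.addPyFile("/path/to/my_module.py")       # hypothetical path; ships the module to the workers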


add jars like spark-csv to ipython notebook with pyspark

2016-09-09 Thread pseudo oduesp
Hi ,
How can I add a jar to an IPython notebook?
I tried PYSPARK_SUBMIT_ARGS without success.
thanks
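A hedged sketch for a notebook: PYSPARK_SUBMIT_ARGS is only honoured if it is set before the first SparkContext is created, and when launching from plain Python it needs the trailing "pyspark-shell" token (jar names are the ones from the question; the CSV path is hypothetical):

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar pyspark-shell")
# or resolve from Maven instead of shipping local jars:
# os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "csv-in-notebook")
sqlContext = SQLContext(sc)
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data.csv"))                    # hypothetical file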


pyspark 1.5.0 broadcast join

2016-09-08 Thread pseudo oduesp
Hi,
Can someone show me an example of a broadcast join with DataFrames in PySpark
in this version (1.5.0)?
thanks
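A hedged sketch for 1.5: the optimizer broadcasts any table smaller than spark.sql.autoBroadcastJoinThreshold (in bytes), so raising that threshold is the usual way to force a broadcast join; the explicit broadcast() hint in pyspark.sql.functions only appears in later releases. Table and column names are placeholders:

# allow tables up to ~50 MB to be broadcast
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

small_df = sqlContext.table("dim_table")     # hypothetical small dimension table
large_df = sqlContext.table("fact_table")    # hypothetical large fact table

joined = large_df.join(small_df, large_df["key"] == small_df["key"], "inner")
joined.explain()                             # the plan should show a BroadcastHashJoin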


long lineage

2016-08-16 Thread pseudo oduesp
Hi,
How can we deal with a StackOverflowError triggered by a long lineage?
I mean, I have this error; how can I resolve it without creating a new session?
thanks
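A hedged sketch: the StackOverflowError usually comes from a query plan / lineage that keeps growing across iterative transformations; checkpointing the underlying RDD and rebuilding the DataFrame cuts the lineage without restarting the session. Paths, table and column names are placeholders:

sc.setCheckpointDir("/tmp/spark-checkpoints")        # hypothetical checkpoint dir

df = sqlContext.table("some_table")                  # hypothetical starting point
for i in range(100):
    df = df.withColumn("step_%d" % i, df["value"] * 2)
    if i % 20 == 0:
        rdd = df.rdd
        rdd.checkpoint()                             # truncate the RDD lineage
        rdd.count()                                  # action: forces the checkpoint
        df = sqlContext.createDataFrame(rdd, df.schema)   # fresh, short plan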


java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread pseudo oduesp
Hi,
I create new columns with a UDF, and afterwards I try to filter on those columns.
Why do I get this error?

: java.lang.UnsupportedOperationException: Cannot evaluate expression:
fun_nm(input[0, string, true])
at
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221)
at
org.apache.spark.sql.execution.python.PythonUDF.eval(PythonUDF.scala:27)
at
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:408)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$canFilterOutNull(Optimizer.scala:1234)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$55.apply(Optimizer.scala:1248)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$55.apply(Optimizer.scala:1248)
at
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.org$apache$spark$sql$catalyst$optimizer$EliminateOuterJoin$$buildNewJoinType(Optimizer.scala:1248)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$30.applyOrElse(Optimizer.scala:1264)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$$anonfun$apply$30.applyOrElse(Optimizer.scala:1262)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.apply(Optimizer.scala:1262)
at
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin$.apply(Optimizer.scala:1225)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
at
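A hedged workaround sketch: the optimizer rule EliminateOuterJoin tries to evaluate the Python UDF while rewriting an outer join, which it cannot do. Materializing the UDF column before the join/filter (or filtering before the outer join) usually avoids the rewrite. The UDF, column and frame names are placeholders:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

fun_nm = udf(lambda s: s.strip() if s else None, StringType())   # hypothetical UDF

with_col = df.withColumn("clean", fun_nm(df["raw"]))             # hypothetical columns
with_col.cache().count()                 # materialize before joining/filtering
result = with_col.filter(with_col["clean"].isNotNull())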

zip for pyspark

2016-08-08 Thread pseudo oduesp
Hi,
How can I export a whole PySpark project as a zip from my local session to the
cluster and deploy it with spark-submit? I mean, I have a large project with
all its dependencies, and I want to create a zip containing all of the dependencies and
deploy it on the cluster.
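A hedged sketch: zip the package locally, ship it with --py-files (or sc.addPyFile), and import it normally on the driver and the executors. Names and paths are placeholders:

#   zip -r myproject.zip myproject/                         (build the archive)
#   spark-submit --master yarn --py-files myproject.zip main.py
#
# inside main.py the zipped package is importable everywhere:
from pyspark import SparkContext

sc = SparkContext(appName="packaged-job")
sc.addPyFile("myproject.zip")            # alternative to --py-files, same effect
import myproject.etl as etl              # hypothetical module inside the zip

result = sc.parallelize(range(10)).map(etl.transform).collect()   # hypothetical function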


Re: pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
eam.write(Unknown Source)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:674)
at
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:494)
at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:504)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset by peer: socket write
error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
at java.io.BufferedOutputStream.write(Unknown Source)
at java.io.DataOutputStream.write(Unknown Source)
at java.io.FilterOutputStream.write(Unknown Source)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:674)
at
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:494)
at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:504)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)


Process finished with exit code 1


2016-08-05 15:35 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> HI,
>
> i configured PyCharm as described on Stack Overflow, with SPARK_HOME
> and HADOOP_CONF_DIR, and downloaded winutils to

pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
HI,

I configured PyCharm as described on Stack Overflow, with SPARK_HOME and
HADOOP_CONF_DIR, and downloaded winutils, to use it with the prebuilt version of
Spark 2.0 (pyspark 2.0),

and I get the error below. Can you help me find a solution? Thanks.

C:\Users\AppData\Local\Continuum\Anaconda2\python.exe
C:/workspacecode/pyspark/pyspark/churn/test.py --master local[*]
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/05 15:32:33 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/08/05 15:32:35 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
Traceback (most recent call last):
  File "C:/workspacecode/pyspark/pyspark/churn/test.py", line 11, in

print rdd.first()
  File "C:\spark-2.0.0-bin-hadoop2.6\python\pyspark\rdd.py", line 1328, in
first
rs = self.take(1)
  File "C:\spark-2.0.0-bin-hadoop2.6\python\pyspark\rdd.py", line 1280, in
take
totalParts = self.getNumPartitions()
  File "C:\spark-2.0.0-bin-hadoop2.6\python\pyspark\rdd.py", line 356, in
getNumPartitions
return self._jrdd.partitions().size()
  File
"C:\spark-2.0.0-bin-hadoop2.6\python\lib\py4j-0.10.1-src.zip\py4j\java_gateway.py",
line 933, in __call__
  File
"C:\spark-2.0.0-bin-hadoop2.6\python\lib\py4j-0.10.1-src.zip\py4j\protocol.py",
line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/C:workspacecode/rapexp1412.csv
at
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at
org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:60)
at
org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.T


WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
Hi,
With pyspark 2.0 I get this error:

WindowsError: [Error 2] The system cannot find the file specified

Can someone help me find a solution?
Thanks


Re: WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
C:\Users\AppData\Local\Continuum\Anaconda2\python.exe
C:/workspacecode/pyspark/pyspark/churn/test.py
Traceback (most recent call last):
  File "C:/workspacecode/pyspark/pyspark/churn/test.py", line 5, in 
conf = SparkConf()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\conf.py",
line 104, in __init__
SparkContext._ensure_initialized()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\context.py",
line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File
"C:\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\spark-2.0.0-bin-hadoop2.6\python\pyspark\java_gateway.py",
line 79, in launch_gateway
proc = Popen(command, stdin=PIPE, env=env)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
711, in __init__
errread, errwrite)
  File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
959, in _execute_child
startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (the specified file cannot be found)

Process finished with exit code 1

2016-08-04 16:01 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> hi ,
> with pyspark 2.0  i get this errors
>
> WindowsError: [Error 2] The system cannot find the file specified
>
> someone can help me to find solution
> thanks
>
>


pycharm and pyspark on windows

2016-08-04 Thread pseudo oduesp
Hi,
What is a good configuration for PySpark and PyCharm on Windows?

Thanks
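A hedged sketch of the environment a PyCharm run configuration needs on Windows, set from the script itself; every path below is an assumption to adapt (in particular the py4j zip version shipped with your SPARK_HOME):

import os, sys

os.environ.setdefault("SPARK_HOME", r"C:\spark-2.0.0-bin-hadoop2.6")
os.environ.setdefault("HADOOP_HOME", r"C:\winutils")     # folder containing bin\winutils.exe
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib",
                                "py4j-0.10.1-src.zip"))

from pyspark import SparkContext

sc = SparkContext("local[*]", "pycharm-test")
print(sc.parallelize([1, 2, 3]).sum())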


describe function limit of columns

2016-08-02 Thread pseudo oduesp
Hi,
In Spark 1.5.0 I used the describe function with more than 100 columns.
Can someone tell me whether any limit exists now?

thanks


Re: java.net.UnknownHostException

2016-08-02 Thread pseudo oduesp
Can someone help me, please?

2016-08-01 11:51 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> hi
> i get the following erreors when i try using pyspark 2.0 with ipython   on
> yarn
> somone can help me please .
> java.lang.IllegalArgumentException: java.net.UnknownHostException:
> s001.bigdata.;s003.bigdata;s008bigdata.
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:823)
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:779)
> at
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:133)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:130)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:130)
> at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:367)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> java.net.UnknownHostException:s001.bigdata.;s003.bigdata;s008bigdata.
>
>
> thanks
>


java.net.UnknownHostException

2016-08-01 Thread pseudo oduesp
Hi,
I get the following errors when I try to use pyspark 2.0 with IPython on
YARN.
Can someone help me, please?
java.lang.IllegalArgumentException: java.net.UnknownHostException:
s001.bigdata.;s003.bigdata;s008bigdata.
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:823)
at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:779)
at
org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
at
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:133)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:130)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:130)
at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:367)
at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:500)
at
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by:
java.net.UnknownHostException:s001.bigdata.;s003.bigdata;s008bigdata.


thanks


estimation of necessary time of execution

2016-07-29 Thread pseudo oduesp
Hi,

In Hive we have an awesome function to estimate the execution time before
launching a query.

In Spark, can I find any function to estimate the execution time of a Spark DAG /
lineage?

Thanks


sparse vector to dense vecotor in pyspark

2016-07-26 Thread pseudo oduesp
Hi,
With StandardScaler we get a sparse vector. How can I transform it to a list or
a dense vector without losing the (implicit) sparse values?

thanks
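A minimal sketch: a SparseVector stores its zero entries implicitly, so toArray() / DenseVector reproduce them explicitly and nothing is lost. Column names are placeholders:

from pyspark.mllib.linalg import SparseVector, DenseVector, VectorUDT
from pyspark.sql.functions import udf

sv = SparseVector(5, {0: 1.0, 3: 4.0})        # e.g. a StandardScaler output
as_list = sv.toArray().tolist()               # [1.0, 0.0, 0.0, 4.0, 0.0]
dv = DenseVector(sv.toArray())                # same values, dense storage

# applied to a whole DataFrame column of vectors:
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
df = df.withColumn("dense_features", to_dense(df["scaled_features"]))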


Re: PCA machine learning

2016-07-26 Thread pseudo oduesp
Hi,
I want to add some details.
I am getting the following two vectors; the first one is the features vector:
Row(features=SparseVector(765, {0: 3.0, 1: 1.0, 2: 50.0, 3: 16.0, 5:
88021.0, 6: 88021.0, 8: 1.0, 11: 1.0, 12: 200.0, 14: 200.0, 15: 200.0, 16:
200.0, 17: 2.0, 18: 1.0, 25: 1.0, 26: 2.0, 31: 89200.0, 32: 65.0, 33: 1.0,
34: 89020044.0, 35: 1.0, 36: 1.0, 42: 4.0, 43: 24.0, 44: 2274.0, 45: 54.0,
46: 34.0, 47: 44.0, 48: 2654.0, 49: 2934.0, 50: 84.0, 56: 3404.0, 57: 16.0,
59: 1.0, 70: 1.0, 75: 1.0, 76: 1.0, 77: 1.0, 78: 1.0, 79: 1.0, 80: 1.0, 81:
1.0, 82: 1.0, 83: 1.0, 84: 1.0, 85: 1.0, 86: 1.0, 87: 1.0, 88: 1.0, 89:
1.0, 91: 1.0, 92: 1.0, 93: 1.0, 94: 1.0, 95: 1.0, 96: 1.0, 97: 1.0, 98:
1.0, 99: 1.0, 100: 1.0, 102: 1.0, 137: 1.0, 139: 1.0, 141: 1.0, 150: 1.0,
155: 1.0, 158: 1.0, 160: 1.0, 259: 0.61, 260: 0.61, 261: 0.61, 262: 0.61,
263: 1.0, 264: 0.61, 265: 0.61, 266: 0.61, 267: 0.61, 268: 1.0, 269: 0.61,
270: 0.61, 271: 0.61, 272: 0.61, 273: 1.0, 274: 0.61, 275: 0.61, 276: 0.61,
277: 0.61, 278: 1.0, 281: 916.57, 282: 916.57, 283: 916.57, 284: 865.43,
285: 865.43, 286: 865.43, 287: 816.19, 288: 816.19, 289: 816.19, 290:
760.53, 291: 760.53, 292: 760.53, 293: 874.9, 294: 874.9, 295: 874.9, 296:
963.89, 297: 172.9, 298: 73.64, 299: 1.87, 300: 349.53, 301: 109.95, 302:
116.67, 303: 38.59, 304: 68.28, 305: 2.23, 313: 1.0, 314: 1.0, 315: 1.0,
316: 1.0, 317: 1.0, 318: 1.0, 319: 1.0, 320: 1.0, 321: 1.0, 322: 1.0, 323:
109.95, 324: 172.9, 325: 116.67, 326: 38.59, 327: 2.23, 328: 73.64, 329:
1.87, 330: 349.53, 331: 68.28, 332: 180.46, 333: 933.66, 334: 916.57, 335:
1.0, 336: 1.0, 337: 1.0, 338: 1.0, 339: 1.0, 340: 1.0, 341: 1.0, 342: 1.0,
343: 1.0, 344: 166.231, 345: 323.713, 346: 104.988, 347: 104.988, 348:
34.996, 350: 69.992, 352: 61.243, 353: 166.231, 354: 323.713, 355: 104.988,
356: 104.988, 357: 34.996, 359: 69.992, 361: 61.243, 364: 1.0, 365: 1.0,
366: 1.0, 367: 1.0, 368: 1.0, 369: 1.0, 370: 1.0, 371: 1.0, 372: 1.0, 373:
144.5007, 374: 281.3961, 375: 91.2636, 376: 91.2636, 377: 30.4212, 379:
60.8424, 381: 53.2371, 382: 144.5007, 383: 281.3961, 384: 91.2636, 385:
91.2636, 386: 30.4212, 388: 60.8424, 390: 53.2371, 393: 1.0, 394: 1.0, 395:
1.0, 396: 1.0, 397: 1.0, 398: 1.0, 399: 1.0, 400: 1.0, 401: 1.0, 402:
155.0761, 403: 301.9903, 404: 97.9428, 405: 97.9428, 406: 32.6476, 408:
65.2952, 410: 57.1333, 411: 155.0761, 412: 301.9903, 413: 97.9428, 414:
97.9428, 415: 32.6476, 417: 65.2952, 419: 57.1333, 422: 1.0, 423: 1.0, 424:
1.0, 425: 1.0, 426: 1.0, 427: 1.0, 428: 1.0, 429: 1.0, 430: 1.0, 431:
164.4317, 432: 320.2091, 433: 103.8516, 434: 103.8516, 435: 34.6172, 437:
69.2344, 439: 60.5801, 440: 164.4317, 441: 320.2091, 442: 103.8516, 443:
103.8516, 444: 34.6172, 446: 69.2344, 448: 60.5801, 451: 1.0, 452: 1.0,
453: 1.0, 454: 1.0, 455: 1.0, 456: 1.0, 457: 1.0, 458: 1.0, 459: 1.0, 460:
174.1483, 461: 339.1309, 462: 109.9884, 463: 109.9884, 464: 36.6628, 466:
73.3256, 468: 64.1599, 469: 174.1483, 470: 339.1309, 471: 109.9884, 472:
109.9884, 473: 36.6628, 475: 73.3256, 477: 64.1599, 480: 0.0001, 481:
0.0001, 482: 0.0001, 483: 0.0001, 484: 0.0001, 485: 0.0001, 486: 0.0001,
487: 0.0001, 488: 172.9, 489: 172.9, 490: 172.9, 491: 172.9, 492: 283.4426,
493: 283.4426, 494: 283.4426, 495: 283.4426, 504: 73.64, 505: 73.64, 506:
73.64, 507: 73.64, 508: 1207213.1148, 509: 1207213.1148, 510: 1207213.1148,
511: 1207213.1148, 520: 1.87, 521: 1.87, 522: 1.87, 523: 1.87, 524:
30655.7377, 525: 30655.7377, 526: 30655.7377, 527: 30655.7377, 536: 349.53,
537: 349.53, 538: 349.53, 539: 349.53, 540: 573.0, 541: 573.0, 542: 573.0,
543: 573.0, 552: 116.67, 553: 116.67, 554: 116.67, 555: 116.67, 556:
191.2623, 557: 191.2623, 558: 191.2623, 559: 191.2623, 568: 38.59, 569:
38.59, 570: 38.59, 571: 38.59, 572: 38.59, 573: 38.59, 574: 38.59, 575:
38.59, 584: 180.46, 585: 180.46, 586: 180.46, 587: 180.46, 588: 295.8361,
589: 295.8361, 590: 295.8361, 591: 295.8361, 600: 933.66, 601: 933.66, 602:
933.66, 603: 933.66, 604: 1239250.9834, 605: 1239250.9834, 606:
1239250.9834, 607: 1239250.9834, 643: 170.0, 644: 170.0, 646: 170.0, 648:
170.0, 658: 170.0, 662: 170.0, 665: 170.0, 667: 170.0, 758: 0.224, 763:
0.224}),


and the second one is its projection onto the 20 principal components:


pca_features=DenseVector([89036409.0534, 2986242.0691, 227234.8184,
108796.4282, -129553.463, 89983.1029, 223420.7277, 53740.2034,
-113602.7292, -20057.1001, 33872.3162, -759.2689, 410., -872.6325,
-4896.6554, 4060.5014, -786.3297, -951.3851, 68464.2515, 3850.9394,
876.7108, 98.5793, 21342.2015, 863.9765, 1456.3933, -265.2494, 85325.4192,
-3657.0752, 111.7979, -59.6176, -945.8667, -84.1924, 246.233, -636.8786,
-749.1798, 900.8763, -177.4543, -105.4379, 272.7857, -535.0951]))]


When I created the vector from the original data frame, I kept the order of my
columns, so I can associate each value in the features vector with a variable name.

How can I identify the names of the principal components in the second vector?






2016-07-26 10:39 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> H

PCA machine learning

2016-07-26 Thread pseudo oduesp
Hi,
When I perform PCA dimensionality reduction I get a dense vector whose length is the
number of principal components. My questions:

 - How do I get the names of the features behind these vectors?
 - Are the values inside the resulting vector the projections of all the
features on these components?
 - How should I use it?

thanks
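A hedged sketch: the components themselves carry no feature names; each one is a linear combination of the original features, and the loading matrix (features x components) shows which inputs dominate each component. PCAModel.pc is exposed in the Python API of later Spark releases; "assembler" below is assumed to be the VectorAssembler that built the "features" column:

from pyspark.ml.feature import PCA

pca = PCA(k=20, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
loadings = model.pc.toArray()                 # shape (n_features, k); later releases only

input_cols = assembler.getInputCols()         # feature names, in vector order
for j in range(loadings.shape[1]):
    top = sorted(zip(input_cols, loadings[:, j]), key=lambda t: -abs(t[1]))[:5]
    print("PC%d strongest loadings:" % (j + 1), top)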


Re: add spark-csv jar to ipython notbook without packages flags

2016-07-25 Thread pseudo oduesp
I tried PYSPARK_SUBMIT_ARGS = --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar
without success.

thanks


2016-07-25 13:27 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> Hi ,
>  someone can telle me how i can add jars to ipython  i try spark
>
>
>


add spark-csv jar to ipython notbook without packages flags

2016-07-25 Thread pseudo oduesp
Hi,
Can someone tell me how I can add jars to IPython? I tried spark


spark and plot data

2016-07-21 Thread pseudo oduesp
Hi,
I know Spark is an engine for computing on large data sets, and for me working with
PySpark is a wonderful machine.

My question: we don't have tools for plotting data, so each time we have to
switch back to Python for plotting.
But when you have a large result, for a scatter plot or a ROC curve you can't use
collect() to fetch the data.

Does someone have a suggestion for plotting?

thanks
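A hedged sketch of the usual pattern: aggregate or sample on the cluster and bring only the small result back with toPandas() for matplotlib; column names and the sampling fraction are placeholders:

import matplotlib.pyplot as plt

sample_pdf = (df.select("score", "label")
                .sample(False, 0.01, 42)      # keep it small enough for the driver
                .toPandas())

plt.scatter(sample_pdf["score"], sample_pdf["label"], s=2)
plt.xlabel("score")
plt.ylabel("label")
plt.show()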


RandomForestClassifier

2016-07-20 Thread pseudo oduesp
Hi,
We have the parameters named

labelCol="label"

and featuresCol="features".

When I set these values (label and features), if I train my model on a data
frame that has other columns, does the algorithm use only the label and
features columns?

thanks
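A minimal sketch: the estimator only reads the columns named by labelCol and featuresCol; any other columns are simply carried through. Frame names are placeholders:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(train_df)            # extra columns in train_df are ignored for training
scored = model.transform(test_df)   # extra columns are preserved in the output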


lift coefficient

2016-07-20 Thread pseudo oduesp
Hi,
How can we calculate a lift coefficient from the PySpark prediction results?

Thanks
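A hedged sketch, assuming a fitted spark.ml classifier and a numeric 0/1 label: lift of the top decile = positive rate among the 10% of rows with the highest score, divided by the overall positive rate:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

p1 = F.udf(lambda v: float(v[1]), DoubleType())       # P(label=1) from the probability vector
scored = model.transform(test_df).withColumn("p1", p1("probability"))

overall_rate = scored.agg(F.avg("label")).first()[0]
n_top = int(scored.count() * 0.10)
top_rate = (scored.orderBy(F.desc("p1")).limit(n_top)
                  .agg(F.avg("label")).first()[0])
print("lift at 10%%: %.3f" % (top_rate / overall_rate))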


which one spark ml or spark mllib

2016-07-19 Thread pseudo oduesp
Hi,

I don't really understand why we have two libraries, ML and MLlib.

You can use ml with data frames and mllib with RDDs, but ml has some gaps,
such as saving models, which is the most important thing if you want to build a
web API that serves scores.

My question: why don't we have all the MLlib features in ML?

(I use pyspark 1.5.0 because of enterprise restrictions.)


pyspark 1.5 0 save model ?

2016-07-18 Thread pseudo oduesp
Hi,
How can I save a model under pyspark 1.5.0?
I use RandomForestClassifier().
Thanks in advance.
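A hedged sketch: in 1.5 the RDD-based mllib RandomForestModel supports save/load, while the DataFrame-based ml pipeline models only gained save/load in later releases; the path and training RDD are placeholders:

from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainClassifier(training_rdd, numClasses=2,
                                     categoricalFeaturesInfo={}, numTrees=50)
model.save(sc, "hdfs:///models/rf_v1")                    # hypothetical path
reloaded = RandomForestModel.load(sc, "hdfs:///models/rf_v1")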


Feature importance IN random forest

2016-07-12 Thread pseudo oduesp
Hi,
I use pyspark 1.5.0.
Can I ask how I can get the feature importances for a random forest
algorithm in PySpark? Please give me an example.
Thanks in advance.
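A hedged sketch using the DataFrame-based API, which exposes featureImportances on the fitted model (this attribute may not be wrapped in the 1.5 Python API; it is available in later releases). "assembler" is assumed to be the VectorAssembler that built "features":

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train_df)
importances = model.featureImportances        # one weight per feature, sums to 1

input_cols = assembler.getInputCols()
ranked = sorted(zip(input_cols, importances.toArray()), key=lambda t: -t[1])
for name, score in ranked[:20]:
    print(name, score)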


categoricalFeaturesInfo

2016-07-07 Thread pseudo oduesp
Hi,
How can I use this option in Random Forest?

When I assemble my vector (100 features), 20 categorical features are
included.

If I understand categoricalFeaturesInfo correctly, I should pass the positions of my 20
categorical features inside the vector containing the 100 features, as a map
{position of feature inside the assembled vector: number of categories}. OK,

but how can I find the right position given the VectorAssembler result (the order
changes inside the vector)?

 LabeledPoint(0.1, Vectors.dense(0, 1, 2, ..., 20, ..., 30, ..., 100))


thanks
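A hedged sketch: categoricalFeaturesInfo maps the position of a feature inside the assembled vector to its number of categories, so it has to be built from the same column order the VectorAssembler used; the categorical column names are placeholders (StringIndexer output already has the required 0..k-1 values):

input_cols = assembler.getInputCols()              # order inside the assembled vector
categorical_cols = ["c1_index", "c2_index"]        # hypothetical indexed columns

categorical_features_info = {
    input_cols.index(c): df.select(c).distinct().count()
    for c in categorical_cols
}
# e.g. {17: 5, 42: 12}: feature 17 has 5 categories, feature 42 has 12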


remove row from data frame

2016-07-05 Thread pseudo oduesp
Hi,
How can I remove rows from a data frame that satisfy some condition on some
columns?
thanks
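A minimal sketch: DataFrames are immutable, so "removing rows" means keeping the complement with filter()/where(); the column name and condition are placeholders:

from pyspark.sql import functions as F

cleaned = df.filter(~((F.col("age") < 0) | F.col("age").isNull()))
# equivalently: df.where("age >= 0 AND age IS NOT NULL")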


alter table with hive context

2016-06-26 Thread pseudo oduesp
Hi,
How can I alter a table by adding new columns to it in HiveContext?
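A minimal sketch, assuming a Hive-managed table: ALTER TABLE is plain HiveQL issued through the HiveContext (database, table and column names are placeholders):

hiveContext.sql(
    "ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING, score DOUBLE)")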


add multiple columns

2016-06-26 Thread pseudo oduesp
Hi, how can I add multiple columns to a data frame?

withColumn allows adding one column, but when you have multiple columns, do I
have to loop over each column?

thanks
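A minimal sketch: withColumn adds one column at a time, so adding many is either a reduce/loop over withColumn or a single select with the new expressions appended; the expressions are placeholders:

from functools import reduce
from pyspark.sql import functions as F

new_cols = {"a_plus_b": F.col("a") + F.col("b"),
            "a_times_2": F.col("a") * 2}

# loop form
df2 = reduce(lambda d, kv: d.withColumn(kv[0], kv[1]), new_cols.items(), df)

# single projection form
df3 = df.select("*", *[expr.alias(name) for name, expr in new_cols.items()])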


Re: categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
I want to add some information.
When I created this dict I followed these steps:
 1- I created a list of all my variables:
   a list of doubles, a list of ints, a list of categorical variables
(all the categorical variables are int typed).
 2- I created al = list_double + list_int + list_categ with the command
list(itertools.chain(fdouble, fint, f_index)),
so that I keep the order of the variables; in this order I have all the f_index
columns from position 517 to 824.
But when I create the labeled points I lose this order, and I lose the int type.



2016-06-24 9:40 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> Hi,
> how i can keep type of my variable like int
> because i  get this error when i call random forest algorithm with
>
> model = RandomForest.trainClassifier(rdf,
> numClasses=2,
> categoricalFeaturesInfo=d,
>  numTrees=3,
>  featureSubsetStrategy="auto",
>  impurity='gini',
>  maxDepth=4,
>  maxBins=42)
>
> where d it s dict containing  map of my categoriel feautures (i have
> aleady stirng indexed and transform to int )
>
>
> n error occurred while calling o20271.trainRandomForestModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 29 in stage 1898.0 failed 4 times, most recent failure: Lost task 29.3 in
> stage 1898.0 (TID 748788, prbigdata1s013.bigplay.bigdata.intraxa):
> java.lang.IllegalArgumentException: DecisionTree given invalid data:
> Feature 517 is categorical with values in {0,...,16, but a data point gives
> it value 48940.0.
>   Bad data point:
> (1.0,(825,[0,1,2,4,8,17,19,21,27,31,32,50,52,56,57,75,77,78,79,80,83,89,96,97,98,99,101,103,104,105,108,114,121,122,123,124,126,128,129,130,132,133,134,135,136,138,139,140,141,142,156,157,160,161,163,164,165,166,167,181,182,185,186,187,190,191,202,203,204,205,206,207,208,209,210,213,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,238,245,246,247,248,249,250,251,260,262,263,264,265,266,269,270,271,272,273,275,276,277,278,279,280,281,282,283,284,293,294,295,298,308,309,312,328,350,368,371,379,384,385,388,389,390,391,392,393,394,395,396,397,398,402,403,404,405,406,407,408,409,410,411,412,416,417,418,419,420,421,422,423,424,425,426,428,429,430,431,432,433,434,435,436,437,438,439,440,447,448,449,450,451,452,453,454,455,456,457,460,464,465,466,470,473,477,481,482,483,484,485,486,487,488,489,490,491,492,493,496,497,498,499,500,501,502,503,504,505,506,507,508,511,512,513,514,515,516,517,518,519,520,521,522,523,526,527,528,529,530,531,532,533,534,535,536,537,538,541,542,550,554,556,562,564,565,566,567,568,569,570,571,572,573,574,575,576,644,646,647,648,649,651,654,655,656,657,663,664,666,667,668,669,670,671,672,673,675,677,678,679,680,681,682,683,684,685,687,688,689,690,691,692,693,694,695,696,697,698,699,700,704,709,710,711,712,713,714,715,716,717,718,729,734,735,737,738,739,740,741,742,743,744,745,747,748,749,750,751,752,753,754,755,756,758,760,761,764,765,766,767,768,769,774,776,777,779,780,781,782,783,784,786,787,788,789,790,791,793,794,796,797,798,799,800,801,802,803,804,805,808,809,810,811,814,816,817,824],[10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,200.0,2000.0,2000.0,460.0,305.0,2000.0,2000.0,460.0,305.0,81.76,69.8,31.66,5.28,18.8,162.06,20.6,51.96,27.6,108.74,77.5,66.16,30.0,5.0,17.82,153.62,19.52,49.24,26.18,103.08,1.23456789E9,1.23456789E9,1.23456789E9,0.01,2.0,14.0,14.0,63.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,3.0,1.0,400.0,1.0,1.0,13.0,15.0,19.0,20.0,25.0,1.23456789E9,1.23456789E9,1.23456789E9,3.0,5.0,6.0,7.0,8.0,9.0,10.0,12.0,1.0,13.0,15.0,19.0,20.0,25.0,1.23456789E9,1.23456789E9,1.23456789E9,3.0,5.0,6.0,7.0,8.0,9.0,10.0,12.0,1210.0,8.0,121112.0,130912.0,28.0,1.0,17450.0,1.0,8.0,1.0,1.0,8508.0,8508.0,10550.0,1.0,8889.0,8426.0,8889.0,8426.0,8889.0,8426.0,8889.0,8426.0,2.0,1.0,100.0,100.0,4.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,4.0,5.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,4.0,4.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,4.0,4.0,3.0,4.0,10.0,11.0,13.0,10.0,9.0,7.0,5.0,1.0,1.0,4.0,4.0,3.0,4.0,10.0,11.0,14.0,10.0,9.0,7.0,5.0,4.0,5.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,2.0,1.0,1.0,1.0,15.0,1.0,1.0,38335.0,8815.0,78408.0,44160.0,37187.0,1079.0,51630.0,11873.0,17102.0,11839.0,10126.0,22676.0,7000.0,39303.0,9037.0,81842.0,48036.0,37187.0,1116.0,51630.0,11873.0,17102.0,11839.0,10126.0,22676.0,7000.0,40971.0,9422.0,80086.0,44257.0,37000.0,1064.0,48940.0,11255.0,16212.0,11224.0,9598.0,18600.0,7948.0,40971.0,9422.0,80086.0,44257.0,37000.0,1064.0,48940.0,11255.0,16212.0,11224.0,9598.0,18600.0,79

categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
Hi,
How can I keep the type of my variables as int?
I ask because I get the error below when I call the random forest algorithm with

model = RandomForest.trainClassifier(rdf,
                                     numClasses=2,
                                     categoricalFeaturesInfo=d,
                                     numTrees=3,
                                     featureSubsetStrategy="auto",
                                     impurity='gini',
                                     maxDepth=4,
                                     maxBins=42)

where d is a dict containing the map of my categorical features (I have already
string-indexed them and transformed them to int).


n error occurred while calling o20271.trainRandomForestModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
29 in stage 1898.0 failed 4 times, most recent failure: Lost task 29.3 in
stage 1898.0 (TID 748788, prbigdata1s013.bigplay.bigdata.intraxa):
java.lang.IllegalArgumentException: DecisionTree given invalid data:
Feature 517 is categorical with values in {0,...,16, but a data point gives
it value 48940.0.
  Bad data point:

categoricalFeaturesInfo

2016-06-23 Thread pseudo oduesp
Hi,
I am a PySpark user and I want to test the RandomForest algorithms.

I found the parameter categoricalFeaturesInfo. How can I build it from a list
of categorical variables?
Thanks.


feature importance or variable importance

2016-06-21 Thread pseudo oduesp
Hi,

I am a PySpark user and I want to extract the variable importances from a random
forest model, in order to plot them.

How can I do that?

thanks


Labeledpoint

2016-06-21 Thread pseudo oduesp
Hi,
I am a PySpark user and I want to test RandomForest.

I have a data frame with 100 columns.
Should I give an RDD or a data frame to the algorithm? I transformed my data frame
into only two columns,
the label and features columns:

 df.label df.features
  0(517,(0,1,2,333,56 ...
   1   (517,(0,11,0,33,6 ...
0   (517,(0,1,0,33,8 ...

but I have no idea how to turn this data frame into the input the algorithm expects.
I tried the example on the official web page without success.

Please give me an example of how to proceed, especially with a test set.

thanks
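A hedged sketch: the RDD-based RandomForest expects an RDD of LabeledPoint, so map each Row of the two-column (label, features) DataFrame into one, then split into train and test sets:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

lp_rdd = (df.select("label", "features")
            .rdd.map(lambda row: LabeledPoint(row.label, row.features)))

train_rdd, test_rdd = lp_rdd.randomSplit([0.7, 0.3], seed=42)
model = RandomForest.trainClassifier(train_rdd, numClasses=2,
                                     categoricalFeaturesInfo={}, numTrees=50)

predictions = model.predict(test_rdd.map(lambda lp: lp.features))
labels_and_preds = test_rdd.map(lambda lp: lp.label).zip(predictions)
accuracy = (labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count()
            / float(test_rdd.count()))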


)

2016-06-21 Thread pseudo oduesp
Hi,
Please help me resolve this issue.




cast only some columns

2016-06-21 Thread pseudo oduesp
Hi,
With fillna we can select some columns on which to replace values, choosing the
columns with a dict
{column: value},
but how can I do the same with cast? I have a data frame with 300 columns and I
want to cast just 4 of them, but with a select query like this:

df.select(col("columns1").cast("int"), col("columns2").cast("int"), col("columns3").cast("int"))

I lose the other columns.
How can I cast some columns without losing the other ones?

thanks
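A minimal sketch: cast in place with withColumn, which replaces the named column and keeps all the others:

from pyspark.sql import functions as F

for c in ["columns1", "columns2", "columns3", "columns4"]:
    df = df.withColumn(c, F.col(c).cast("int"))
# all 300 columns are still present; only the four listed are now integers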


read.parquet or read.load

2016-06-21 Thread pseudo oduesp
Hi,

Really, I am frustrated with Parquet files. Each time I get an error like
"Could not read footer: java.lang.RuntimeException:"
or an error occurring while calling o127.load.

Why do we have so many issues with this format?

thanks


Unable to acquire bytes of memory

2016-06-20 Thread pseudo oduesp
Hi,
I have no idea why I get this error:



Py4JJavaError: An error occurred while calling o69143.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 8 in stage 10645.0 failed 4 times, most recent failure: Lost
task 8.3 in stage 10645.0 (TID 536592,
prssnbd1s006.bigplay.bigdata.intraxa): java.io.IOException: Unable to
acquire 67108864 bytes of memory
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
at org.apache.spark.sql.execution.TungstenSort.org
$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
at
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at

plot important variables in pyspark

2016-06-19 Thread pseudo oduesp
Hi,

How can I get the score for each row from a classification algorithm, and how can
I plot the feature importances of the variables, like scikit-learn does?

Thanks.


binding two data frame

2016-06-17 Thread pseudo oduesp
Hi,
In R we have functions named cbind and rbind for data frames.

How can I reproduce these functions in PySpark?

df1.col1 df1.col2 df1.col3
 df2.col1 df2.col2 df2.col3


Final result: a new data frame
df1.col1 df1.col2 df1.col3   df2.col1 df2.col2 df2.col3

thanks
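A hedged sketch: rbind is unionAll (union in Spark 2.x; both frames need the same schema). There is no direct cbind; a common workaround is to index the rows of both frames with zipWithIndex and join on that index (assumes equal row counts and distinct column names):

from pyspark.sql import Row

# rbind
stacked = df1.unionAll(df2)

# cbind
def with_index(df):
    return df.rdd.zipWithIndex().map(
        lambda ri: Row(_row_id=ri[1], **ri[0].asDict()))

df1_i = sqlContext.createDataFrame(with_index(df1))
df2_i = sqlContext.createDataFrame(with_index(df2))
side_by_side = df1_i.join(df2_i, "_row_id").drop("_row_id")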


update data frame inside function

2016-06-17 Thread pseudo oduesp
Hi,
How can I update a data frame inside a function?

Why?

I have to apply StringIndexer multiple times. I tried a Pipeline, but
it is still extremely slow:
for 84 columns to string-index, each with 10 modalities, on a data frame
with 21 million rows,
I need 15 hours of processing.

Now I want to try them one by one to see the difference. If you have another
suggestion, you are welcome to share it.

thanks
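A minimal sketch: DataFrames are immutable, so a function cannot update one in place; have it return the transformed frame and rebind the name (the column list is a placeholder):

from pyspark.ml.feature import StringIndexer

def index_column(df, col):
    indexer = StringIndexer(inputCol=col, outputCol=col + "_index")
    return indexer.fit(df).transform(df)

for c in columns_to_index:          # hypothetical list of the 84 columns
    df = index_column(df, c)        # rebind: df now carries the new column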


Stringindexers on multiple columns >1000

2016-06-17 Thread pseudo oduesp
Hi,
I want to apply string indexers on multiple columns, but when I use
StringIndexer and a Pipeline it takes a long time.

indexer = StringIndexer(inputCol="Feature1", outputCol="indexed1")

This is practical for one, two or ten columns, but when you have more
than 1000 columns, how do you do it?

thanks


difference between dataframe and dataframewriter

2016-06-16 Thread pseudo oduesp
hi,

What is the difference between a DataFrame and a DataFrameWriter?


Re: advise please

2016-06-16 Thread pseudo oduesp
Hi,
I use pyspark 1.5.0 on a YARN cluster with 19 nodes, with 200 GB
and 4 cores each (including the driver).

2016-06-16 15:42 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>:

> Hi ,
> who i can dummies large set of columns with  STRINGindexer fast ?
> becasue i tested with 89 values and eache one had 10 max distinct  values
> and that take
> lot of time
> thanks
>


advise please

2016-06-16 Thread pseudo oduesp
Hi,
How can I quickly dummy-encode a large set of columns with StringIndexer?
I ask because I tested it with 89 columns, each with at most 10 distinct values,
and it takes
a lot of time.
thanks


cache dataframe

2016-06-16 Thread pseudo oduesp
Hi,
If I cache a data frame and then transform it and add columns, should I cache it a
second time?

df.cache()

  transformations
  add new columns

df.cache()
?


String indexer

2016-06-16 Thread pseudo oduesp
Hi,
What is the limit on the number of modalities in StringIndexer?

If I have a column with 1000 modalities, is it good to use StringIndexer,
or should I try another function, and if so which one?

thanks


StringIndexer

2016-06-16 Thread pseudo oduesp
Hi,
I have a data frame with 1000 columns to dummy-encode with StringIndexer.
When I apply the pipeline it takes a long time, and then I want to merge the result
with the rest of the data frame,

I mean: the original data frame + the columns indexed by the StringIndexers.

The problem: the save stage is long. Why?

Code:

 indexers = [StringIndexer(inputCol=i, outputCol=i + "_index").fit(df)
             for i in l]
 li = [i + "_index" for i in l]
 pipeline = Pipeline(stages=indexers)
 df_r = pipeline.fit(df).transform(df)
 df_r = df_r.repartition(500)
 df_r.persist()
 df_r.write.parquet(paths)    # write is a property, not a method


vectors inside columns

2016-06-15 Thread pseudo oduesp
Hi,
I want to ask a question about dense or sparse vectors:

imagine I have a data frame with several columns, and one of them contains vectors.

My question: can I give this column to machine learning algorithms as a single
input value?

df.col1 | df.col2 |
1 | (1,[2],[3] ,[] ...[6])
2 | (1,[5],[3] ,[] ...[9])
3 | (1,[5],[3] ,[] ...[10])

thanks


MatchError: StringType

2016-06-14 Thread pseudo oduesp
Hello,

why do I get this error

when using

assembler = VectorAssembler(inputCols=l_CDMVT,
                            outputCol="aev" + "CODEM")
output = assembler.transform(df_aev)

where l_CDMVT is a list of columns?

Thanks


data frame or RDD for machine learning

2016-06-09 Thread pseudo oduesp
Hi,
Since Spark 1.3 we have DataFrames (thank goodness), instead of only RDDs.

For the machine learning algorithms, should I give them an RDD or a DataFrame?

I mean, when I build a model:

   model = algorithm(rdd)
or
   model = algorithm(df)



If you have an example with a data frame, I would prefer to work with that.

thanks .


oozie and spark on yarn

2016-06-08 Thread pseudo oduesp
Hi,

I want to ask if someone has used Oozie with Spark.

If you can, give me an example:
how can we configure it on YARN?
Thanks


np.unique and collect

2016-06-03 Thread pseudo oduesp
Hi,
Why does np.unique return an array instead of a list in this function?

def unique_item_df(df, list_var):
    # collect() returns a list of Row objects; np.unique flattens them and
    # returns a numpy array
    l = df.select(list_var).distinct().collect()
    return np.unique(l)


df is a data frame and list_var is a list of variables.

(pyspark code)
Thanks.


hivecontext and date format

2016-06-01 Thread pseudo oduesp
Hi,
Can I ask how we can convert a string like dd/mm/ to a date type in
HiveContext?

I tried unix_timestamp and date format functions, but I get null.
Thank you.
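A hedged sketch, assuming the strings look like dd/MM/yyyy: parse them with unix_timestamp (note the capital MM for month; a lowercase mm means minutes and silently yields null) and cast the result back to a date. The column name is a placeholder:

from pyspark.sql import functions as F

df = df.withColumn(
    "dt",
    F.from_unixtime(F.unix_timestamp(F.col("date_str"), "dd/MM/yyyy")).cast("date"))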


equivalent between a SQL join and a data frame join

2016-05-30 Thread pseudo oduesp
Hi guys,
Are these two things equivalent:

sqlContext.sql("select * from t1 join t2 on condition") and

df1.join(df2, condition, 'inner') ?

PS: df1.registerTempTable('t1')
PS: df2.registerTempTable('t2')
thanks


never understand

2016-05-25 Thread pseudo oduesp
Hi guys,
- I get these errors with pyspark 1.5.0 under Cloudera CDH 5.5 (YARN).

- I use YARN to deploy the job on the cluster.
- I use HiveContext and Parquet files to save my data.
- The container limit is 16 GB.
- The executor memory I tested before is 12 GB.
- I tried to increase the number of partitions (by default it is 200); I
multiplied it by 2 and by 3 without success.

- I tried to change the number of SQL shuffle partitions.


- I notice in the Spark UI that when the shuffle write is triggered there is no
problem, but when the shuffle read is triggered I lose executors and get errors.



I am really blocked by this error; where does it come from?




 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread
Thread[Executor task launch worker-5,5,main]
java.lang.OutOfMemoryError: Java heap space
at
parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at parquet.column.values.dictionary.IntList.(IntList.java:86)
at
parquet.column.values.dictionary.DictionaryValuesWriter.(DictionaryValuesWriter.java:93)
at
parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.(DictionaryValuesWriter.java:229)
at
parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
at
parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
at
parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
at parquet.column.impl.ColumnWriterV1.(ColumnWriterV1.java:84)
at
parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
at
parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:207)
at
parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
 at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:405)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:107)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:97)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:100)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:326)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/25 09:54:42 ERROR util.SparkUncaughtExceptionHandler: Uncaught
exception in thread Thread[Executor task launch worker-6,5,main]
java.lang.OutOfMemoryError: Java heap space
at
parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at parquet.column.values.dictionary.IntList.(IntList.java:86)
at
parquet.column.values.dictionary.DictionaryValuesWriter.(DictionaryValuesWriter.java:93)
at
parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.(DictionaryValuesWriter.java:229)
at
parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
at

origin of error

2016-05-15 Thread pseudo oduesp
Can someone help me with this issue?



py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 69 in stage 19.0 failed 4 times, most recent failure: Lost
task 69.3 in stage 19.0 (TID 3788, prssnbd1s003.bigplay.bigdata.intraxa):
ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
... 28 more


Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread pseudo oduesp
Hi guys,
I get this error after 5 hours of processing. I do a lot of joins, 14 left
joins with small tables.



I watched the Spark UI and the console log and everything was OK, but when it
saves the last join I get this error:

Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
not a Parquet file (too small)

I use 4 containers of 26 GB each and 8 cores. I increased the number of partitions
and I used a broadcast join without success. I have the log file, but it is large (57
MB) so I can't share it with you.

I use pyspark 1.5.0 on Cloudera 5.5.1 with YARN, and I use
HiveContext for dealing with the data.


multiple tables for join

2016-03-24 Thread pseudo oduesp
Hi, I have spent two months of my time trying to make 10 joins with the following tables:

   1 GB   table 1
   3 GB   table 2
   500 MB table 3
   400 MB table 4
   20 MB  table 5
   100 MB table 6
   30 MB  table 7
   40 MB  table 8
   700 MB table 9
   800 MB table 10

I use hivecontext.sql("select * from table1 left join table2 on
c1 == c2") and so on.

Table 1 has 2000 columns,
table 2 has 18500 columns,
table 3 has 10 columns,
and all the other tables are under 10 columns.

Please help me.