Spark Streaming job breaks with "shuffle file not found"
The Spark Streaming job runs for a few days and then fails as below. What is the possible reason?

*18/03/25 07:58:37 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 80018.0 failed 4 times, most recent failure: Lost task 16.3 in stage 80018.0 (TID 7318859, 10.196.155.153): java.io.FileNotFoundException: /data/hadoop_tmp/nm-local-dir/usercache/mqq/appcache/application_1521712903594_6152/blockmgr-7aa2fb13-25d8-4145-a704-7861adfae4ec/22/shuffle_40009_16_0.data.574b45e8-bafd-437d-8fbf-deb6e3a1d001 (No such file or directory)*

Thanks!
Hoping you will give our product a wonderful name
We have built an ML platform based on open-source frameworks like Hadoop, Spark, and TensorFlow. Now we need to give our product a wonderful name, and we are eager for everyone's advice. Any answers will be greatly appreciated. Thanks.
How can I split a Dataset into multiple Datasets?
val schema = StructType(
  Seq(
    StructField("app", StringType, nullable = true),
    StructField("server", StringType, nullable = true),
    StructField("file", StringType, nullable = true),
    StructField("...", StringType, nullable = true)
  )
)
val row = ...
val dataset = session.createDataFrame(row, schema)

How can I split this dataset into an array (or map) of datasets keyed by the composite key (app, server, file), i.e. something like Map<(app, server, file), Dataset>?
Thanks.
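A minimal sketch of one way to do this, assuming a Spark 2.x SparkSession and a small number of distinct (app, server, file) keys. The column names come from the schema above; the sample data, the "value" column, and the helper name splitByKey are illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitByKey {
  // Collect the distinct composite keys to the driver, then build one
  // filtered DataFrame per key. Each entry re-filters the source dataset,
  // so this is only practical when the number of distinct keys is small.
  def splitByKey(df: DataFrame): Map[(String, String, String), DataFrame] = {
    df.select("app", "server", "file").distinct().collect()
      .map(r => (r.getString(0), r.getString(1), r.getString(2)))
      .map { case key @ (app, server, file) =>
        key -> df.filter(df("app") === app && df("server") === server && df("file") === file)
      }.toMap
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder.master("local[1]").appName("split").getOrCreate()
    import session.implicits._
    val dataset = Seq(
      ("app1", "server1", "file1", "v1"),
      ("app1", "server1", "file1", "v2"),
      ("app2", "server2", "file2", "v3")
    ).toDF("app", "server", "file", "value")
    val parts = splitByKey(dataset)
    println(parts.size)                                  // 2 distinct keys
    println(parts(("app1", "server1", "file1")).count()) // 2 rows for that key
    session.stop()
  }
}

If the number of keys is large, partitioning the output on write (e.g. df.write.partitionBy("app", "server", "file")) usually scales better than materializing one DataFrame per key.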
Can I move TFS and TFSFT out of the Spark assembly?
I have built spark-assembly-1.6.0-hadoop2.5.1.jar.

The contents of META-INF/services/org.apache.hadoop.fs.FileSystem inside the jar are:
...
org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.hdfs.web.HftpFileSystem
org.apache.hadoop.hdfs.web.HsftpFileSystem
org.apache.hadoop.hdfs.web.WebHdfsFileSystem
org.apache.hadoop.hdfs.web.SWebHdfsFileSystem
tachyon.hadoop.TFS
tachyon.hadoop.TFSFT

Can I move TFS and TFSFT out of spark-assembly-1.6.0-hadoop2.5.1.jar? How do I modify it before building?
Thanks.
Re: Why spark.sql.autoBroadcastJoinThreshold not available
Solved it by removing the "lazy" keyword:
2. HiveContext.sql("cache table feature as select * from src where ...") whose result size is only 100K
Thanks!

2017-05-15 21:26 GMT+08:00 Yong Zhang:
> You should post the execution plan here, so we can provide more accurate
> support.
>
> Since in your feature table you are building it with a projection ("where
> ..."), my guess is that the following JIRA (SPARK-13383
> <https://issues.apache.org/jira/browse/SPARK-13383>) stops the broadcast
> join. This is fixed in Spark 2.x. Can you try it on Spark 2.0?
>
> Yong
>
> ------------------------------
> *From:* Jone Zhang
> *Sent:* Wednesday, May 10, 2017 7:10 AM
> *To:* user@spark
> *Subject:* Why spark.sql.autoBroadcastJoinThreshold not available
>
> Now I use Spark 1.6.0 in Java.
> I wish the following SQL to be executed as a BroadcastJoin:
> *select * from sample join feature*
>
> These are my steps:
> 1. set spark.sql.autoBroadcastJoinThreshold=100M
> 2. HiveContext.sql("cache lazy table feature as select * from src where ...") whose result size is only 100K
> 3. HiveContext.sql("select * from sample join feature")
> Why is the join a SortMergeJoin?
>
> Grateful for any idea!
> Thanks.
How can I merge multiple rows into one row in Spark SQL or Hive SQL?
For example:

Data1 (has 1 billion records)
user_id1 feature1
user_id1 feature2

Data2 (has 1 billion records)
user_id1 feature3

Data3 (has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100

I want to get the result as follows:
user_id1 feature1 feature2 feature3 feature4 feature5 ... feature100

Is there a more efficient way than join?
Thanks!
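One join-free approach, sketched below under the assumption that each table has the same (user_id, feature) shape: union all the tables and aggregate with collect_list, which needs only one wide shuffle instead of a multi-way join. The table names and sample rows are toy stand-ins for Data1/Data2/Data3; note that collect_list does not guarantee feature order.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object MergeFeatures {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("merge-features").getOrCreate()
    import spark.implicits._

    // Toy stand-ins; the real tables would be read from Hive.
    val data1 = Seq(("user_id1", "feature1"), ("user_id1", "feature2")).toDF("user_id", "feature")
    val data2 = Seq(("user_id1", "feature3")).toDF("user_id", "feature")
    val data3 = Seq(("user_id1", "feature4")).toDF("user_id", "feature")

    // Union, then group once per user_id: one shuffle instead of several joins.
    val merged = data1.union(data2).union(data3)
      .groupBy("user_id")
      .agg(concat_ws(" ", collect_list($"feature")).as("features"))

    merged.show(false)
    spark.stop()
  }
}

The HiveQL equivalent would be roughly: select user_id, concat_ws(' ', collect_list(feature)) from (... union all ...) t group by user_id.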
Why spark.sql.autoBroadcastJoinThreshold not available
Now I use Spark 1.6.0 in Java.
I wish the following SQL to be executed as a BroadcastJoin:
*select * from sample join feature*

These are my steps:
1. set spark.sql.autoBroadcastJoinThreshold=100M
2. HiveContext.sql("cache lazy table feature as select * from src where ...") whose result size is only 100K
3. HiveContext.sql("select * from sample join feature")
Why is the join a SortMergeJoin?

Grateful for any idea!
Thanks.
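Two things worth checking, sketched below in a Spark 2.x local-mode example (names and sample tables are illustrative). First, spark.sql.autoBroadcastJoinThreshold is a size in bytes (e.g. 104857600), so a value like "100M" may not take effect on older versions. Second, an explicit broadcast() hint forces the plan regardless of the threshold, and explain() shows which join was chosen:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("broadcast-check")
      // The threshold is a byte count, not a string like "100M".
      .config("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
      .getOrCreate()
    import spark.implicits._

    val sample = Seq((1, "a"), (2, "b")).toDF("id", "s")
    val feature = Seq((1, "f1")).toDF("id", "f")

    // broadcast() hints the planner directly, independent of the threshold
    // or of the statistics available for a cached table.
    val joined = sample.join(broadcast(feature), "id")
    // The physical plan should contain BroadcastHashJoin, not SortMergeJoin.
    joined.explain()
    spark.stop()
  }
}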
org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated
*When I use Spark SQL, the error is as follows:*

17/05/05 15:58:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 4080, 10.196.143.233): java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:224)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2558)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2569)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:365)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:654)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
        at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:321)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
        at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class tachyon.hadoop.TFS
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
        ... 45 more

17/05/05 15:58:44 INFO cluster.YarnClusterScheduler: Removed TaskSet 20.0, whose tasks have all completed, from pool

I did not use Tachyon directly.
*Grateful for any idea!*
Why do Chinese characters appear garbled when I use Spark textFile?
var textFile = sc.textFile("xxx");
textFile.first();
res1: String = 1.0 100733314 18_?:100733314 8919173c6d49abfab02853458247e5841:129:18_?:1.0

hadoop fs -cat xxx
1.0 100733314 18_百度输入法:100733314 8919173c6d49abfab02853458247e5841:129:18_百度输入法:1.0

Why do the Chinese characters appear garbled when I use Spark textFile? The encoding of the HDFS file is UTF-8.
Thanks
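Since hadoop fs -cat renders the UTF-8 bytes correctly, one likely culprit is the spark-shell JVM's default charset rather than the file itself. A small pure-JVM sketch (no Spark needed; the sample string is taken from the output above) of how '?' characters arise when a charset that cannot represent Chinese is involved:

import java.nio.charset.Charset

object CharsetDemo {
  def main(args: Array[String]): Unit = {
    // If this is not UTF-8 in your spark-shell JVM, printing a correctly
    // decoded String can still render Chinese characters as '?'.
    println(Charset.defaultCharset().name())

    // Encoding through a charset that cannot represent Chinese replaces
    // each unmappable character with a '?' byte:
    val s = "18_百度输入法"
    val garbled = new String(s.getBytes(Charset.forName("US-ASCII")), "US-ASCII")
    println(garbled) // 18_?????
  }
}

If the default charset turns out to be the problem, passing -Dfile.encoding=UTF-8 to the driver JVM (e.g. via spark-submit --driver-java-options) is one thing to try. If instead the file were actually in another encoding such as GBK, you would need to read the raw bytes (sc.hadoopFile with Text) and decode them with the right charset yourself, since textFile always decodes as UTF-8.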
Is there a length limit for Spark SQL / Hive SQL?
Is there a length limit for Spark SQL / Hive SQL statements? Can ANTLR work well if the SQL is very long?
Thanks.
Can I display a message on the console when using Spark on YARN?
I submit Spark with "spark-submit --master yarn-cluster --deploy-mode cluster".
How can I display a message on the YARN client console? I expect it to be like this:

16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:12:58 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:03 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:08 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state: RUNNING)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state: FINISHED)
16/10/20 17:13:13 main INFO org.apache.spark.deploy.yarn.Client>SPK>
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.51.215.100
         ApplicationMaster RPC port: 0
         queue: root.default
         start time: 1476954698645
         final status: SUCCEEDED
         tracking URL: http://10.179.20.47:8080/proxy/application_1453970859007_481440/history/application_1453970859007_481440/1
         user: mqq
===Spark Task Result is ===
===some message want to display===
16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK> Shutdown hook called
16/10/20 17:13:13 Thread-3 INFO org.apache.spark.util.ShutdownHookManager>SPK> Deleting directory /data/home/spark_tmp/spark-5b9f434b-5837-46e6-9625-c4656b86af9e

Thanks.
Re: High virtual memory consumption on spark-submit client.
No, I have set master to yarn-cluster.

While SparkPi is running, the result of free -t is as follows:

[running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
             total       used       free     shared    buffers     cached
Mem:      32740732   32105684     635048          0     683332   28863456
-/+ buffers/cache:   2558896   30181836
Swap:      2088952      60320    2028632
Total:    34829684   32166004    2663680

After SparkPi succeeds, the result is as follows:

[running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
             total       used       free     shared    buffers     cached
Mem:      32740732   31614452    1126280          0     683624   28863096
-/+ buffers/cache:   2067732   30673000
Swap:      2088952      60320    2028632
Total:    34829684   31674772    3154912

Mich Talebzadeh <mich.talebza...@gmail.com> wrote on 13 May 2016, 14:47:

> Is this a standalone setup on a single host, where the executor runs inside the driver?
> Also run
>
> free -t
>
> to see the virtual memory usage, which is basically swap space:
>
> free -t
>              total       used       free     shared    buffers     cached
> Mem:      24546308   24268760     277548          0    1088236   15168668
> -/+ buffers/cache:   8011856   16534452
> Swap:      2031608        304    2031304
> Total:    26577916   24269064    2308852
>
> Dr Mich Talebzadeh
> LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 13 May 2016 at 07:36, Jone Zhang <joyoungzh...@gmail.com> wrote:
>
>> mich, Do you want this
>> ==
>> [running]mqq@10.205.3.29:/data/home/hive/conf$ ps aux | grep SparkPi
>> mqq 20070 3.6 0.8 10445048 267028 pts/16 Sl+ 13:09 0:11 /data/home/jdk/bin/java -Dlog4j.configuration=file:///data/home/spark/conf/log4j.properties -cp /data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*:/data/home/spark/conf/:/data/home/spark/lib/spark-assembly-1.4.1-hadoop2.5.1_150903.jar:/data/home/spark/lib/datanucleus-api-jdo-3.2.6.jar:/data/home/spark/lib/datanucleus-core-3.2.10.jar:/data/home/spark/lib/datanucleus-rdbms-3.2.9.jar:/data/home/hadoop/conf/:/data/home/hadoop/conf/:/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/* -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master yarn-cluster --class org.apache.spark.examples.SparkPi --queue spark --num-executors 4 /data/home/spark/lib/spark-examples-1.4.1-hadoop2.5.1.jar 1
>> mqq 22410 0.0 0.0 110600 1004 pts/8 S+ 13:14 0:00 grep SparkPi
>> [running]mqq@10.205.3.29:/data/home/hive/conf$ top -p 20070
>> top - 13:14:48 up 504 days, 19:17, 19 users, load average: 1.41, 1.10, 0.99
>> Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 18.1%us, 2.7%sy, 0.0%ni, 74.4%id, 4.5%wa, 0.0%hi, 0.2%si, 0.0%st
>> Mem: 32740732k total, 31606288k used, 113k free, 475908k buffers
>> Swap: 2088952k total, 61076k used, 2027876k free, 27594452k cached
>>   PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
>> 20070 mqq    20  0 10.0g 260m  32m S  0.0  0.8 0:11.38 java
>> ==
>> Harsh, physical cpu cores is 1, virtual cpu cores is 4.
>> Thanks.
>>
>> 2016-05-13 13:08 GMT+08:00, Harsh J <ha...@cloudera.com>:
>>> How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq
>>>
>>> You can also confirm the above by running the pmap utility on your process;
>>> most of the virtual memory would be under 'anon'.
>>>
>>> On Fri, 13 May 2016 09:11 jone, <joyoungzh...@gmail.com> wrote:
>>>
>>>> The virtual memory is 9G when I run org.apache.spark.examples.SparkPi
>>>> in yarn-cluster mode, using default configurations.
>>>>  PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
>>>> 4519 mqq   20  0 9041m 248m 26m S  0.3  0.8 0:19.85 java
>>>> I am curious why it is so high?
>>>>
>>>> Thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: High virtual memory consumption on spark-submit client.
mich, Do you want this
==
[running]mqq@10.205.3.29:/data/home/hive/conf$ ps aux | grep SparkPi
mqq 20070 3.6 0.8 10445048 267028 pts/16 Sl+ 13:09 0:11 /data/home/jdk/bin/java -Dlog4j.configuration=file:///data/home/spark/conf/log4j.properties -cp /data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*:/data/home/spark/conf/:/data/home/spark/lib/spark-assembly-1.4.1-hadoop2.5.1_150903.jar:/data/home/spark/lib/datanucleus-api-jdo-3.2.6.jar:/data/home/spark/lib/datanucleus-core-3.2.10.jar:/data/home/spark/lib/datanucleus-rdbms-3.2.9.jar:/data/home/hadoop/conf/:/data/home/hadoop/conf/:/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/* -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master yarn-cluster --class org.apache.spark.examples.SparkPi --queue spark --num-executors 4 /data/home/spark/lib/spark-examples-1.4.1-hadoop2.5.1.jar 1
mqq 22410 0.0 0.0 110600 1004 pts/8 S+ 13:14 0:00 grep SparkPi
[running]mqq@10.205.3.29:/data/home/hive/conf$ top -p 20070
top - 13:14:48 up 504 days, 19:17, 19 users, load average: 1.41, 1.10, 0.99
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 18.1%us, 2.7%sy, 0.0%ni, 74.4%id, 4.5%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32740732k total, 31606288k used, 113k free, 475908k buffers
Swap: 2088952k total, 61076k used, 2027876k free, 27594452k cached
  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20070 mqq    20  0 10.0g 260m  32m S  0.0  0.8 0:11.38 java
==
Harsh, physical cpu cores is 1, virtual cpu cores is 4.
Thanks.

2016-05-13 13:08 GMT+08:00, Harsh J:
> How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq
>
> You can also confirm the above by running the pmap utility on your process;
> most of the virtual memory would be under 'anon'.
>
> On Fri, 13 May 2016 09:11 jone, wrote:
>
>> The virtual memory is 9G when I run org.apache.spark.examples.SparkPi
>> in yarn-cluster mode, using default configurations.
>>  PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
>> 4519 mqq   20  0 9041m 248m 26m S  0.3  0.8 0:19.85 java
>> I am curious why it is so high?
>>
>> Thanks.
High virtual memory consumption on spark-submit client.
The virtual memory is 9G when I run org.apache.spark.examples.SparkPi in yarn-cluster mode, using default configurations.

 PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
4519 mqq   20  0 9041m 248m 26m S  0.3  0.8 0:19.85 java

I am curious why it is so high?
Thanks.