Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com

Hi,

FYI as follows. Could you post your heap size settings as well as your Spark app 
code?

Regards
Arthur

3.1.3 Detail Message: Requested array size exceeds VM limit

“The detail message Requested array size exceeds VM limit indicates that the 
application (or APIs used by that application) attempted to allocate an array 
that is larger than the heap size. For example, if an application attempts to 
allocate an array of 512MB but the maximum heap size is 256MB then 
OutOfMemoryError will be thrown with the reason Requested array size exceeds VM 
limit. In most cases the problem is either a configuration issue (heap size too 
small), or a bug that results in an application attempting to create a huge 
array, for example, when the number of elements in the array is computed using 
an algorithm that computes an incorrect size.”




On 2 Nov, 2014, at 12:25 pm, Bharath Ravi Kumar reachb...@gmail.com wrote:

 Resurfacing the thread. Surely an OOM shouldn't be the norm for a common groupBy / sort 
 use case in a framework that is leading sorting benchmarks? Or is there 
 something fundamentally wrong in the usage?
 
 On 02-Nov-2014 1:06 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
 Hi,
 
 I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of 
 count ~ 100 million. The data size is 20GB and groupBy results in an RDD of 
 1061 keys with values being Iterable<Tuple4<String, Integer, Double, 
 String>>. The job runs on 3 hosts in a standalone setup with each host's 
 executor having 100G RAM and 24 cores dedicated to it. While the groupBy 
 stage completes successfully with ~24GB of shuffle write, the saveAsTextFile 
 fails after repeated retries with each attempt failing due to an out of 
 memory error [1]. I understand that a few partitions may be overloaded as a 
 result of the groupBy and I've tried the following config combinations 
 unsuccessfully:
 
 1) Repartition the initial rdd (44 input partitions but 1061 keys) across 
 1061 partitions and have max cores = 3 so that each key is a logical 
 partition (though many partitions will end up on very few hosts), and each 
 host likely runs saveAsTextFile on a single key at a time due to max cores = 
 3 with 3 hosts in the cluster. The level of parallelism is unspecified.
 
 2) Leave max cores unspecified, set the level of parallelism to 72, and leave 
 number of partitions unspecified (in which case the # input partitions was 
 used, which is 44)
 Since I do not intend to cache RDDs, I have set 
 spark.storage.memoryFraction=0.2 in both cases.
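 
 For reference, a minimal sketch of the pipeline and the variant (1) settings 
 described above (hedged: the HDFS paths and the key function are hypothetical 
 placeholders, and spark-shell with an existing sc is assumed):
 
 // Hedged sketch of the pipeline under discussion; extractKey and the paths are hypothetical.
 // spark.storage.memoryFraction=0.2 is assumed to be set as described above.
 def extractKey(line: String): String = line.split('\t')(0)   // hypothetical key, ~1061 distinct values
 
 val records = sc.textFile("hdfs:///data/input")      // ~100M records, ~20GB, 44 input partitions
 val grouped = records.groupBy(extractKey _, 1061)    // variant (1): one partition per key
 grouped.saveAsTextFile("hdfs:///data/output")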
 
 My understanding is that if each host is processing a single logical 
 partition to saveAsTextFile and is reading from other hosts to write out the 
 RDD, it is unlikely that it would run out of memory. My interpretation of the 
 spark tuning guide is that the degree of parallelism has little impact in 
 case (1) above since max cores = number of hosts. Can someone explain why 
 there are still OOMs with 100G being available? On a related note, 
 intuitively (though I haven't read the source), it appears that an entire 
 key-value pair needn't fit into memory of a single host for saveAsTextFile 
 since a single shuffle read from a remote can be written to HDFS before the 
 next remote read is carried out. This way, not all data needs to be collected 
 at the same time. 
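 
 As an aside on [1]: the trace shows the OOM inside Tuple2.toString while 
 saveAsTextFile turns each (key, Iterable) element into a single text line, so 
 an entire group is materialised as one String. A hedged sketch (reusing 
 grouped from the sketch above) that flattens groups into one line per value 
 before writing, so no single output line has to hold a whole group:
 
 // Hedged sketch: emit one line per value instead of one line per (key, Iterable) pair.
 // The tab-separated output format and the path are assumptions.
 grouped.flatMap { case (key, values) =>
   values.map(value => key + "\t" + value)
 }.saveAsTextFile("hdfs:///data/output-flat")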
 
 Lastly, if an OOM is (but shouldn't be) a common occurrence (as per the 
 tuning guide and even as per Datastax's Spark introduction), there may need 
 to be more documentation around the internals of Spark to help users make 
 better-informed tuning decisions about parallelism, max cores, number of 
 partitions and other tunables. Is there any ongoing effort on that front?
 
 Thanks,
 Bharath
 
 
 [1] OOM stack trace and logs
 14/11/01 12:26:37 WARN TaskSetManager: Lost task 61.0 in stage 36.0 (TID 
 1264, proc1.foo.bar.com): java.lang.OutOfMemoryError: Requested array size 
 exceeds VM limit
 java.util.Arrays.copyOf(Arrays.java:3326)
 
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 java.lang.StringBuilder.append(StringBuilder.java:136)
 scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
 scala.Tuple2.toString(Tuple2.scala:22)
 
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
 
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 

Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi,

My Hive is 0.13.1; how can I make Spark 1.1.0 run on Hive 0.13? Please advise.

Or is there any news about when Spark 1.1.0 on Hive 0.13.1 will be available?

Regards
Arthur
 




Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi,

Thanks for your update. Any idea when Spark 1.2 will be GA?

Regards
Arthur


On 29 Oct, 2014, at 8:22 pm, Cheng Lian lian.cs@gmail.com wrote:

 Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and 
 related PRs are already merged or being merged to the master branch.
 
 On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
 Hi,
 
 My Hive is 0.13.1; how can I make Spark 1.1.0 run on Hive 0.13? Please advise.
 
 Or is there any news about when Spark 1.1.0 on Hive 0.13.1 will be available?
 
 Regards
 Arthur
  
 
 





Re: Spark/HIVE Insert Into values Error

2014-10-26 Thread arthur.hk.c...@gmail.com
Hi,

I have already found a way to do “INSERT INTO HIVE_TABLE VALUES (…)”.

Regards
Arthur

On 18 Oct, 2014, at 10:09 pm, Cheng Lian lian.cs@gmail.com wrote:

 Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO 
 ... VALUES ... syntax.
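 
 A commonly used workaround on Hive 0.12 is to route the literals through a 
 SELECT, since INSERT INTO TABLE ... SELECT is supported there. A minimal 
 sketch (hedged: the one-row helper table named dual is hypothetical and must 
 already exist):
 
 hive> INSERT INTO TABLE students SELECT 'fred flintstone', 35, 1 FROM dual LIMIT 1;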
 
 On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
 Hi,
 
 When trying to insert records into HIVE, I got error,
 
 My Spark is 1.1.0 and Hive 0.12.0
 
 Any idea what would be wrong?
 Regards
 Arthur
 
 
 
 hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
 OK
 
 hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
 NoViableAltException(26@[])
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
  at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
  at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' 
 ''fred flintstone'' in select clause
 
 
 



Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-24 Thread arthur.hk.c...@gmail.com
)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 7.3 in stage 5.0 (TID 575, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 24.2 in stage 5.0 (TID 560, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 22.2 in stage 5.0 (TID 561, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 20.2 in stage 5.0 (TID 564, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 13.2 in stage 5.0 (TID 562, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 27.2 in stage 5.0 (TID 565, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 34.2 in stage 5.0 (TID 568, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have 
all completed, from pool 

Regards
Arthur


On 24 Oct, 2014, at 6:56 am, Michael Armbrust mich...@databricks.com wrote:

 Can you show the DDL for the table?  It looks like the SerDe might be saying 
 it will produce a decimal type but is actually producing a string.
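 
 If the underlying columns do turn out to be strings, one hedged isolation 
 step (not confirmed as the root cause or the fix here) is to cast them 
 explicitly so the arithmetic runs on DOUBLE rather than on whatever type the 
 SerDe reports, e.g.:
 
 // Hedged sketch: explicit casts on the columns used in the arithmetic.
 sqlContext.sql("""
   select l_orderkey,
          sum(cast(l_extendedprice as double) * (1 - cast(l_discount as double))) as revenue
   from lineitem
   group by l_orderkey
   order by revenue desc
   limit 10""").collect().foreach(println)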
 
 On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi
 
 My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both 
 Hive 0.12.0 and Spark 1.1.0; HiveQL works while Spark SQL fails. 
 
 
 hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, 
 o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 
 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = 
 o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' 
 group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, 
 o_orderdate limit 10;
 Ended Job = job_1414067367860_0011
 MapReduce Jobs Launched: 
 Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.0 sec   HDFS Read: 261 HDFS 
 Write: 96 SUCCESS
 Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.88 sec   HDFS Read: 458 HDFS 
 Write: 0 SUCCESS
 Total MapReduce CPU Time Spent: 2 seconds 880 msec
 OK
 Time taken: 38.771 seconds
 
 
 scala> sqlContext.sql("select l_orderkey, 
 sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority 
 from customer c join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey 
 = o.o_custkey join lineitem l on l.l_orderkey = o.o_orderkey where 
 o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' group by l_orderkey, 
 o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 
 10").collect().foreach

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, 

Added “l_linestatus” and it works, THANK YOU!!

sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey, 
l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by 
L_LINESTATUS limit 10").collect().foreach(println);
14/10/25 07:03:24 INFO DAGScheduler: Stage 12 (takeOrdered at 
basicOperators.scala:171) finished in 54.358 s
14/10/25 07:03:24 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks 
have all completed, from pool 
14/10/25 07:03:24 INFO SparkContext: Job finished: takeOrdered at 
basicOperators.scala:171, took 54.374629175 s
[F,71769288,2,-5859884,13,1993-12-13,R,F]
[F,71769319,4,-4098165,19,1992-10-12,R,F]
[F,71769288,3,2903707,44,1994-10-08,R,F]
[F,71769285,2,-741439,42,1994-04-22,R,F]
[F,71769313,5,-1276467,12,1992-08-15,R,F]
[F,71769314,7,-5595080,13,1992-03-28,A,F]
[F,71769316,1,-1766622,16,1993-12-05,R,F]
[F,71769287,2,-767340,50,1993-06-21,A,F]
[F,71769317,2,665847,15,1992-05-03,A,F]
[F,71769286,1,-5667701,15,1994-04-17,A,F]


Regards 
Arthur




On 24 Oct, 2014, at 2:58 pm, Akhil Das ak...@sigmoidanalytics.com wrote:

 Not sure if this would help, but make sure you have the column 
 l_linestatus in the data.
 
 Thanks
 Best Regards
 
 On Thu, Oct 23, 2014 at 5:59 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 I got java.lang.NullPointerException. Please help!
 
 
 sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
 l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 
 10").collect().foreach(println);
 
 2014-10-23 08:20:12,024 INFO  [sparkDriver-akka.actor.default-dispatcher-31] 
 scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at 
 basicOperators.scala:136) finished in 0.086 s
 2014-10-23 08:20:12,024 INFO  [Result resolver thread-1] 
 scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 
 41.0, whose tasks have all completed, from pool 
 2014-10-23 08:20:12,024 INFO  [main] spark.SparkContext 
 (Logging.scala:logInfo(59)) - Job finished: runJob at 
 basicOperators.scala:136, took 0.090129332 s
 [9001,6,-4584121,17,1997-01-04,N,O]
 [9002,1,-2818574,23,1996-02-16,N,O]
 [9002,2,-2449102,21,1993-12-12,A,F]
 [9002,3,-5810699,26,1994-04-06,A,F]
 [9002,4,-489283,18,1994-11-11,R,F]
 [9002,5,2169683,15,1997-09-14,N,O]
 [9002,6,2405081,4,1992-08-03,R,F]
 [9002,7,3835341,40,1998-04-28,N,O]
 [9003,1,1900071,4,1994-05-05,R,F]
 [9004,1,-2614665,41,1993-06-13,A,F]
 
 
 If “order by L_LINESTATUS” is added then error:
 sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
 l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS 
 limit 10").collect().foreach(println);
 
 2014-10-23 08:22:08,524 INFO  [main] parse.ParseDriver 
 (ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, 
 l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS 
 from lineitem order by L_LINESTATUS limit 10
 2014-10-23 08:22:08,525 INFO  [main] parse.ParseDriver 
 (ParseDriver.java:parse(197)) - Parse Completed
 2014-10-23 08:22:08,526 INFO  [main] metastore.HiveMetaStore 
 (HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
 2014-10-23 08:22:08,526 INFO  [main] HiveMetaStore.audit 
 (HiveMetaStore.java:logAuditEvent(239)) - ugi=hd ip=unknown-ip-addr  
 cmd=get_table : db=boc_12 tbl=lineitem  
 java.lang.NullPointerException
   at 
 org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
   at 
 org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
   at 
 org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:63)
   at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
   at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
   at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
   at 
 org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
   at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58

Re: Spark Hive Snappy Error

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi,

Removed “export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar”

It works, THANK YOU!!

Regards 
Arthur
 

On 23 Oct, 2014, at 1:00 pm, Shao, Saisai saisai.s...@intel.com wrote:

 It seems you just added the snappy library into your classpath:
  
 export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
  
 But Spark itself depends on snappy-0.2.jar. Is there any possibility 
 that this problem is caused by a different version of snappy?
  
 Thanks
 Jerry
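 
 A hedged isolation step along these lines (not a fix): temporarily switch 
 Spark's own codec away from snappy in spark-defaults.conf, e.g.
 
 spark.io.compression.codec    org.apache.spark.io.LZFCompressionCodec
 
 and re-run; if the UnsatisfiedLinkError disappears, a snappy classpath 
 conflict is the likely cause.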
  
 From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
 Sent: Thursday, October 23, 2014 11:32 AM
 To: Shao, Saisai
 Cc: arthur.hk.c...@gmail.com; user
 Subject: Re: Spark Hive Snappy Error
  
 Hi,
  
 Please find the attached file.
  
  
  
 my spark-defaults.conf
 # Default system properties included when running spark-submit.
 # This is useful for setting default environmental settings.
 #
 # Example:
 # spark.masterspark://master:7077
 # spark.eventLog.enabled  true
 # spark.eventLog.dir
   hdfs://namenode:8021/directory
 # spark.serializerorg.apache.spark.serializer.KryoSerializer
 #
 spark.executor.memory   2048m
 spark.shuffle.spill.compressfalse
 spark.io.compression.codec
 org.apache.spark.io.SnappyCompressionCodec
  
  
  
 my spark-env.sh
 #!/usr/bin/env bash
 export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
 export 
 CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar
 export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
 export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop}
 export SPARK_WORKER_DIR=/edh/hadoop_data/spark_work/
 export SPARK_LOG_DIR=/edh/hadoop_logs/spark
 export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
 export 
 SPARK_CLASSPATH=$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar
 export 
 SPARK_CLASSPATH=$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:
 export SPARK_WORKER_MEMORY=2g
 export HADOOP_HEAPSIZE=2000
 export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
 -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181"
 export SPARK_JAVA_OPTS=" -XX:+UseConcMarkSweepGC"
  
  
 ll $HADOOP_HOME/lib/native/Linux-amd64-64
 -rw-rw-r--. 1 tester tester    50523 Aug 27 14:12 hadoop-auth-2.4.1.jar
 -rw-rw-r--. 1 tester tester  1062640 Aug 27 12:19 libhadoop.a
 -rw-rw-r--. 1 tester tester  1487564 Aug 27 11:14 libhadooppipes.a
 lrwxrwxrwx. 1 tester tester       24 Aug 27 07:08 libhadoopsnappy.so -> libhadoopsnappy.so.0.0.1
 lrwxrwxrwx. 1 tester tester       24 Aug 27 07:08 libhadoopsnappy.so.0 -> libhadoopsnappy.so.0.0.1
 -rwxr-xr-x. 1 tester tester    54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1
 -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so
 -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so.1.0.0
 -rw-rw-r--. 1 tester tester   582472 Aug 27 11:14 libhadooputils.a
 -rw-rw-r--. 1 tester tester   298626 Aug 27 11:14 libhdfs.a
 -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so
 -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so.0.0.0
 lrwxrwxrwx. 1 tester tester       55 Aug 27 07:08 libjvm.so -> /usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so
 lrwxrwxrwx. 1 tester tester       25 Aug 27 07:08 libprotobuf-lite.so -> libprotobuf-lite.so.8.0.0
 lrwxrwxrwx. 1 tester tester       25 Aug 27 07:08 libprotobuf-lite.so.8 -> libprotobuf-lite.so.8.0.0
 -rwxr-xr-x. 1 tester tester   964689 Aug 27 07:08 libprotobuf-lite.so.8.0.0
 lrwxrwxrwx. 1 tester tester       20 Aug 27 07:08 libprotobuf.so -> libprotobuf.so.8.0.0
 lrwxrwxrwx. 1 tester tester       20 Aug 27 07:08 libprotobuf.so.8 -> libprotobuf.so.8.0.0
 -rwxr-xr-x. 1 tester tester  8300050 Aug 27 07:08 libprotobuf.so.8.0.0
 lrwxrwxrwx. 1 tester tester       18 Aug 27 07:08 libprotoc.so -> libprotoc.so.8.0.0
 lrwxrwxrwx. 1 tester tester       18 Aug 27 07:08 libprotoc.so.8 -> libprotoc.so.8.0.0
 -rwxr-xr-x. 1 tester tester  9935810 Aug 27 07:08 libprotoc.so.8.0.0
 -rw-r--r--. 1 tester tester   233554 Aug 27 15:19 libsnappy.a
 lrwxrwxrwx. 1 tester tester       23 Aug 27 11:32 libsnappy.so -> /usr/lib64/libsnappy.so
 lrwxrwxrwx. 1 tester tester       23 Aug 27 11:33 libsnappy.so.1 -> /usr/lib64/libsnappy.so
 -rwxr-xr-x. 1 tester tester   147726 Aug 27 07:08 libsnappy.so.1.2.0
 drwxr-xr-x. 2 tester tester 4096 Aug 27 07:08 pkgconfig
  
  
 Regards
 Arthur
  
  
 On 23 Oct, 2014, at 10:57 am, Shao, Saisai saisai.s...@intel.com wrote:
 
 
 Hi Arthur,
  
 I think your problem might be different from what 
 SPARK-3958 (https://issues.apache.org/jira/browse/SPARK-3958) mentioned; your 
 problem seems more likely to be a library link problem. Would you mind 
 checking your Spark runtime to see whether the snappy .so is loaded or not 
 (through lsof -p)?
  
 I guess your problem is more likely to be a library not found problem.
  
  
 Thanks

Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi,

I got a TreeNodeException; a few questions:
Q1) How should I do aggregation in Spark? Can I use aggregation directly in 
SQL? or
Q2) Should I use SQL to load the data to form an RDD and then use Scala to do the 
aggregation?

Regards
Arthur


My SQL (good one, without aggregation): 
sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE  
L_SHIPDATE <= '1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);
[A]
[N]
[N]
[R]


My SQL (problem SQL, with aggregation):
sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS SUM_QTY, 
SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT 
)) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( 1 + L_TAX )) 
AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, 
AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER  FROM LINEITEM WHERE  
L_SHIPDATE <= '1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);

14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks 
have all completed, from pool 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true
 Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200)
  Aggregate false, [l_returnflag#200,l_linestatus#201], 
[l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS 
sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS 
sum_disc_price#183,SUM(PartialSum#219) AS 
sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / 
CAST(SUM(PartialCount#221L), DoubleType)) AS 
avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / 
CAST(SUM(PartialCount#223L), DoubleType)) AS 
avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / 
CAST(SUM(PartialCount#225L), DoubleType)) AS 
avg_disc#187,SUM(PartialCount#226L) AS count_order#188L]
   Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200)
Aggregate true, [l_returnflag#200,l_linestatus#201], 
[l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS 
PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS 
PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS 
PartialSum#218,SUM(l_extendedprice#194) AS PartialSum#217,COUNT(l_quantity#193) 
AS PartialCount#221L,SUM(l_quantity#193) AS 
PartialSum#220,COUNT(l_extendedprice#194) AS 
PartialCount#223L,SUM(l_extendedprice#194) AS 
PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + 
l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216]
 Project 
[l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193]
  Filter (l_shipdate#197 <= 1998-09-02)
   HiveTableScan 
[l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193],
 (MetastoreRelation boc_12, lineitem, None), None

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.init(console:15)
at $iwC$$iwC$$iwC.init(console:20)
at $iwC$$iwC.init(console:22)
at $iwC.init(console:24)
at init(console:26)
at .init(console:30)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at 

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi,


My steps to create LINEITEM:
$HADOOP_HOME/bin/hadoop fs -mkdir /tpch/lineitem
$HADOOP_HOME/bin/hadoop fs -copyFromLocal lineitem.tbl /tpch/lineitem/

Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, 
L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, 
L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, 
L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE 
STRING, L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED 
AS TEXTFILE LOCATION '/tpch/lineitem';

Regards
Arthur
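
For reference, the RDD-side alternative asked about as Q2 in the quoted message 
below can also be expressed directly against the '|'-delimited file behind this 
external table. A minimal sketch (hedged: column positions follow the DDL above, 
the cut-off date mirrors the SQL query, and spark-shell with an existing sc is 
assumed):

// Hedged sketch: sum(L_QUANTITY) per (L_RETURNFLAG, L_LINESTATUS) with plain RDDs.
// Column positions from the DDL above: 4 = L_QUANTITY, 8 = L_RETURNFLAG,
// 9 = L_LINESTATUS, 10 = L_SHIPDATE.
import org.apache.spark.SparkContext._   // pair-RDD implicits (reduceByKey) in Spark 1.1

val lineitem = sc.textFile("/tpch/lineitem")
val sumQtyByFlag = lineitem
  .map(_.split('|'))
  .filter(cols => cols(10) <= "1998-09-02")
  .map(cols => ((cols(8), cols(9)), cols(4).toDouble))
  .reduceByKey(_ + _)

sumQtyByFlag.collect().foreach(println)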


On 23 Oct, 2014, at 9:36 pm, Yin Huai huaiyin@gmail.com wrote:

 Hello Arthur,
 
 You can do aggregations in SQL. How did you create LINEITEM?
 
 Thanks,
 
 Yin
 
 On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 I got a TreeNodeException; a few questions:
 Q1) How should I do aggregation in Spark? Can I use aggregation directly in 
 SQL? or
 Q2) Should I use SQL to load the data to form an RDD and then use Scala to do the 
 aggregation?
 
 Regards
 Arthur
 
 
 My SQL (good one, without aggregation): 
 sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE  
 L_SHIPDATE <= '1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
 L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);
 [A]
 [N]
 [N]
 [R]
 
 
 My SQL (problem SQL, with aggregation):
 sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS 
 SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - 
 L_DISCOUNT )) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( 
 1 + L_TAX )) AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) 
 AS AVG_PRICE, AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER  FROM 
 LINEITEM WHERE  L_SHIPDATE <= '1998-09-02'  GROUP  BY L_RETURNFLAG, 
 L_LINESTATUS ORDER  BY L_RETURNFLAG, 
 L_LINESTATUS").collect().foreach(println);
 
 14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks 
 have all completed, from pool 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
 Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true
  Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200)
   Aggregate false, [l_returnflag#200,l_linestatus#201], 
 [l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS 
 sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS 
 sum_disc_price#183,SUM(PartialSum#219) AS 
 sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / 
 CAST(SUM(PartialCount#221L), DoubleType)) AS 
 avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / 
 CAST(SUM(PartialCount#223L), DoubleType)) AS 
 avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / 
 CAST(SUM(PartialCount#225L), DoubleType)) AS 
 avg_disc#187,SUM(PartialCount#226L) AS count_order#188L]
Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200)
 Aggregate true, [l_returnflag#200,l_linestatus#201], 
 [l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS 
 PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS 
 PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS 
 PartialSum#218,SUM(l_extendedprice#194) AS 
 PartialSum#217,COUNT(l_quantity#193) AS PartialCount#221L,SUM(l_quantity#193) 
 AS PartialSum#220,COUNT(l_extendedprice#194) AS 
 PartialCount#223L,SUM(l_extendedprice#194) AS 
 PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + 
 l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216]
  Project 
 [l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193]
   Filter (l_shipdate#197 <= 1998-09-02)
HiveTableScan 
 [l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193],
  (MetastoreRelation boc_12, lineitem, None), None
 
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191)
   at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
   at $iwC$$iwC$$iwC$$iwC.init(console:15)
   at $iwC$$iwC$$iwC.init(console:20)
   at $iwC$$iwC.init(console:22)
   at $iwC.init(console:24)
   at init(console:26)
   at .init(console:30)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated)


Hi,

My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both 
Hive 0.12.0 and Spark 1.1.0; HiveQL works while Spark SQL fails. 

hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, 
o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 
'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = 
o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' 
group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, 
o_orderdate limit 10;
Ended Job = job_1414067367860_0011
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.0 sec   HDFS Read: 261 HDFS Write: 
96 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.88 sec   HDFS Read: 458 HDFS 
Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 880 msec
OK
Time taken: 38.771 seconds


scala> sqlContext.sql("select l_orderkey, sum(l_extendedprice*(1-l_discount)) 
as revenue, o_orderdate, o_shippriority from customer c join orders o on 
c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on 
l.l_orderkey = o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > 
'1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue 
desc, o_orderdate limit 10").collect().foreach(println);
org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in 
stage 5.0 failed 4 times, most recent failure: Lost task 14.3 in stage 5.0 (TID 
568, m34): java.lang.ClassCastException: java.lang.String cannot be cast to 
scala.math.BigDecimal
scala.math.Numeric$BigDecimalIsFractional$.minus(Numeric.scala:182)

org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114)

org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)

org.apache.spark.sql.catalyst.expressions.Multiply.eval(arithmetic.scala:70)

org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:47)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58)

org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:69)

org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:433)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2014-10-22 20:23:17,038 INFO  [sparkDriver-akka.actor.default-dispatcher-14] 
remote.RemoteActorRefProvider$RemotingTerminator 
(Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote daemon.
2014-10-22 20:23:17,039 INFO  [sparkDriver-akka.actor.default-dispatcher-14] 
remote.RemoteActorRefProvider$RemotingTerminator 
(Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding with 
flushing remote transports.

 
Regards
Arthur

On 17 Oct, 2014, at 9:33 am, Shao, Saisai saisai.s...@intel.com wrote:

 Hi Arthur,
  
 I think this is a known issue in Spark, you can check 
 SPARK-3958 (https://issues.apache.org/jira/browse/SPARK-3958). I’m curious 
 about it: can you always reproduce this issue? Is it related to some specific 
 data sets? Would you mind giving me some information about your workload, 
 Spark configuration, JDK version and OS version?
  
 Thanks
 Jerry
  
 From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
 Sent: Friday, October 17, 2014 7:13 AM
 To: user
 Cc: arthur.hk.c...@gmail.com
 Subject: Spark Hive Snappy Error
  
 Hi,
  
 When trying Spark with a Hive table, I got the “java.lang.UnsatisfiedLinkError: 
 org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error,
  
  
 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext.sql("select count(1) from q8_national_market_share")
 sqlContext.sql("select count(1) from 
 q8_national_market_share").collect().foreach(println)
 java.lang.UnsatisfiedLinkError: 
 org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
 org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79)
  at 
 org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
  at 
 org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
  at 
 org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
  at 
 org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68)
  at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
  at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
  at 
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
  at 
 org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:68)
  at 
 org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68)
  at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
  at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
  at 
 org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
  at 
 org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
  at 
 org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
  at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
  at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
  at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
  at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
  at $iwC$$iwC$$iwC$$iwC.init(console:15)
  at $iwC$$iwC$$iwC.init(console:20)
  at $iwC$$iwC.init(console:22)
  at $iwC.init(console:24)
  at init(console:26)
  at .init(console:30

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

FYI, I use snappy-java-1.0.4.1.jar

Regards
Arthur
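
For reference, a quick spark-shell sketch (standard JDK reflection only) to 
confirm which jar the snappy-java classes are actually loaded from at runtime:

// Hedged sketch: print the code source of the snappy-java classes.
val snappySource = classOf[org.xerial.snappy.Snappy]
  .getProtectionDomain.getCodeSource.getLocation
println("snappy-java loaded from: " + snappySource)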


On 22 Oct, 2014, at 8:59 pm, Shao, Saisai saisai.s...@intel.com wrote:

 Thanks a lot, I will try to reproduce this in my local settings and dig into 
 the details, thanks for your information.
  
  
 BR
 Jerry
  
 From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
 Sent: Wednesday, October 22, 2014 8:35 PM
 To: Shao, Saisai
 Cc: arthur.hk.c...@gmail.com; user
 Subject: Re: Spark Hive Snappy Error
  
 Hi,
  
 Yes, I can always reproduce the issue:
  
 “about your workload, Spark configuration, JDK version and OS version?”
 
 I ran SparkPi 1000
  
 java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
  
 cat /etc/centos-release
 CentOS release 6.5 (Final)
  
 My Spark’s hive-site.xml has the following:
  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
  </property>
 
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
 
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
  
 e.g.
 MASTER=spark://m1:7077,m2:7077 ./bin/run-example SparkPi 1000
 2014-10-22 20:23:17,033 ERROR [sparkDriver-akka.actor.default-dispatcher-18] 
 actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal 
 error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down 
 ActorSystem [sparkDriver]
 java.lang.UnsatisfiedLinkError: 
 org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
 org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79)
  at 
 org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
  at 
 org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
  at 
 org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
  at 
 org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68)
  at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
  at 
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
  at 
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
  at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:829)
  at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:769)
  at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:753)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1360)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 2014-10-22 20:23:17,036 INFO  [main] scheduler.DAGScheduler 
 (Logging.scala:logInfo(59)) - Failed to run reduce at SparkPi.scala:35
 Exception in thread main org.apache.spark.SparkException: Job cancelled 
 because SparkContext was shut down
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:694)
  at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:693)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
  at 
 org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:693)
  at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1399)
  at 
 akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
  at 
 akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
  at akka.actor.ActorCell.terminate(ActorCell.scala:338)
  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I just tried the sample Pi calculation on a Spark cluster; after returning the Pi 
result, it shows ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m37,35662) not found

./bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master 
spark://m33:7077   --executor-memory 512m  --total-executor-cores 40  
examples/target/spark-examples_2.10-1.1.0.jar 100

14/10/23 05:09:03 INFO TaskSetManager: Finished task 87.0 in stage 0.0 (TID 87) 
in 346 ms on m134.emblocsoft.net (99/100)
14/10/23 05:09:03 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) 
in 262 ms on m134.emblocsoft.net (100/100)
14/10/23 05:09:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
14/10/23 05:09:03 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) 
finished in 2.597 s
14/10/23 05:09:03 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, 
took 2.725328861 s
Pi is roughly 3.1414948
14/10/23 05:09:03 INFO SparkUI: Stopped Spark web UI at http://m33:4040
14/10/23 05:09:03 INFO DAGScheduler: Stopping DAGScheduler
14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Asking each executor to 
shut down
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@37852165
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m37,35662)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m37,35662)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m37,35662) not found
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@37852165
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m36,34230)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m35,50371)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m36,34230)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m36,34230) not found
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m35,50371)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m35,50371) not found
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m33,39517)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m33,39517)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m33,39517) not found
14/10/23 05:09:04 ERROR SendingConnection: Exception while reading 
SendingConnection to ConnectionManagerId(m34,41562)
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at 
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/23 05:09:04 INFO ConnectionManager: Handling connection error on 
connection to ConnectionManagerId(m34,41562)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@2e0b5c4a
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@2e0b5c4a
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@653f8844
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@653f8844
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I have managed to resolve it; it was caused by a wrong setting. Please ignore this.

Regards
Arthur

On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 
 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up
 



Spark: Order by Failed, java.lang.NullPointerException

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I got java.lang.NullPointerException. Please help!


sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 
10").collect().foreach(println);

2014-10-23 08:20:12,024 INFO  [sparkDriver-akka.actor.default-dispatcher-31] 
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at 
basicOperators.scala:136) finished in 0.086 s
2014-10-23 08:20:12,024 INFO  [Result resolver thread-1] 
scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 41.0, 
whose tasks have all completed, from pool 
2014-10-23 08:20:12,024 INFO  [main] spark.SparkContext 
(Logging.scala:logInfo(59)) - Job finished: runJob at basicOperators.scala:136, 
took 0.090129332 s
[9001,6,-4584121,17,1997-01-04,N,O]
[9002,1,-2818574,23,1996-02-16,N,O]
[9002,2,-2449102,21,1993-12-12,A,F]
[9002,3,-5810699,26,1994-04-06,A,F]
[9002,4,-489283,18,1994-11-11,R,F]
[9002,5,2169683,15,1997-09-14,N,O]
[9002,6,2405081,4,1992-08-03,R,F]
[9002,7,3835341,40,1998-04-28,N,O]
[9003,1,1900071,4,1994-05-05,R,F]
[9004,1,-2614665,41,1993-06-13,A,F]


If “order by L_LINESTATUS” is added then error:
sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS 
limit 10").collect().foreach(println);

2014-10-23 08:22:08,524 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, 
l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS 
from lineitem order by L_LINESTATUS limit 10
2014-10-23 08:22:08,525 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-23 08:22:08,526 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
2014-10-23 08:22:08,526 INFO  [main] HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(239)) - ugi=hd   ip=unknown-ip-addr  
cmd=get_table : db=boc_12 tbl=lineitem  
java.lang.NullPointerException
at 
org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
at 
org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
at 
org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:63)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

Please find the attached file (lsof.rtf; the lsof output is inlined below).

lsof -p 16459 (Master)
COMMAND   PIDUSER   FD   TYPE DEVICE  SIZE/OFF NODE NAME\
java16459 tester  cwdDIR  253,2  4096  6039786 /hadoop/spark-1.1.0_patched\
java16459 tester  rtdDIR  253,0  40962 /\
java16459 tester  txtREG  253,0 12150  2780995 /usr/lib/jvm/jdk1.7.0_67/bin/java\
java16459 tester  memREG  253,0156928  2228230 /lib64/ld-2.12.so\
java16459 tester  memREG  253,0   1926680  2228250 /lib64/libc-2.12.so\
java16459 tester  memREG  253,0145896  2228251 /lib64/libpthread-2.12.so\
java16459 tester  memREG  253,0 22536  2228254 /lib64/libdl-2.12.so\
java16459 tester  memREG  253,0109006  2759278 /usr/lib/jvm/jdk1.7.0_67/lib/amd64/jli/libjli.so\
java16459 tester  memREG  253,0599384  2228264 /lib64/libm-2.12.so\
java16459 tester  memREG  253,0 47064  2228295 /lib64/librt-2.12.so\
java16459 tester  memREG  253,0113952  2228328 /lib64/libresolv-2.12.so\
java16459 tester  memREG  253,0  99158576  2388225 /usr/lib/locale/locale-archive\
java16459 tester  memREG  253,0 27424  2228249 /lib64/libnss_dns-2.12.so\
java16459 tester  memREG  253,2 138832345  6555616 /hadoop/spark-1.1.0_patched/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.4.1.jar\
java16459 tester  memREG  253,0580624  2893171 /usr/lib/jvm/jdk1.7.0_67/jre/lib/jsse.jar\
java16459 tester  memREG  253,0114742  2893221 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnet.so\
java16459 tester  memREG  253,0 91178  2893222 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnio.so\
java16459 tester  memREG  253,2   1769726  6816963 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-rdbms-3.2.1.jar\
java16459 tester  memREG  253,2337012  6816961 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar\
java16459 tester  memREG  253,2   1801810  6816962 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-core-3.2.2.jar\
java16459 tester  memREG  253,2 25153  7079998 /hadoop/hive-0.12.0-bin/csv-serde-1.1.2-0.11.0-all.jar\
java16459 tester  memREG  253,2 21817  6032989 /hadoop/hbase-0.98.5-hadoop2/lib/gmbal-api-only-3.0.0-b023.jar\
java16459 tester  memREG  253,2177131  6032940 /hadoop/hbase-0.98.5-hadoop2/lib/jetty-util-6.1.26.jar\
java16459 tester  memREG  253,2 32677  6032915 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop-compat-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2143602  6032959 /hadoop/hbase-0.98.5-hadoop2/lib/commons-digester-1.8.jar\
java16459 tester  memREG  253,2 97738  6032917 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-prefix-tree-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2 17884  6032949 /hadoop/hbase-0.98.5-hadoop2/lib/jackson-jaxrs-1.8.8.jar\
java16459 tester  memREG  253,2253086  6032987 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-2.1.2.jar\
java16459 tester  memREG  253,2 73778  6032916 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop2-compat-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2336904  6032993 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-servlet-2.1.2.jar\
java16459 tester  memREG  253,2927415  6032914 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-client-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2125740  6033008 /hadoop/hbase-0.98.5-hadoop2/lib/hadoop-yarn-server-applicationhistoryservice-2.4.1.jar\
java16459 tester  memREG  253,2 15010  6032936 /hadoop/hbase-0.98.5-hadoop2/lib/xmlenc-0.52.jar\
java16459 tester  memREG  253,2 60686  6032926 /hadoop/hbase-0.98.5-hadoop2/lib/commons-logging-1.1.1.jar\
java16459 tester  memREG  253,2259600  6032927 /hadoop/hbase-0.98.5-hadoop2/lib/commons-codec-1.7.jar\
java16459 tester  memREG  253,2321806  6032957 /hadoop/hbase-0.98.5-hadoop2/lib/jets3t-0.6.1.jar\
java16459 tester  memREG  253,2 85353  6032982 /hadoop/hbase-0.98.5-hadoop2/lib/javax.servlet-api-3.0.1.jar\
java16459 tester  mem

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi

May I know where to configure Spark to load libhadoop.so?

Regards
Arthur
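
One way to point Spark at the directory containing libhadoop.so / libsnappy.so is through the extra-library-path settings. A minimal sketch (the property names are standard Spark 1.0/1.1 configuration keys; the path below is an assumption and should be replaced with your own $HADOOP_HOME/lib/native/Linux-amd64-64; for the driver itself the library path usually has to be supplied at launch time, e.g. in spark-defaults.conf or via spark-submit's --driver-library-path, rather than after the shell is already running):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical path -- substitute the directory that actually holds
// libhadoop.so and libsnappy.so on your nodes.
val nativeLibDir = "/hadoop/hadoop-2.4.1/lib/native/Linux-amd64-64"

val conf = new SparkConf()
  .setAppName("native-lib-path-example")
  .set("spark.executor.extraLibraryPath", nativeLibDir)   // executors
  .set("spark.driver.extraLibraryPath", nativeLibDir)     // driver (takes effect at launch)

val sc = new SparkContext(conf)
println(System.getProperty("java.library.path"))          // sanity check on the driver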

On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 Please find the attached file.
 
 lsof.rtf
 
 
 my spark-default.xml
 # Default system properties included when running spark-submit.
 # This is useful for setting default environmental settings.
 #
 # Example:
 # spark.master            spark://master:7077
 # spark.eventLog.enabled  true
 # spark.eventLog.dir      hdfs://namenode:8021/directory
 # spark.serializer        org.apache.spark.serializer.KryoSerializer
 #
 spark.executor.memory         2048m
 spark.shuffle.spill.compress  false
 spark.io.compression.codec    org.apache.spark.io.SnappyCompressionCodec
 
 
 
 my spark-env.sh
 #!/usr/bin/env bash
 export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
 export 
 CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar
 export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
 export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop}
 export SPARK_WORKER_DIR=/edh/hadoop_data/spark_work/
 export SPARK_LOG_DIR=/edh/hadoop_logs/spark
 export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
 export 
 SPARK_CLASSPATH=$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar
 export 
 SPARK_CLASSPATH=$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:
 export SPARK_WORKER_MEMORY=2g
 export HADOOP_HEAPSIZE=2000
 export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
 -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181"
 export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC"
 
 
 ll $HADOOP_HOME/lib/native/Linux-amd64-64
 -rw-rw-r--. 1 tester tester50523 Aug 27 14:12 hadoop-auth-2.4.1.jar
 -rw-rw-r--. 1 tester tester  1062640 Aug 27 12:19 libhadoop.a
 -rw-rw-r--. 1 tester tester  1487564 Aug 27 11:14 libhadooppipes.a
 lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so -> 
 libhadoopsnappy.so.0.0.1
 lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so.0 -> 
 libhadoopsnappy.so.0.0.1
 -rwxr-xr-x. 1 tester tester54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1
 -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so
 -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so.1.0.0
 -rw-rw-r--. 1 tester tester   582472 Aug 27 11:14 libhadooputils.a
 -rw-rw-r--. 1 tester tester   298626 Aug 27 11:14 libhdfs.a
 -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so
 -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so.0.0.0
 lrwxrwxrwx. 1 tester tester 55 Aug 27 07:08 libjvm.so - 
 /usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so
 lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so - 
 libprotobuf-lite.so.8.0.0
 lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so.8 
 - libprotobuf-lite.so.8.0.0
 -rwxr-xr-x. 1 tester tester   964689 Aug 27 07:08 
 libprotobuf-lite.so.8.0.0
 lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so - 
 libprotobuf.so.8.0.0
 lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so.8 - 
 libprotobuf.so.8.0.0
 -rwxr-xr-x. 1 tester tester  8300050 Aug 27 07:08 libprotobuf.so.8.0.0
 lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so - 
 libprotoc.so.8.0.0
 lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so.8 - 
 libprotoc.so.8.0.0
 -rwxr-xr-x. 1 tester tester  9935810 Aug 27 07:08 libprotoc.so.8.0.0
 -rw-r--r--. 1 tester tester   233554 Aug 27 15:19 libsnappy.a
 lrwxrwxrwx. 1 tester tester   23 Aug 27 11:32 libsnappy.so -> 
 /usr/lib64/libsnappy.so
 lrwxrwxrwx. 1 tester tester   23 Aug 27 11:33 libsnappy.so.1 -> 
 /usr/lib64/libsnappy.so
 -rwxr-xr-x. 1 tester tester   147726 Aug 27 07:08 libsnappy.so.1.2.0
 drwxr-xr-x. 2 tester tester 4096 Aug 27 07:08 pkgconfig
 
 
 Regards
 Arthur
 
 
 On 23 Oct, 2014, at 10:57 am, Shao, Saisai saisai.s...@intel.com wrote:
 
 Hi Arthur,
  
 I think your problem might be different from what 
 SPARK-3958(https://issues.apache.org/jira/browse/SPARK-3958) mentioned, 
 seems your problem is more likely to be a library link problem, would you 
 mind checking your Spark runtime to see if the snappy.so is loaded or not? 
 (through lsof -p).
  
 I guess your problem is more likely to be a library not found problem.
  
  
 Thanks
 Jerry
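
A quick way to exercise the same code path from the spark-shell, complementing the lsof check suggested above (a sketch; it simply calls the xerial snappy-java API that Spark's SnappyCompressionCodec uses, and will fail with the same UnsatisfiedLinkError if the native library cannot be loaded):

import org.xerial.snappy.Snappy

val sample     = "snappy smoke test".getBytes("UTF-8")
val compressed = Snappy.compress(sample)      // fails here if the native library is missing
val restored   = Snappy.uncompress(compressed)
println(new String(restored, "UTF-8"))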
  
 



Spark/HIVE Insert Into values Error

2014-10-17 Thread arthur.hk.c...@gmail.com
Hi,

When trying to insert records into HIVE, I got an error.

My Spark is 1.1.0 and Hive 0.12.0

Any idea what would be wrong?
Regards
Arthur



hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
OK

hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
NoViableAltException(26@[])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' 
''fred flintstone'' in select clause
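
Hive only added the INSERT INTO ... VALUES syntax in a later release (0.14), so the Hive 0.12 parser rejects it, which is what the ParseException above shows. A hedged sketch of the usual workaround, driven from the Spark shell: emulate it with INSERT ... SELECT of constant expressions from an existing one-row helper table (the table name `dual` here is an assumption; any table guaranteed to contain at least one row will do):

// Assumes the `students` table above already exists and that a one-row
// helper table called `dual` is available in the metastore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Hive 0.12 has no INSERT ... VALUES, so insert constants via SELECT.
hiveContext.hql(
  "INSERT INTO TABLE students " +
  "SELECT 'fred flintstone', 35, 1 FROM dual LIMIT 1")

hiveContext.hql("SELECT * FROM students").collect().foreach(println)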




Spark Hive Snappy Error

2014-10-16 Thread arthur.hk.c...@gmail.com
Hi,

When trying Spark with a Hive table, I got the “java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error:


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select count(1) from q8_national_market_share").collect().foreach(println)
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79)
at 
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at 
org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:68)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at 

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread arthur.hk.c...@gmail.com
Hi,

Thank you so much!

By the way, what is the DATEADD function in Scala/Spark? Or how can I implement 
"DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark 
or Hive? 

Regards
Arthur
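
There is no DATEADD built into Hive 0.12 or Spark SQL 1.1 beyond date_add/date_sub, which work in days, so one pragmatic option is to do the month/year arithmetic on the driver side and splice the resulting literal into the query. A sketch, assuming the dates are plain 'yyyy-MM-dd' strings:

import java.text.SimpleDateFormat
import java.util.Calendar

val fmt = new SimpleDateFormat("yyyy-MM-dd")

// Emulates DATEADD(MONTH, n, d).
def addMonths(date: String, n: Int): String = {
  val cal = Calendar.getInstance()
  cal.setTime(fmt.parse(date))
  cal.add(Calendar.MONTH, n)
  fmt.format(cal.getTime)
}

// Emulates DATEADD(YEAR, n, d).
def addYears(date: String, n: Int): String = addMonths(date, 12 * n)

val d1 = addMonths("2013-07-01", 3)   // "2013-10-01"
val d2 = addYears("2014-01-01", 1)    // "2015-01-01"
// e.g. hiveContext.hql(s"SELECT ... FROM orders WHERE o_orderdate < '$d1'")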


On 12 Oct, 2014, at 12:03 pm, Ilya Ganelin ilgan...@gmail.com wrote:

 Because of how closures work in Scala, there is no support for nested 
 map/rdd-based operations. Specifically, if you have
 
 Context a {
 Context b {
 
 }
 }
 
 Operations within context b, when distributed across nodes, will no longer 
 have visibility of variables specific to context a because that context is 
 not distributed alongside that operation!
 
 To get around this you need to serialize your operations. For example , run a 
 map job. Take the output of that and run a second map job to filter. Another 
 option is to run two separate map jobs and join their results. Keep in mind 
 that another useful technique is to execute the groupByKey routine , 
 particularly if you want to operate on a particular variable.
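
A small illustration of the approach described above, with made-up data (the point is only the shape: two independent RDD stages combined by a join, instead of referring to one RDD from inside another RDD's closure):

import org.apache.spark.SparkContext._   // pair-RDD operations such as join

// Hypothetical (orderKey, supplierKey) pairs from two sources.
val l1 = sc.parallelize(Seq((1L, 10L), (2L, 20L), (3L, 30L)))
val l2 = sc.parallelize(Seq((1L, 11L), (2L, 20L), (3L, 30L)))

// Step 1: run the first transformation on its own.
val keyed = l1.map { case (order, supp) => (order, supp) }

// Step 2: combine with the second data set via a join, then filter --
// this replaces the unsupported "RDD operation inside another RDD" pattern.
val differentSupplier = keyed.join(l2).filter { case (_, (s1, s2)) => s1 != s2 }

differentSupplier.collect().foreach(println)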
 
 On Oct 11, 2014 11:09 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 
 subquery in my Spark SQL, below are my sample table structures and a SQL that 
 contains more than 1 subquery. 
 
 Question 1:  How to load a HIVE table into Scala/Spark?
 Question 2:  How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY  in 
 SCALA/SPARK?
 Question 3:  What is the DATEADD function in Scala/Spark? or how to implement 
  DATEADD(MONTH, 3, '2013-07-01')” and DATEADD(YEAR, 1, '2014-01-01')” in 
 Spark or Hive? 
 I can find HIVE (date_add(string startdate, int days)) but it is in days not 
 MONTH / YEAR.
 
 Thanks.
 
 Regards
 Arthur
 
 ===
 My sample SQL with more than 1 subquery: 
 SELECT S_NAME, 
COUNT(*) AS NUMWAIT 
 FROM   SUPPLIER, 
LINEITEM L1, 
ORDERS
 WHERE  S_SUPPKEY = L1.L_SUPPKEY 
AND O_ORDERKEY = L1.L_ORDERKEY 
AND O_ORDERSTATUS = 'F' 
AND L1.L_RECEIPTDATE  L1.L_COMMITDATE 
AND EXISTS (SELECT * 
FROM   LINEITEM L2 
WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY 
   AND L2.L_SUPPKEY  L1.L_SUPPKEY) 
AND NOT EXISTS (SELECT * 
FROM   LINEITEM L3 
WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY 
   AND L3.L_SUPPKEY  L1.L_SUPPKEY 
   AND L3.L_RECEIPTDATE  L3.L_COMMITDATE) 
 GROUP  BY S_NAME 
 ORDER  BY NUMWAIT DESC, S_NAME
 limit 100;  
 
 
 ===
 Supplier Table:
 CREATE TABLE IF NOT EXISTS SUPPLIER (
 S_SUPPKEY INTEGER PRIMARY KEY,
 S_NAMECHAR(25),
 S_ADDRESS VARCHAR(40),
 S_NATIONKEY   BIGINT NOT NULL, 
 S_PHONE   CHAR(15),
 S_ACCTBAL DECIMAL,
 S_COMMENT VARCHAR(101)
 ) 
 
 ===
 Order Table:
 CREATE TABLE IF NOT EXISTS ORDERS (
 O_ORDERKEYINTEGER PRIMARY KEY,
 O_CUSTKEY BIGINT NOT NULL, 
 O_ORDERSTATUS CHAR(1),
 O_TOTALPRICE  DECIMAL,
 O_ORDERDATE   CHAR(10),
 O_ORDERPRIORITY   CHAR(15),
 O_CLERK   CHAR(15),
 O_SHIPPRIORITYINTEGER,
 O_COMMENT VARCHAR(79)
 
 ===
 LineItem Table:
 CREATE TABLE IF NOT EXISTS LINEITEM (
 L_ORDERKEY  BIGINT not null,
 L_PARTKEY   BIGINT,
 L_SUPPKEY   BIGINT,
 L_LINENUMBERINTEGER not null,
 L_QUANTITY  DECIMAL,
 L_EXTENDEDPRICE DECIMAL,
 L_DISCOUNT  DECIMAL,
 L_TAX   DECIMAL,
 L_SHIPDATE  CHAR(10),
 L_COMMITDATECHAR(10),
 L_RECEIPTDATE   CHAR(10),
 L_RETURNFLAGCHAR(1),
 L_LINESTATUSCHAR(1),
 L_SHIPINSTRUCT  CHAR(25),
 L_SHIPMODE  CHAR(10),
 L_COMMENT   VARCHAR(44),
 CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
 )
 



How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread arthur.hk.c...@gmail.com
Hi,

My Spark version is v1.1.0 and Hive is 0.12.0. I need to use more than one 
subquery in my Spark SQL. Below are my sample table structures and a SQL 
statement that contains more than one subquery. 

Question 1:  How to load a HIVE table into Scala/Spark?
Question 2:  How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY  in SCALA/SPARK?
Question 3:  What is the DATEADD function in Scala/Spark? Or how can I implement 
"DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark 
or Hive? 
I can find HIVE's date_add(string startdate, int days), but it works in days, not 
in MONTH / YEAR.

Thanks.

Regards
Arthur

===
My sample SQL with more than 1 subquery: 
SELECT S_NAME, 
   COUNT(*) AS NUMWAIT 
FROM   SUPPLIER, 
   LINEITEM L1, 
   ORDERS
WHERE  S_SUPPKEY = L1.L_SUPPKEY 
   AND O_ORDERKEY = L1.L_ORDERKEY 
   AND O_ORDERSTATUS = 'F' 
   AND L1.L_RECEIPTDATE > L1.L_COMMITDATE 
   AND EXISTS (SELECT * 
   FROM   LINEITEM L2 
   WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY 
  AND L2.L_SUPPKEY <> L1.L_SUPPKEY) 
   AND NOT EXISTS (SELECT * 
   FROM   LINEITEM L3 
   WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY 
  AND L3.L_SUPPKEY <> L1.L_SUPPKEY 
  AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) 
GROUP  BY S_NAME 
ORDER  BY NUMWAIT DESC, S_NAME
limit 100;  


===
Supplier Table:
CREATE TABLE IF NOT EXISTS SUPPLIER (
S_SUPPKEY   INTEGER PRIMARY KEY,
S_NAME  CHAR(25),
S_ADDRESS   VARCHAR(40),
S_NATIONKEY BIGINT NOT NULL, 
S_PHONE CHAR(15),
S_ACCTBAL   DECIMAL,
S_COMMENT   VARCHAR(101)
) 

===
Order Table:
CREATE TABLE IF NOT EXISTS ORDERS (
O_ORDERKEY  INTEGER PRIMARY KEY,
O_CUSTKEY   BIGINT NOT NULL, 
O_ORDERSTATUS   CHAR(1),
O_TOTALPRICEDECIMAL,
O_ORDERDATE CHAR(10),
O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15),
O_SHIPPRIORITY  INTEGER,
O_COMMENT   VARCHAR(79)
)

===
LineItem Table:
CREATE TABLE IF NOT EXISTS LINEITEM (
L_ORDERKEY  BIGINT not null,
L_PARTKEY   BIGINT,
L_SUPPKEY   BIGINT,
L_LINENUMBERINTEGER not null,
L_QUANTITY  DECIMAL,
L_EXTENDEDPRICE DECIMAL,
L_DISCOUNT  DECIMAL,
L_TAX   DECIMAL,
L_SHIPDATE  CHAR(10),
L_COMMITDATECHAR(10),
L_RECEIPTDATE   CHAR(10),
L_RETURNFLAGCHAR(1),
L_LINESTATUSCHAR(1),
L_SHIPINSTRUCT  CHAR(25),
L_SHIPMODE  CHAR(10),
L_COMMENT   VARCHAR(44),
CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
)
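
A minimal sketch for Question 1 and for the join/aggregation part of Question 2 (Spark 1.1 with a HiveContext; the EXISTS / NOT EXISTS subqueries are not shown here and generally need to be rewritten as additional joins, since they are not supported as-is on this stack):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Question 1: a Hive table loaded into Spark becomes a SchemaRDD.
val supplier = hiveContext.hql("SELECT * FROM supplier")
supplier.take(5).foreach(println)

// Question 2 (partial): multi-table joins and aggregation expressed in
// HiveQL and executed by Spark SQL.
val late = hiveContext.hql(
  """SELECT s.s_name, COUNT(*) AS numwait
    |FROM supplier s
    |JOIN lineitem l ON s.s_suppkey = l.l_suppkey
    |JOIN orders o ON o.o_orderkey = l.l_orderkey
    |WHERE o.o_orderstatus = 'F'
    |  AND l.l_receiptdate > l.l_commitdate
    |GROUP BY s.s_name""".stripMargin)

late.collect().foreach(println)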



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread arthur.hk.c...@gmail.com
Wonderful !!

On 11 Oct, 2014, at 12:00 am, Nan Zhu zhunanmcg...@gmail.com wrote:

 Great! Congratulations!
 
 -- 
 Nan Zhu
 On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
 
 Brilliant stuff ! Congrats all :-)
 This is indeed really heartening news !
 
 Regards,
 Mridul
 
 
 On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some 
 pretty cool news for the project, which is that we've been able to use 
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
 faster on 10x fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few 
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
 providing the machines to make this possible. Finally, this result would of 
 course not be possible without the many many other contributions, testing 
 and feature requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and 
 it's thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



How to save Spark log into file

2014-10-03 Thread arthur.hk.c...@gmail.com
Hi,

How can the Spark log be saved to a file instead of being shown on the console? 

Below is my conf/log4j.properties

conf/log4j.properties
###
# Root logger option
log4j.rootLogger=INFO, file

# Direct log messages to a log file
log4j.appender.file=org.apache.log4j.RollingFileAppender

#Redirect to Tomcat logs folder
log4j.appender.file.File=/hadoop_logs/spark/spark_logging.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p 
%c{1}:%L - %m%n
###


I tried stopping and starting Spark again; it still shows INFO/WARN logs on the console.
Any ideas?

Regards
Arthur
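
If conf/log4j.properties is being ignored, one common cause is that a different log4j.properties earlier on the classpath wins. A sketch of a workaround that sidesteps the classpath question by attaching the file appender programmatically from the spark-shell (standard log4j 1.2 API; the file path is taken from the configuration above and is otherwise an assumption):

import org.apache.log4j.{Level, Logger, PatternLayout, RollingFileAppender}

val layout   = new PatternLayout("%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n")
val appender = new RollingFileAppender(layout, "/hadoop_logs/spark/spark_logging.log", true)
appender.setMaxFileSize("10MB")
appender.setMaxBackupIndex(10)

val root = Logger.getRootLogger
root.addAppender(appender)
root.setLevel(Level.INFO)   // everything at INFO and above now also goes to the file

Note that this only affects the driver JVM; executors would still need the properties file on their own classpath.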






scala> hiveContext.hql("show tables")
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
2014-10-03 19:35:01,554 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 19:35:01,715 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 19:35:01,845 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=Driver.run
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=TimeToSubmit
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=compile
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=parse
2014-10-03 19:35:01,863 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 19:35:01,863 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - /PERFLOG method=parse start=1412336101863 
end=1412336101863 duration=0
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=semanticAnalyze
2014-10-03 19:35:01,941 INFO  [main] ql.Driver (Driver.java:compile(450)) - 
Semantic Analysis Completed
2014-10-03 19:35:01,941 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - /PERFLOG method=semanticAnalyze 
start=1412336101863 end=1412336101941 duration=78
2014-10-03 19:35:01,979 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(338)) - Initializing Self 0 OP
2014-10-03 19:35:01,980 INFO  [main] exec.ListSinkOperator 
(Operator.java:initializeChildren(403)) - Operator 0 OP initialized
2014-10-03 19:35:01,980 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(378)) - Initialization Done 0 OP
2014-10-03 19:35:01,985 INFO  [main] ql.Driver (Driver.java:getSchema(264)) - 
Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, 
type:string, comment:from deserializer)], properties:null)
2014-10-03 19:35:01,985 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - /PERFLOG method=compile 
start=1412336101847 end=1412336101985 duration=138
2014-10-03 19:35:01,985 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=Driver.execute
2014-10-03 19:35:01,985 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
2014-10-03 19:35:01,986 INFO  [main] ql.Driver (Driver.java:execute(1117)) - 
Starting command: show tables
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - /PERFLOG method=TimeToSubmit 
start=1412336101847 end=1412336101994 duration=147
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=runTasks
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - PERFLOG method=task.DDL.Stage-0
2014-10-03 19:35:02,019 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:newRawStore(411)) - 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
2014-10-03 19:35:02,034 INFO  [main] metastore.ObjectStore 
(ObjectStore.java:initialize(232)) - ObjectStore, initialize called
2014-10-03 19:35:02,084 WARN  [main] DataNucleus.General 
(Log4JLogger.java:warn(96)) - Plugin (Bundle) org.datanucleus.api.jdo is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
file://hadoop/spark/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar is already 
registered, and you are trying to register an identical plugin located at URL 
file://hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar.
2014-10-03 19:35:02,108 WARN  [main] DataNucleus.General 
(Log4JLogger.java:warn(96)) - Plugin (Bundle) org.datanucleus is already 
registered. Ensure you dont have multiple JAR 

object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, 

I have tried to run HBaseTest.scala, but I got the following errors. Any ideas 
on how to fix them?

Q1) 
scala> package org.apache.spark.examples
<console>:1: error: illegal start of definition
   package org.apache.spark.examples


Q2) 
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
<console>:31: error: object hbase is not a member of package org.apache.hadoop
   import org.apache.hadoop.hbase.mapreduce.TableInputFormat



Regards
Arthur
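
Two notes that may help here (hedged, based only on the output above): the Q1 failure is expected, because the Scala REPL does not accept package declarations at all, so the package line simply has to be dropped when pasting the example into spark-shell; the Q2 failure typically means the HBase client jars are not on the shell's classpath (for instance, they can be added with spark-shell's --jars option or via SPARK_CLASSPATH). Once they are visible, a sketch like the following resolves:

// Paste without the `package` line; the HBase jars must already be on the
// spark-shell classpath (e.g. passed with --jars -- an assumption about setup).
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "some_table")   // hypothetical table name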

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
        </artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-auth</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-annotations</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-hdfs</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.apache.hbase</groupId>
+        <artifactId>hbase-hadoop1-compat</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.apache.commons</groupId>
+        <artifactId>commons-math</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>com.sun.jersey</groupId>
+        <artifactId>jersey-core</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>org.slf4j</groupId>
+        <artifactId>slf4j-api</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>com.sun.jersey</groupId>
+        <artifactId>jersey-server</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>com.sun.jersey</groupId>
+        <artifactId>jersey-core</artifactId>
+      </exclusion>
+      <exclusion>
+        <groupId>com.sun.jersey</groupId>
+        <artifactId>jersey-json</artifactId>
+      </exclusion>
+      <exclusion>
+        <!-- hbase uses v2.4, which is better, but ... -->
+        <groupId>commons-io</groupId>
+        <artifactId>commons-io</artifactId>
+      </exclusion>
+    </exclusions>
+  </dependency>
+  <dependency>
+    <groupId>org.apache.hbase</groupId>
+    <artifactId>hbase-hadoop-compat</artifactId>
+    <version>${hbase.version}</version>
+  </dependency>
+  <dependency>
+    <groupId>org.apache.hbase</groupId>
+    <artifactId>hbase-hadoop-compat</artifactId>
+    <version>${hbase.version}</version>
+    <type>test-jar</type>
+    <scope>test</scope>
+  </dependency>
   <dependency>
     <groupId>com.twitter</groupId>
     <artifactId>algebird-core_${scala.binary.version}</artifactId>

Please advise.

Regards
Arthur

On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:

 Spark examples builds against hbase 0.94 by default.
 
 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
 
 Cheers
 
 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 I have tried to run HBaseTest.scala, but I got the following errors. Any 
 ideas on how to fix them?
 
 Q1)
 scala> package org.apache.spark.examples
 <console>:1: error: illegal start of definition
    package org.apache.spark.examples
 
 Q2)
 scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 <console>:31: error: object hbase is not a member of package org.apache.hadoop
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
 Regards
 Arthur


Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!

patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

Still got errors.

Regards
Arthur

On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:

 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
 are good here,  but not spark-1297-v5.txt:
 
 
 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml
 
 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch: 
 
 
 
 
 
 
 Please advise.
 Regards
 Arthur
 
 
 
 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 Spark examples builds against hbase 0.94 by default.
 
 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
 
 Cheers
 
 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi, 
 
 I have tried to to run HBaseTest.scala, but I  got following errors, any 
 ideas to how to fix them?
 
 Q1) 
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples
 
 
 Q2) 
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package 
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
 
 
 Regards
 Arthur
 
 
 
 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

My bad.  Tried again, worked.


patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


Thanks!
Arthur

On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 Thanks!
 
 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 Still got errors.
 
 Regards
 Arthur
 
 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
 are good here,  but not spark-1297-v5.txt:
 
 
 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml
 
 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch: 
 
 
 
 
 
 
 Please advise.
 Regards
 Arthur
 
 
 
 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 Spark examples builds against hbase 0.94 by default.
 
 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
 
 Cheers
 
 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi, 
 
 I have tried to to run HBaseTest.scala, but I  got following errors, any 
 ideas to how to fix them?
 
 Q1) 
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples
 
 
 Q2) 
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package 
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
 
 
 Regards
 Arthur
 
 
 
 
 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

I applied the patch.

1) patched

$ patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


2) Compilation result
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.550s]
[INFO] Spark Project Core  SUCCESS [1:32.175s]
[INFO] Spark Project Bagel ... SUCCESS [10.809s]
[INFO] Spark Project GraphX .. SUCCESS [31.435s]
[INFO] Spark Project Streaming ... SUCCESS [44.518s]
[INFO] Spark Project ML Library .. SUCCESS [48.992s]
[INFO] Spark Project Tools ... SUCCESS [7.028s]
[INFO] Spark Project Catalyst  SUCCESS [40.365s]
[INFO] Spark Project SQL . SUCCESS [43.305s]
[INFO] Spark Project Hive  SUCCESS [36.464s]
[INFO] Spark Project REPL  SUCCESS [20.319s]
[INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
[INFO] Spark Project YARN Stable API . SUCCESS [19.379s]
[INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s]
[INFO] Spark Project Assembly  SUCCESS [13.822s]
[INFO] Spark Project External Twitter  SUCCESS [9.566s]
[INFO] Spark Project External Kafka .. SUCCESS [12.848s]
[INFO] Spark Project External Flume Sink . SUCCESS [10.437s]
[INFO] Spark Project External Flume .. SUCCESS [14.554s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
[INFO] Spark Project External MQTT ... SUCCESS [8.684s]
[INFO] Spark Project Examples  SUCCESS [1:31.610s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 9:41.700s
[INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
[INFO] Final Memory: 83M/1071M
[INFO] 



3) testing:  
scala> package org.apache.spark.examples
<console>:1: error: illegal start of definition
   package org.apache.spark.examples
   ^


scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.spark._
import org.apache.spark._

scala> object HBaseTest {
 | def main(args: Array[String]) {
 | val sparkConf = new SparkConf().setAppName("HBaseTest")
 | val sc = new SparkContext(sparkConf)
 | val conf = HBaseConfiguration.create()
 | // Other options for configuring scan behavior are available. More 
information available at
 | // 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
 | conf.set(TableInputFormat.INPUT_TABLE, args(0))
 | // Initialize hBase table if necessary
 | val admin = new HBaseAdmin(conf)
 | if (!admin.isTableAvailable(args(0))) {
 | val tableDesc = new HTableDescriptor(args(0))
 | admin.createTable(tableDesc)
 | }
 | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
 | classOf[org.apache.hadoop.hbase.client.Result])
 | hBaseRDD.count()
 | sc.stop()
 | }
 | }
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
defined module HBaseTest



Now I only get an error when trying to run "package org.apache.spark.examples".

Please advise.
Regards
Arthur



On 14 Sep, 2014, at 11:41 pm, Ted Yu yuzhih...@gmail.com wrote:

 I applied the patch on master branch without rejects.
 
 If you use spark 1.0.2, use pom.xml attached to the JIRA.
 
 On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!
 
 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 Still got errors.
 
 Regards
 Arthur
 
 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4

unable to create new native thread

2014-09-11 Thread arthur.hk.c...@gmail.com
Hi 

I am trying the Spark sample program “SparkPi”, and I got an "unable to 
create new native thread" error. How can I resolve this?

14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644)
14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on 
node1 (progress: 636/10)
14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 643)
14/09/11 21:36:16 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[spark-akka.actor.default-dispatcher-16] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at 
scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
at 
scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
at 
scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
at 
scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374)
at 
akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212)
at 
akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43)
at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118)
at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59)
at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.Cell$class.sendMessage(ActorCell.scala:259)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.LocalActorRef.$bang(ActorRef.scala:389)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:489)
at akka.actor.ActorCell.invoke(ActorCell.scala:455)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Regards
Arthur



Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, 

What is your SBT command and the parameters?

Arthur


On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:

 Hello,
 
 I am writing a Spark App which is already working so far.
 Now I have also started to build some unit tests, but I am running into some 
 dependency problems and I cannot find a solution right now. Perhaps someone 
 could help me.
 
 I build my Spark Project with SBT and it seems to be configured well, because 
 compiling, assembling and running the built jar with spark-submit are working 
 well.
 
 Now I started with the UnitTests, which I located under /src/test/scala.
 
 When I call test in sbt, I get the following:
 
 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered BlockManager
 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
 [trace] Stack trace suppressed: run last test:test for the full output.
 [error] Could not run test test.scala.SetSuite: 
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 [info] Run completed in 626 milliseconds.
 [info] Total number of tests run: 0
 [info] Suites: completed 0, aborted 0
 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
 [info] All tests passed.
 [error] Error during tests:
 [error] test.scala.SetSuite
 [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
 [error] Total time: 3 s, completed 10.09.2014 12:22:06
 
 last test:test gives me the following:
 
  last test:test
 [debug] Running TaskDef(test.scala.SetSuite, 
 org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
at org.apache.spark.HttpServer.start(HttpServer.scala:54)
at 
 org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.init(SparkContext.scala:202)
at test.scala.SetSuite.init(SparkTest.scala:16)
 
 I also noticed right now, that sbt run is also not working:
 
 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
 [error] (run-main-2) java.lang.NoClassDefFoundError: 
 javax/servlet/http/HttpServletResponse
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
at org.apache.spark.HttpServer.start(HttpServer.scala:54)
at 
 org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
at 
 org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at 
 org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at 
 org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at 
 org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.init(SparkContext.scala:202)
at 
 main.scala.PartialDuplicateScanner$.main(PartialDuplicateScanner.scala:29)
at main.scala.PartialDuplicateScanner.main(PartialDuplicateScanner.scala)
 
 Here is my Testprojekt.sbt file:
 
 name := "Testprojekt"
 
 version := "1.0"
 
 scalaVersion := "2.10.4"
 
 libraryDependencies ++= {
  Seq(
    "org.apache.lucene" % "lucene-core" % "4.9.0",
    "org.apache.lucene" % "lucene-analyzers-common" % "4.9.0",
    "org.apache.lucene" % "lucene-queryparser" % "4.9.0",
    ("org.apache.spark" %% "spark-core" % "1.0.2").
    exclude("org.mortbay.jetty", "servlet-api").
    exclude("commons-beanutils", "commons-beanutils-core").
    exclude("commons-collections", "commons-collections").
    exclude("commons-collections", "commons-collections").
    exclude("com.esotericsoftware.minlog", "minlog").
    exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
    exclude("org.eclipse.jetty.orbit", "javax.transaction").
    exclude("org.eclipse.jetty.orbit", "javax.servlet")
  )
 }
 
 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
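
A hedged guess at the cause, with a sketch of one possible fix: the build above excludes the servlet classes (org.eclipse.jetty.orbit / javax.servlet) from spark-core, and javax.servlet.http.HttpServletResponse is exactly what Spark's HttpServer needs at runtime, so adding a servlet API artifact back (and forking the JVM for run/test) may resolve it. The coordinates below are an assumption based on common usage, not something verified against this exact project:

libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"

// Run Spark code in a forked JVM and avoid concurrent SparkContexts in tests.
fork := true

parallelExecution in Test := false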
 
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi,

Maybe you can take a look at the following.

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

Good luck.
Arthur
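
For reference, a small self-contained sketch of joining more than two tables with Spark SQL 1.1 by registering each RDD as a temp table and using HiveQL through a HiveContext (all names and data below are made up for illustration):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.createSchemaRDD

case class Person(id: Int, name: String)
case class Order(oid: Int, personid: Int, amount: Double)
case class Payment(orderid: Int, paid: Boolean)

sc.parallelize(Seq(Person(1, "ann"), Person(2, "bob"))).registerTempTable("people")
sc.parallelize(Seq(Order(10, 1, 5.0), Order(11, 2, 7.5))).registerTempTable("orders")
sc.parallelize(Seq(Payment(10, true))).registerTempTable("payments")

// Three tables joined pairwise in a single statement.
hiveContext.sql(
  """SELECT p.name, o.amount, pay.paid
    |FROM people p
    |JOIN orders o ON o.personid = p.id
    |JOIN payments pay ON pay.orderid = o.oid""".stripMargin
).collect().foreach(println)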

On 10 Sep, 2014, at 9:09 pm, arunshell87 shell.a...@gmail.com wrote:

 
 Hi,
 
 I too had tried SQL queries with joins, MINUS , subqueries etc but they did
 not work in Spark Sql. 
 
 I did not find any documentation on what queries work and what do not work
 in Spark SQL, may be we have to wait for the Spark book to be released in
 Feb-2015.
 
 I believe you can try HiveQL in Spark for your requirement.
 
 Thanks,
 Arun
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark and Shark

2014-09-01 Thread arthur.hk.c...@gmail.com
Hi,

I have installed Spark 1.0.2 and Shark 0.9.2 on Hadoop 2.4.1 (by compiling from 
source).

spark: 1.0.2
shark: 0.9.2
hadoop: 2.4.1
java: java version “1.7.0_67”
protobuf: 2.5.0


I have tried the smoke test in Shark but got a 
“java.util.NoSuchElementException” error. Can you please advise how to fix 
this?

shark> create table x1 (a INT);
FAILED: Hive Internal Error: java.util.NoSuchElementException(null)
14/09/01 23:04:24 [main]: ERROR shark.SharkDriver: FAILED: Hive Internal Error: 
java.util.NoSuchElementException(null)
java.util.NoSuchElementException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:925)
at java.util.HashMap$ValueIterator.next(HashMap.java:950)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:8117)
at 
shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:150)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at shark.SharkDriver.compile(SharkDriver.scala:215)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:340)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:237)
at shark.SharkCliDriver.main(SharkCliDriver.scala)


spark-env.sh
#!/usr/bin/env bash
export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop}
export 
SPARK_CLASSPATH=$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar
export SPARK_WORKER_MEMORY=2g
export HADOOP_HEAPSIZE=2000

spark-defaults.conf
spark.executor.memory   2048m
spark.shuffle.spill.compress    false

shark-env.sh
#!/usr/bin/env bash
export SPARK_MEM=2g
export SHARK_MASTER_MEM=2g
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SHARK_EXEC_MODE=yarn
export 
SPARK_ASSEMBLY_JAR=$SCALA_HOME/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar
export SHARK_ASSEMBLY_JAR=target/scala-2.10/shark_2.10-0.9.2.jar
export HIVE_CONF_DIR=$HIVE_HOME/conf
export SPARK_LIBPATH=$HADOOP_HOME/lib/native/
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/
export 
SPARK_CLASSPATH=$SHARK_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar:$SHARK_HOME/lib/protobuf-java-2.5.0.jar


Regards
Arthur



Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi Michael,

Thank you so much!!

I have tried to change the following key lengths from 256 to 255 and from 767 to 
766, but it still didn’t work:
alter table COLUMNS_V2 modify column COMMENT VARCHAR(255);
alter table INDEX_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table SD_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table SERDE_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table TABLE_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table TBLS modify column OWNER VARCHAR(766);
alter table PART_COL_STATS modify column PARTITION_NAME VARCHAR(766);
alter table PARTITION_KEYS modify column PKEY_TYPE VARCHAR(766);
alter table PARTITIONS modify column PART_NAME VARCHAR(766);

I use Hadoop 2.4.1 HBase 0.98.5 Hive 0.13, trying Spark 1.0.2 and Shark 0.9.2, 
and JDK1.6_45.

Some questions:
Which Hive version is shark-0.9.2 based on? Is HBase 0.98.x OK? Is Hive 0.13.1 
OK? And which Java? (I use JDK 1.6 at the moment; it does not seem to work.)
Which Hive version is spark-1.0.2 based on? Is HBase 0.98.x OK?

Regards
Arthur 


On 30 Aug, 2014, at 1:40 am, Michael Armbrust mich...@databricks.com wrote:

 Spark SQL is based on Hive 12.  They must have changed the maximum key size 
 between 12 and 13.
 
 
 On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 
 Tried the same thing in HIVE directly without issue:
 
 HIVE:
 hive create table test_datatype2 (testbigint bigint );
 OK
 Time taken: 0.708 seconds
 
 hive drop table test_datatype2;
 OK
 Time taken: 23.272 seconds
 
 
 
 Then tried again in SPARK:
 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 14/08/29 19:33:52 INFO Configuration.deprecation: 
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
 mapreduce.reduce.speculative
 hiveContext: org.apache.spark.sql.hive.HiveContext = 
 org.apache.spark.sql.hive.HiveContext@395c7b94
 
 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 res0: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[0] at RDD at SchemaRDD.scala:104
 == Query Plan ==
 Native command: executed by Hive
 
 scala hiveContext.hql(drop table test_datatype3)
 
 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
 adding/validating class(es) : Specified key was too long; max key length is 
 767 bytes
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
 too long; max key length is 767 bytes
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 
 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
 org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
 no possible candidates
 Error(s) were found while auto-creating/validating the datastore for classes. 
 The errors are printed in the log, and are attached to this exception.
 org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
 while auto-creating/validating the datastore for classes. The errors are 
 printed in the log, and are attached to this exception.
   at 
 org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
 
 
 Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
 Specified key was too long; max key length is 767 bytes
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 14/08/29 19:34:25 ERROR DataNucleus.Datastore: An exception was thrown while 
 adding/validating class(es) : Specified key was too long; max key length is 
 767 bytes
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
 too long; max key length is 767 bytes

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

Already done but still get the same error:

(I use HIVE 0.13.1 Spark 1.0.2, Hadoop 2.4.1)

Steps:
Step 1) mysql:
 
 alter database hive character set latin1;
Step 2) HIVE:
 hive> create table test_datatype2 (testbigint bigint );
 OK
 Time taken: 0.708 seconds
 
 hive drop table test_datatype2;
 OK
 Time taken: 23.272 seconds
Step 3) scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 14/08/29 19:33:52 INFO Configuration.deprecation: 
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
 mapreduce.reduce.speculative
 hiveContext: org.apache.spark.sql.hive.HiveContext = 
 org.apache.spark.sql.hive.HiveContext@395c7b94
 scala hiveContext.hql("create table test_datatype3 (testbigint bigint)")
 res0: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[0] at RDD at SchemaRDD.scala:104
 == Query Plan ==
 Native command: executed by Hive
 scala hiveContext.hql(drop table test_datatype3)
 
 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
 adding/validating class(es) : Specified key was too long; max key length is 
 767 bytes
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
 too long; max key length is 767 bytes
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 
 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
 org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
 no possible candidates
 Error(s) were found while auto-creating/validating the datastore for 
 classes. The errors are printed in the log, and are attached to this 
 exception.
 org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
 while auto-creating/validating the datastore for classes. The errors are 
 printed in the log, and are attached to this exception.
 at 
 org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
 
 
 Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
 Specified key was too long; max key length is 767 bytes
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)



Should I use HIVE 0.12.0 instead of HIVE 0.13.1?

Regards
Arthur

On 31 Aug, 2014, at 6:01 am, Denny Lee denny.g@gmail.com wrote:

 Oh, you may be running into an issue with your MySQL setup actually, try 
 running
 
 alter database metastore_db character set latin1
 
 so that way Hive (and the Spark HiveContext) can execute properly against the 
 metastore.
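
A minimal sketch of applying that statement over JDBC instead of the mysql CLI (the database name "metastore_db", host, and credentials below are assumptions; take the real values from javax.jdo.option.ConnectionURL and friends in hive-site.xml, and have the MySQL connector jar on the classpath):

  import java.sql.DriverManager

  object FixMetastoreCharset {
    def main(args: Array[String]): Unit = {
      // Assumed connection details -- replace with the values from hive-site.xml.
      Class.forName("com.mysql.jdbc.Driver")
      val url  = "jdbc:mysql://localhost:3306/metastore_db"
      val conn = DriverManager.getConnection(url, "hiveuser", "hivepassword")
      try {
        val stmt = conn.createStatement()
        // Same charset change suggested above, so the indexes DataNucleus creates
        // stay within MySQL's 767-byte key limit.
        stmt.executeUpdate("ALTER DATABASE metastore_db CHARACTER SET latin1")
        stmt.close()
      } finally {
        conn.close()
      }
    }
  }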
 
 
 On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com 
 (arthur.hk.c...@gmail.com) wrote:
 
 Hi,
 
 
 Tried the same thing in HIVE directly without issue:
 
 HIVE:
 hive create table test_datatype2 (testbigint bigint );
 OK
 Time taken: 0.708 seconds
 
 hive drop table test_datatype2;
 OK
 Time taken: 23.272 seconds
 
 
 
 Then tried again in SPARK:
 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 14/08/29 19:33:52 INFO Configuration.deprecation: 
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
 mapreduce.reduce.speculative
 hiveContext: org.apache.spark.sql.hive.HiveContext = 
 org.apache.spark.sql.hive.HiveContext@395c7b94
 
 scala hiveContext.hql("create table test_datatype3 (testbigint bigint)")
 res0: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[0] at RDD at SchemaRDD.scala:104
 == Query Plan ==
 Native command: executed by Hive
 
 scala hiveContext.hql(drop table test_datatype3)
 
 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
 adding/validating class(es) : Specified key was too long; max key length is 
 767 bytes
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
 too long; max key length is 767 bytes
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 
 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
 org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
 no possible candidates
 Error(s) were found while auto-creating/validating the datastore for 
 classes. The errors are printed in the log, and are attached to this 
 exception.
 org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
 while auto-creating/validating the datastore for classes. The errors are 
 printed in the log, and are attached to this exception.
 at 
 org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
 
 
 Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
 Specified key was too long; max key length is 767 bytes
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 
 14/08/29 19:34:17 INFO

Spark Master/Slave and HA

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

I have few questions about Spark Master and Slave setup:

Here, I have 5 Hadoop nodes (n1, n2, n3, n4, and n5 respectively); at the 
moment I run Spark on these nodes:
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Questions:
Q1: If I set n1 as both Spark Master and Spark Slave, I cannot start the Spark 
cluster. Does it mean that, unlike Hadoop, I cannot use the same machine as 
both MASTER and SLAVE in Spark?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master, Spark Slave (failed to start Spark)
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Q2: I am planning Spark HA. What if I use n2 as both Spark Standby Master and 
Spark Slave? Is Spark allowed to run a Standby Master and a Slave on the same machine?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Standby Master, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Q3: Does the Spark Master node do actual computation work like a worker, or is it 
just a pure monitoring node? 

Regards
Arthur
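
For Q2, a common standalone-HA arrangement is two masters (for example on n1 and n2) coordinated through ZooKeeper, with drivers and workers given the full list of master URLs. A minimal client-side sketch (assuming masters on n1 and n2 at the default port 7077, and that HA recovery is already configured on the masters):

  import org.apache.spark.{SparkConf, SparkContext}

  object HaClientSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("ha-client-sketch")
        // List every master; the app registers with whichever one is currently
        // active and fails over if leadership changes.
        .setMaster("spark://n1:7077,n2:7077")
      val sc = new SparkContext(conf)
      println(sc.parallelize(1 to 10).count())
      sc.stop()
    }
  }

On Q3: in standalone mode the Master only schedules applications and tracks workers; the actual computation runs in executors launched by Worker daemons, so a node contributes CPU and memory to jobs only if a Worker runs on it (possibly alongside the Master).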

Spark and Shark Node: RAM Allocation

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

Is there any formula to calculate proper RAM allocation values for Spark and 
Shark based on Physical RAM, HADOOP and HBASE RAM usage?
e.g. if a node has 32GB physical RAM


spark-defaults.conf
spark.executor.memory   ?g

spark-env.sh
export SPARK_WORKER_MEMORY=?
export HADOOP_HEAPSIZE=?


shark-env.sh
export SPARK_MEM=?g
export SHARK_MASTER_MEM=?g

spark-defaults.conf
spark.executor.memory   ?g


Regards
Arthur
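
There is no single official formula; one rough rule of thumb (all of the reserve figures below are assumptions for illustration, not recommendations) is to subtract the OS and the co-located Hadoop/HBase daemon heaps from physical RAM, give the remainder to the Spark worker, and size the executor slightly below that for JVM overhead:

  object RamBudget {
    def main(args: Array[String]): Unit = {
      val physicalGb   = 32  // node RAM from the example above
      val osReserveGb  = 2   // assumed: OS and miscellaneous processes
      val hadoopHeapGb = 4   // assumed: DataNode + NodeManager heaps (HADOOP_HEAPSIZE)
      val hbaseHeapGb  = 8   // assumed: HBase RegionServer heap
      val workerGb     = physicalGb - osReserveGb - hadoopHeapGb - hbaseHeapGb  // 18
      val executorGb   = workerGb - 2  // leave headroom for the worker daemon itself
      println("export SPARK_WORKER_MEMORY=" + workerGb + "g")  // for spark-env.sh
      println("spark.executor.memory " + executorGb + "g")     // for spark-defaults.conf
    }
  }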




Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread arthur.hk.c...@gmail.com
Hi,


Tried the same thing in HIVE directly without issue:

HIVE:
hive create table test_datatype2 (testbigint bigint );
OK
Time taken: 0.708 seconds

hive drop table test_datatype2;
OK
Time taken: 23.272 seconds



Then tried again in SPARK:
scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
14/08/29 19:33:52 INFO Configuration.deprecation: 
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
mapreduce.reduce.speculative
hiveContext: org.apache.spark.sql.hive.HiveContext = 
org.apache.spark.sql.hive.HiveContext@395c7b94

scala hiveContext.hql("create table test_datatype3 (testbigint bigint)")
res0: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala hiveContext.hql(drop table test_datatype3)

14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in no 
possible candidates
Error(s) were found while auto-creating/validating the datastore for classes. 
The errors are printed in the log, and are attached to this exception.
org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found while 
auto-creating/validating the datastore for classes. The errors are printed in 
the log, and are attached to this exception.
at 
org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)


Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified 
key was too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
14/08/29 19:34:25 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)


Can anyone please help?

Regards
Arthur


On 29 Aug, 2014, at 12:47 pm, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 (Please ignore if duplicated) 
 
 
 Hi,
 
 I use Spark 1.0.2 with Hive 0.13.1
 
 I have already set the hive mysql database to latine1; 
 
 mysql:
 alter database hive character set latin1;
 
 Spark:
 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 scala hiveContext.hql(create table test_datatype1 (testbigint bigint ))
 scala hiveContext.hql(drop table test_datatype1)
 
 
 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while 
 adding

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
] | \- javax.activation:activation:jar:1.1:compile
[INFO] +- org.apache.hbase:hbase-hadoop-compat:jar:0.98.5-hadoop1:compile
[INFO] |  \- org.apache.commons:commons-math:jar:2.1:compile
[INFO] +- 
org.apache.hbase:hbase-hadoop-compat:test-jar:tests:0.98.5-hadoop1:test
[INFO] +- com.twitter:algebird-core_2.10:jar:0.1.11:compile
[INFO] |  \- com.googlecode.javaewah:JavaEWAH:jar:0.6.6:compile
[INFO] +- org.scalatest:scalatest_2.10:jar:2.1.5:test
[INFO] |  \- org.scala-lang:scala-reflect:jar:2.10.4:compile
[INFO] +- org.scalacheck:scalacheck_2.10:jar:1.11.3:test
[INFO] |  \- org.scala-sbt:test-interface:jar:1.0:test
[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] |  +- net.jpountz.lz4:lz4:jar:1.1.0:compile
[INFO] |  +- org.antlr:antlr:jar:3.2:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
[INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
[INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
[INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile
[INFO] |  +- org.apache.cassandra:cassandra-thrift:jar:1.2.6:compile
[INFO] |  \- com.github.stephenc:jamm:jar:0.2.5:compile
[INFO] \- com.github.scopt:scopt_2.10:jar:3.2.0:compile
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 2.151s
[INFO] Finished at: Thu Aug 28 18:12:45 HKT 2014
[INFO] Final Memory: 17M/479M
[INFO] 

Regards
Arthur

On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote:

 I forgot to include '-Dhadoop.version=2.4.1' in the command below.
 The modified command passed.

 You can verify the dependence on hbase 0.98 through this command:
 mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt

 Cheers

 On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote:
 Looks like the patch given by that URL only had the last commit.
 I have attached pom.xml for spark-1.0.2 to SPARK-1297
 You can download it and replace examples/pom.xml with the downloaded pom

 I am running this command locally:
 mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

 Cheers

 On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 Thanks.

 Tried [patch -p1 -i 1893.patch]  (Hunk #1 FAILED at 45.)
 Is this normal?

 Regards
 Arthur

 patch -p1 -i 1893.patch
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file examples/pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 succeeded at 122 (offset -49 lines).
 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 succeeded at 122 (offset -40 lines).
 Hunk #2 succeeded at 195 (offset -40 lines).

 On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:

 Can you use this command ?

 patch -p1 -i 1893.patch

 Cheers

 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote:

 Hi Ted,

 I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you 
 please advise how to get thru this error? or is my spark-1.0.2 source not the correct one?

 Regards
 Arthur

 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:

 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch

 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml

 Cheers

 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I tried to start Spark but failed:

$ ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to 
/mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out
failed to launch org.apache.spark.deploy.master.Master:
  Failed to find Spark assembly in 
/mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/

$ ll assembly/
total 20
-rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml
-rw-rw-r--. 1 hduser hadoop   507 Jul 26 05:50 README
drwxrwxr-x. 4 hduser hadoop  4096 Jul 26 05:50 src



Regards
Arthur



On 28 Aug, 2014, at 6:19 pm, Ted Yu yuzhih...@gmail.com wrote:

 I see 0.98.5 in dep.txt
 
 You should be good to go.
 
 
 On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 tried 
 mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests 
 dependency:tree > dep.txt
 
 Attached the dep. txt for your information. 
 
 
 Regards
 Arthur
 
 On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 I forgot to include '-Dhadoop.version=2.4.1' in the command below.
 
 The modified command passed.
 
 You can verify the dependence on hbase 0.98 through this command:
 
 mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests 
 dependency:tree > dep.txt
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote:
 Looks like the patch given by that URL only had the last commit.
 
 I have attached pom.xml for spark-1.0.2 to SPARK-1297
 You can download it and replace examples/pom.xml with the downloaded pom
 
 I am running this command locally:
 
 mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted, 
 
 Thanks. 
 
 Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.)
 Is this normal?
 
 Regards
 Arthur
 
 
 patch -p1 -i 1893.patch
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file examples/pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 succeeded at 122 (offset -49 lines).
 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 succeeded at 122 (offset -40 lines).
 Hunk #2 succeeded at 195 (offset -40 lines).
 
 
 On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:
 
 Can you use this command ?
 
 patch -p1 -i 1893.patch
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 I tried the following steps to apply the patch 1893 but got Hunk FAILED, 
 can you please advise how to get thru this error? or is my spark-1.0.2 
 source not the correct one?
 
 Regards
 Arthur
  
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:
 
 
 
 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:
 
 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch
 
 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the 
 pom.xml
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 Thank you so much!!
 
 As I am new to Spark, can you please advise the steps about how to apply 
 this patch to my spark-1.0.2 source folder?
 
 Regards
 Arthur
 
 
 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:
 
 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)

SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I have just tried to apply the patch of SPARK-1297: 
https://issues.apache.org/jira/browse/SPARK-1297

There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
respectively.

When applying the 2nd one, I got Hunk #1 FAILED at 45

Can you please advise how to fix it in order to make the compilation of Spark 
Project Examples success?
(Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)

Regards
Arthur



patch -p1 -i spark-1297-v4.txt 
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

below is the content of examples/pom.xml.rej:
+++ examples/pom.xml
@@ -45,6 +45,39 @@
     </dependency>
   </dependencies>
 </profile>
+<profile>
+  <id>hbase-hadoop2</id>
+  <activation>
+    <property>
+      <name>hbase.profile</name>
+      <value>hadoop2</value>
+    </property>
+  </activation>
+  <properties>
+    <protobuf.version>2.5.0</protobuf.version>
+    <hbase.version>0.98.4-hadoop2</hbase.version>
+  </properties>
+  <dependencyManagement>
+    <dependencies>
+    </dependencies>
+  </dependencyManagement>
+</profile>
+<profile>
+  <id>hbase-hadoop1</id>
+  <activation>
+    <property>
+      <name>!hbase.profile</name>
+    </property>
+  </activation>
+  <properties>
+    <hbase.version>0.98.4-hadoop1</hbase.version>
+  </properties>
+  <dependencyManagement>
+    <dependencies>
+    </dependencies>
+  </dependencyManagement>
+</profile>
+
   </profiles>

   <dependencies>


This caused the related compilation to fail:
[INFO] Spark Project Examples  FAILURE [0.102s]




Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

patch -p1 -i spark-1297-v5.txt 
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- docs/building-with-maven.md
|+++ docs/building-with-maven.md
--
File to patch: 

Please advise
Regards
Arthur



On 29 Aug, 2014, at 12:50 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 I have just tried to apply the patch of SPARK-1297: 
 https://issues.apache.org/jira/browse/SPARK-1297
 
 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
 respectively.
 
 When applying the 2nd one, I got Hunk #1 FAILED at 45
 
 Can you please advise how to fix it in order to make the compilation of Spark 
 Project Examples success?
 (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)
 
 Regards
 Arthur
 
 
 
 patch -p1 -i spark-1297-v4.txt 
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 below is the content of examples/pom.xml.rej:
 +++ examples/pom.xml
 @@ -45,6 +45,39 @@
      </dependency>
    </dependencies>
  </profile>
 +<profile>
 +  <id>hbase-hadoop2</id>
 +  <activation>
 +    <property>
 +      <name>hbase.profile</name>
 +      <value>hadoop2</value>
 +    </property>
 +  </activation>
 +  <properties>
 +    <protobuf.version>2.5.0</protobuf.version>
 +    <hbase.version>0.98.4-hadoop2</hbase.version>
 +  </properties>
 +  <dependencyManagement>
 +    <dependencies>
 +    </dependencies>
 +  </dependencyManagement>
 +</profile>
 +<profile>
 +  <id>hbase-hadoop1</id>
 +  <activation>
 +    <property>
 +      <name>!hbase.profile</name>
 +    </property>
 +  </activation>
 +  <properties>
 +    <hbase.version>0.98.4-hadoop1</hbase.version>
 +  </properties>
 +  <dependencyManagement>
 +    <dependencies>
 +    </dependencies>
 +  </dependencyManagement>
 +</profile>
 +
    </profiles>

    <dependencies>
 
 
 This caused the related compilation to fail:
 [INFO] Spark Project Examples  FAILURE [0.102s]
 
 



Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi Ted,

I downloaded pom.xml to examples directory.
It works, thanks!!

Regards
Arthur


[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [2.119s]
[INFO] Spark Project Core  SUCCESS [1:27.100s]
[INFO] Spark Project Bagel ... SUCCESS [10.261s]
[INFO] Spark Project GraphX .. SUCCESS [31.332s]
[INFO] Spark Project ML Library .. SUCCESS [35.226s]
[INFO] Spark Project Streaming ... SUCCESS [39.135s]
[INFO] Spark Project Tools ... SUCCESS [6.469s]
[INFO] Spark Project Catalyst  SUCCESS [36.521s]
[INFO] Spark Project SQL . SUCCESS [35.488s]
[INFO] Spark Project Hive  SUCCESS [35.296s]
[INFO] Spark Project REPL  SUCCESS [18.668s]
[INFO] Spark Project YARN Parent POM . SUCCESS [0.583s]
[INFO] Spark Project YARN Stable API . SUCCESS [15.989s]
[INFO] Spark Project Assembly  SUCCESS [11.497s]
[INFO] Spark Project External Twitter  SUCCESS [8.777s]
[INFO] Spark Project External Kafka .. SUCCESS [9.688s]
[INFO] Spark Project External Flume .. SUCCESS [10.411s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.511s]
[INFO] Spark Project External MQTT ... SUCCESS [8.451s]
[INFO] Spark Project Examples  SUCCESS [1:40.240s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 8:33.350s
[INFO] Finished at: Fri Aug 29 01:58:00 HKT 2014
[INFO] Final Memory: 82M/1086M
[INFO] 


On 29 Aug, 2014, at 1:36 am, Ted Yu yuzhih...@gmail.com wrote:

 bq.  Spark 1.0.2
 
 For the above release, you can download pom.xml attached to the JIRA and 
 place it in examples directory
 
 I verified that the build against 0.98.4 worked using this command:
 
 mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 
 -DskipTests clean package
 
 Patch v5 is @ level 0 - you don't need to use -p1 in the patch command.
 
 Cheers
 
 
 On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 I have just tried to apply the patch of SPARK-1297: 
 https://issues.apache.org/jira/browse/SPARK-1297
 
 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
 respectively.
 
 When applying the 2nd one, I got Hunk #1 FAILED at 45
 
 Can you please advise how to fix it in order to make the compilation of Spark 
 Project Examples success?
 (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)
 
 Regards
 Arthur
 
 
 
 patch -p1 -i spark-1297-v4.txt 
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 succeeded at 94 (offset -16 lines).
 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 below is the content of examples/pom.xml.rej:
 +++ examples/pom.xml
 @@ -45,6 +45,39 @@
      </dependency>
    </dependencies>
  </profile>
 +<profile>
 +  <id>hbase-hadoop2</id>
 +  <activation>
 +    <property>
 +      <name>hbase.profile</name>
 +      <value>hadoop2</value>
 +    </property>
 +  </activation>
 +  <properties>
 +    <protobuf.version>2.5.0</protobuf.version>
 +    <hbase.version>0.98.4-hadoop2</hbase.version>
 +  </properties>
 +  <dependencyManagement>
 +    <dependencies>
 +    </dependencies>
 +  </dependencyManagement>
 +</profile>
 +<profile>
 +  <id>hbase-hadoop1</id>
 +  <activation>
 +    <property>
 +      <name>!hbase.profile</name>
 +    </property>
 +  </activation>
 +  <properties>
 +    <hbase.version>0.98.4-hadoop1</hbase.version>
 +  </properties>
 +  <dependencyManagement>
 +    <dependencies>
 +    </dependencies>
 +  </dependencyManagement>
 +</profile>
 +
    </profiles>

    <dependencies>
 
 
 This caused the related compilation to fail:
 [INFO] Spark Project Examples  FAILURE [0.102s]
 
 
 



org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
HBase.
With default setting in Spark 1.0.2, when trying to load a file I got Class 
org.apache.hadoop.io.compress.SnappyCodec not found

Can you please advise how to enable snappy in Spark?

Regards
Arthur


scala inFILE.first()
java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.RDD.take(RDD.scala:983)
at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
at $iwC$$iwC$$iwC$$iwC.init(console:15)
at $iwC$$iwC$$iwC.init(console:20)
at $iwC$$iwC.init(console:22)
at $iwC.init(console:24)
at init(console:26)
at .init(console:30)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 55 more
Caused by: java.lang.IllegalArgumentException: Compression codec   
org.apache.hadoop.io.compress.SnappyCodec not found.
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at 

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

my check native result:

hadoop checknative
14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize 
native-bzip2 library system-native, will use pure-Java version
14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded  initialized 
native-zlib library
Native library checking:
hadoop: true 
/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: true 
/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1
lz4:true revision:99
bzip2:  false

Any idea how to enable or disable  snappy in Spark?

Regards
Arthur


On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
 HBase.
 With default setting in Spark 1.0.2, when trying to load a file I got Class 
 org.apache.hadoop.io.compress.SnappyCodec not found
 
 Can you please advise how to enable snappy in Spark?
 
 Regards
 Arthur
 
 
 scala inFILE.first()
 java.lang.RuntimeException: Error in configuring object
   at 
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
   at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.RDD.take(RDD.scala:983)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
   at $iwC$$iwC$$iwC$$iwC.init(console:15)
   at $iwC$$iwC$$iwC.init(console:20)
   at $iwC$$iwC.init(console:22)
   at $iwC.init(console:24)
   at init(console:26)
   at .init(console:30)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.reflect.InvocationTargetException

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 55 more
Caused by: java.lang.IllegalArgumentException: Compression codec   
org.apache.hadoop.io.compress.GzipCodec not found.
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:175)
at 
org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 60 more
Caused by: java.lang.ClassNotFoundException: Class   
org.apache.hadoop.io.compress.GzipCodec not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
... 62 more


Any idea to fix this issue?
Regards
Arthur


On 29 Aug, 2014, at 2:58 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 my check native result:
 
 hadoop checknative
 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize 
 native-bzip2 library system-native, will use pure-Java version
 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded  initialized 
 native-zlib library
 Native library checking:
 hadoop: true 
 /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so
 zlib:   true /lib64/libz.so.1
 snappy: true 
 /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1
 lz4:true revision:99
 bzip2:  false
 
 Any idea how to enable or disable  snappy in Spark?
 
 Regards
 Arthur
 
 
 On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 
 Hi,
 
 I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
 HBase.
 With default setting in Spark 1.0.2, when trying to load a file I got Class 
 org.apache.hadoop.io.compress.SnappyCodec not found
 
 Can you please advise how to enable snappy in Spark?
 
 Regards
 Arthur
 
 
 scala inFILE.first()
 java.lang.RuntimeException: Error in configuring object
  at 
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
  at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
  at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
  at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.RDD.take(RDD.scala:983)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
  at $iwC$$iwC$$iwC$$iwC.init(console:15)
  at $iwC$$iwC$$iwC.init(console:20)
  at $iwC$$iwC.init(console:22)
  at $iwC.init(console:24)
  at init(console:26)
  at .init(console:30)
  at .clinit(console)
  at .init(console:7)
  at .clinit(console)
  at $print(console)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
  at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
  at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
  at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
  at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
  at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
  at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
  at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,
 
I fixed the issue by copying libsnappy.so into the Java JRE.

Regards
Arthur
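
An alternative to copying libsnappy.so into the JRE is to point Spark at Hadoop's native library directory. A minimal sketch (the directory comes from the "hadoop checknative" output earlier in this thread; the input path is hypothetical, and in practice the two extraLibraryPath settings are usually placed in spark-defaults.conf or passed via spark-submit so they take effect before the JVMs start):

  import org.apache.spark.{SparkConf, SparkContext}

  object SnappyNativePath {
    def main(args: Array[String]): Unit = {
      val nativeDir = "/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64"
      val conf = new SparkConf()
        .setAppName("snappy-native-path")
        .set("spark.executor.extraLibraryPath", nativeDir)  // executors load libsnappy from here
        .set("spark.driver.extraLibraryPath", nativeDir)    // only effective if set before the driver JVM starts
      val sc = new SparkContext(conf)
      // Reading a Snappy-compressed file should now resolve SnappyCodec.
      println(sc.textFile("hdfs:///tmp/sample_snappy_file").first())
      sc.stop()
    }
  }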

On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 If I change my etc/hadoop/core-site.xml 
 
 from 
 <property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.SnappyCodec,
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec
  </value>
 </property>
 
 to 
 <property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.SnappyCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec
  </value>
 </property>
 
 
 
 and run the test again, I found this time it cannot find 
 org.apache.hadoop.io.compress.GzipCodec
 
 scala inFILE.first()
 java.lang.RuntimeException: Error in configuring object
   at 
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
   at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.RDD.take(RDD.scala:983)
   at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
   at $iwC$$iwC$$iwC$$iwC.init(console:15)
   at $iwC$$iwC$$iwC.init(console:20)
   at $iwC$$iwC.init(console:22)
   at $iwC.init(console:24)
   at init(console:26)
   at .init(console:30)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke

Spark Hive max key length is 767 bytes

2014-08-28 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) 


Hi,

I use Spark 1.0.2 with Hive 0.13.1

I have already set the hive mysql database to latine1; 

mysql:
alter database hive character set latin1;

Spark:
scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala hiveContext.hql(create table test_datatype1 (testbigint bigint ))
scala hiveContext.hql(drop table test_datatype1)


14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:408)
at com.mysql.jdbc.Util.getInstance(Util.java:383)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4226)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4158)

Can you please advise what would be wrong?

Regards
Arthur

Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 
0.98, 

My steps:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2

edit project/SparkBuild.scala, set HBASE_VERSION
  // HBase version; set as appropriate.
  val HBASE_VERSION = "0.98.2"


edit pom.xml with the following values
<hadoop.version>2.4.1</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<yarn.version>${hadoop.version}</yarn.version>
<hbase.version>0.98.5</hbase.version>
<zookeeper.version>3.4.6</zookeeper.version>
<hive.version>0.13.1</hive.version>


SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2

Can you please advise how to compile Spark 1.0.2 with HBase 0.98? Or should I 
set HBASE_VERSION back to "0.94.6"?

Regards
Arthur




[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.hbase#hbase;0.98.2: not found
[warn]  ::

sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not 
found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:125)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
[error] (examples/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.hbase#hbase;0.98.2: not found
[error] 
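
The resolution failure above is expected for any 0.98.x version string: since HBase 0.96 the project is published as per-module artifacts (hbase-common, hbase-client, hbase-server, ...) with versions suffixed -hadoop1/-hadoop2, so there is no org.apache.hbase#hbase;0.98.2 jar to resolve. A rough sketch of what the sbt side would have to declare instead (an illustration under those assumptions, not the actual SPARK-1297 change, which reworks the Maven profiles in examples/pom.xml):

  import sbt._

  object HBaseDeps {
    // Assumed coordinates; 0.98.x publishes split modules rather than "hbase" itself.
    val HBASE_VERSION = "0.98.5-hadoop2"

    val hbaseDependencies: Seq[ModuleID] = Seq(
      "org.apache.hbase" % "hbase-common" % HBASE_VERSION,
      "org.apache.hbase" % "hbase-client" % HBASE_VERSION,
      "org.apache.hbase" % "hbase-server" % HBASE_VERSION
    )
  }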

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

Thank you so much!!

As I am new to Spark, can you please advise the steps about how to apply this 
patch to my spark-1.0.2 source folder?

Regards
Arthur


On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:

 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with the following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? Or should I 
 set HBASE_VERSION back to "0.94.6"?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at 
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at 
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at 
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
 at 
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
 at 
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
 at 
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
 at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
 at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
 at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
 at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
 at sbt.std.Transform$$anon$4.work(System.scala:64)
 at 
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at 
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
 at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
 at sbt.Execute.work(Execute.scala:244)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
 at 
 sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
 at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run
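
For context on the unresolved dependency above: HBase releases from 0.96 onward are published to Maven as split modules (hbase-client, hbase-common, hbase-server, and so on) rather than a single all-in-one hbase jar, and the 0.98.x artifacts carry hadoop-qualified versions such as 0.98.5-hadoop2, which is why a plain org.apache.hbase#hbase;0.98.2 cannot be resolved. SPARK-1297 / PR 1893 reworks the HBase dependency in examples/pom.xml along these lines. The snippet below is only a rough sketch of that kind of change, not the exact diff; the module list and version are illustrative:

<!-- Rough sketch for examples/pom.xml (not the exact PR 1893 diff): declare the
     split HBase 0.98 modules instead of a single all-in-one hbase artifact.
     Assumes the hbase.version property is set to 0.98.5-hadoop2, as Ted
     suggests later in the thread. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>${hbase.version}</version>
</dependency>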

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

I tried the following steps to apply patch 1893 but got "Hunk FAILED". Can 
you please advise how to get through this error, or is my spark-1.0.2 source 
not the correct one?

Regards
Arthur
 
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2
wget https://github.com/apache/spark/pull/1893.patch
patch < 1893.patch
patching file pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
patching file pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 FAILED at 171.
3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
can't find file to patch at input line 267
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--
|
|From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
|From: tedyu yuzhih...@gmail.com
|Date: Mon, 11 Aug 2014 15:57:46 -0700
|Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
| description to building-with-maven.md
|
|---
| docs/building-with-maven.md | 3 +++
| 1 file changed, 3 insertions(+)
|
|diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- a/docs/building-with-maven.md
|+++ b/docs/building-with-maven.md
--
File to patch:



On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:

 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch
 
 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 Thank you so much!!
 
 As I am new to Spark, can you please advise the steps about how to apply this 
 patch to my spark-1.0.2 source folder?
 
 Regards
 Arthur
 
 
 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:
 
 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
 I set HBASE_VERSION back to "0.94.6"?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
 at 
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
 at 
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
 at xsbt.boot.Using$.withResource(Using.scala:11)
 at xsbt.boot.Using$.apply(Using.scala:10)
 at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
 at sbt.IvySbt.withIvy(Ivy.scala:101)
 at sbt.IvySbt.withIvy(Ivy.scala:97)
 at sbt.IvySbt$Module.withModule(Ivy.scala:116)
 at sbt.IvyActions$.update(IvyActions.scala:125)
 at 
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, 

Thanks. 

Tried "patch -p1 -i 1893.patch" but still got "Hunk #1 FAILED at 45."
Is this normal?

Regards
Arthur


patch -p1 -i 1893.patch
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file examples/pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 succeeded at 122 (offset -49 lines).
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 succeeded at 122 (offset -40 lines).
Hunk #2 succeeded at 195 (offset -40 lines).
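
The leftover hunk failures here are expected if the pull request was generated against a newer development branch than the 1.0.2 sources: the surrounding lines in examples/pom.xml have shifted, so patch writes the conflicting pieces to examples/pom.xml.rej. Assuming those rejects only touch the HBase entries, one fallback is to merge them by hand, pinning the HBase version property to the hadoop2-qualified release and replacing the single hbase dependency with the split modules from the earlier sketch:

<!-- Hypothetical hand-merge after reviewing examples/pom.xml.rej; the exact
     location depends on the rejected hunks. 0.98.5-hadoop2 follows Ted's
     suggestion in this thread. -->
<hbase.version>0.98.5-hadoop2</hbase.version>

together with the hbase-common / hbase-client / hbase-server dependency entries shown earlier.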


On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote:

 Can you use this command ?
 
 patch -p1 -i 1893.patch
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 I tried the following steps to apply patch 1893 but got "Hunk FAILED". Can 
 you please advise how to get through this error, or is my spark-1.0.2 source 
 not the correct one?
 
 Regards
 Arthur
  
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 wget https://github.com/apache/spark/pull/1893.patch
 patch < 1893.patch
 patching file pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
 patching file pom.xml
 Hunk #1 FAILED at 54.
 Hunk #2 FAILED at 72.
 Hunk #3 FAILED at 171.
 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
 can't find file to patch at input line 267
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |
 |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
 |From: tedyu yuzhih...@gmail.com
 |Date: Mon, 11 Aug 2014 15:57:46 -0700
 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
 | description to building-with-maven.md
 |
 |---
 | docs/building-with-maven.md | 3 +++
 | 1 file changed, 3 insertions(+)
 |
 |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- a/docs/building-with-maven.md
 |+++ b/docs/building-with-maven.md
 --
 File to patch:
 
 
 
 On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote:
 
 You can get the patch from this URL:
 https://github.com/apache/spark/pull/1893.patch
 
 BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
 
 Cheers
 
 
 On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi Ted,
 
 Thank you so much!!
 
 As I am new to Spark, can you please advise the steps about how to apply 
 this patch to my spark-1.0.2 source folder?
 
 Regards
 Arthur
 
 
 On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote:
 
 See SPARK-1297
 
 The pull request is here:
 https://github.com/apache/spark/pull/1893
 
 
 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please 
 ignore if duplicated)
 
 
 Hi,
 
 I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
 HBase 0.98,
 
 My steps:
 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
 tar -vxf spark-1.0.2.tgz
 cd spark-1.0.2
 
 edit project/SparkBuild.scala, set HBASE_VERSION
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.98.2"
 
 
 edit pom.xml with following values
 <hadoop.version>2.4.1</hadoop.version>
 <protobuf.version>2.5.0</protobuf.version>
 <yarn.version>${hadoop.version}</yarn.version>
 <hbase.version>0.98.5</hbase.version>
 <zookeeper.version>3.4.6</zookeeper.version>
 <hive.version>0.13.1</hive.version>
 
 
 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
 but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2
 
 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
 I set HBASE_VERSION back to "0.94.6"?
 
 Regards
 Arthur
 
 
 
 
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hbase#hbase;0.98.2: not found
 [warn]  ::
 
 sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
 not found
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
 at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
 at sbt.IvySbt$$anon$3.call(Ivy.scala:60

Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1.

I just tried to compile Spark 1.0.2 but got an error on Spark Project Hive. Can 
you please advise which repository has 
org.spark-project.hive:hive-metastore:jar:0.13.1?


FYI, below is my repository setting in Maven, which may be outdated:
<repository>
   <id>nexus</id>
   <name>local private nexus</name>
   <url>http://maven.oschina.net/content/groups/public/</url>
   <releases>
     <enabled>true</enabled>
   </releases>
   <snapshots>
     <enabled>false</enabled>
   </snapshots>
</repository>


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.892s]
[INFO] Spark Project Core  SUCCESS [1:33.698s]
[INFO] Spark Project Bagel ... SUCCESS [12.270s]
[INFO] Spark Project GraphX .. SUCCESS [2:16.343s]
[INFO] Spark Project ML Library .. SUCCESS [4:18.495s]
[INFO] Spark Project Streaming ... SUCCESS [39.765s]
[INFO] Spark Project Tools ... SUCCESS [9.173s]
[INFO] Spark Project Catalyst  SUCCESS [35.462s]
[INFO] Spark Project SQL . SUCCESS [1:16.118s]
[INFO] Spark Project Hive  FAILURE [1:36.816s]
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Parent POM . SKIPPED
[INFO] Spark Project YARN Stable API . SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 12:40.616s
[INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014
[INFO] Final Memory: 37M/826M
[INFO] 
[ERROR] Failed to execute goal on project spark-hive_2.10: Could not resolve 
dependencies for project org.apache.spark:spark-hive_2.10:jar:1.0.2: The 
following artifacts could not be resolved: 
org.spark-project.hive:hive-metastore:jar:0.13.1, 
org.spark-project.hive:hive-exec:jar:0.13.1, 
org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact 
org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc 
(http://maven.oschina.net/content/groups/public/) - [Help 1]


Regards
Arthur
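
One note on the missing artifact: the Spark 1.0.2 release builds its Hive module against the forked org.spark-project.hive artifacts at version 0.12.0, and a 0.13.1 build of those artifacts was not available for that release, so no repository setting will make hive-metastore:jar:0.13.1 resolve (Hive 0.13.1 support arrived in a later Spark release). Assuming the hive.version override above is the only change to the Hive coordinates, a minimal fix is to leave it at the release default:

<!-- Sketch for pom.xml: keep the Hive version Spark 1.0.2 ships with.
     Assumes the 0.13.1 override above is the only Hive-related change. -->
<hive.version>0.12.0</hive.version>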