Hi,
Created issue: https://issues.apache.org/jira/browse/SPARK-6707
I would really appreciate ideas/views/opinions on this feature.
-- Ankur Chauhan
On 03/04/2015 13:23, Tim Chen wrote:
Hi Ankur,
There isn't a way to do that yet, but it's
Hi, I have tried with parallelize but I got the exception below:
java.io.NotSerializableException: pacific.dr.VendorRecord
Here is my code
List<VendorRecord> vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);
On 2
How is spark faster than MR when data is in disk in both cases?
It shouldn't be too bad - the pertinent migration notes are here:
http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark
for pre-1.0 and here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13
for
Hi Spark Users,
I'm testing the new Parquet partition discovery feature in 1.3.
I have 2 subfolders, each with 800 rows.
/data/table1/key=1
/data/table1/key=2
In spark-shell, run this command:
val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet")
t.count
It shows 1600
Reduce *spark.sql.shuffle.partitions* from its default of 200 to the total
number of cores.
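For example, a sketch for Spark 1.3 in spark-shell (the value 8 assumes an 8-core machine):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// or equivalently via SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=8")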
If data is on HDFS, it is not read any more or less quickly by either
framework. Both in fact use the same logic to exploit locality, and
must read and deserialize the data either way. I don't think anyone
claims otherwise, though.
Spark can be faster in a multi-stage operation, which would require
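To illustrate the multi-stage point, here is a hedged sketch (the path and filter strings are made up): the second action reuses the cached RDD from memory rather than re-reading HDFS, which MR's job-per-pass model cannot do.

val lines = sc.textFile("hdfs:///data/logs").cache()
val errors = lines.filter(_.contains("ERROR")).count()   // first action reads HDFS
val warnings = lines.filter(_.contains("WARN")).count()  // second action hits the cache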
Hi,
I am trying to run the following command in the Movie Recommendation example
provided by the ampcamp tutorial
Command: sbt package run /movielens/medium
Exception: sbt.TrapExitSecurityException thrown from the
UncaughtExceptionHandler in thread run-main-0
java.lang.RuntimeException:
Here is my conf object, passed as the first parameter of the API,
but I want to pass multiple scans, meaning I have 4 criteria for START ROW
and STOP ROW in the same table.
Using the code below I can get the result for one STARTROW and ENDROW.
Configuration conf = DBConfiguration.getConf();
// int scannerTimeout =
Avoiding maintaining a separate Hive version was one of the initial
purposes of Spark SQL. (We had once done this for Shark.) The
org.spark-project.hive:hive-0.13.1a artifact only cleans up some 3rd-party
dependencies to avoid dependency hell in Spark. This artifact is otherwise exactly
the same as Hive 0.13.1
I think this is a Spark SQL bug that dates back to at least 1.1.0.
The json_tuple function is implemented as
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The
ClassNotFoundException should complain with the class name rather than
the UDTF function name.
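For reference, here is a sketch of a typical query that exercises this code path (the table and column names are made up):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT jt.a, jt.b FROM logs LATERAL VIEW json_tuple(logs.json, 'a', 'b') jt AS a, b")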
The problematic line
We have a custom version/build of Spark Streaming that does the nested S3 lookups
faster (it uses native S3 APIs). You can find the source code over here:
https://github.com/sigmoidanalytics/spark-modified. In particular, the
changes from here
Hi All,
Can we get the results of multiple scans
from JavaSparkContext.newAPIHadoopRDD with HBase?
This method's first parameter takes a configuration object where I have added a
filter, but how can I query multiple scans from the same table calling this API
only once?
regards
jeetendra
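One possible workaround (an untested sketch; the table name, ranges, and the use of TableMapReduceUtil.convertScanToString are assumptions) is to build one RDD per start/stop row range and union them:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

// four illustrative (start, stop) row ranges
val ranges = Seq(("a", "f"), ("h", "k"), ("m", "p"), ("s", "v"))
val rdds = ranges.map { case (start, stop) =>
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "mytable")
  val scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(stop))
  conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
}
val all = rdds.reduce(_ union _)  // one logical result covering all 4 ranges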
Hey Xudong,
We have been digging into this issue for a while, and believe PR 5339
http://github.com/apache/spark/pull/5339 and PR 5334
http://github.com/apache/spark/pull/5334 should fix this issue.
There are two problems:
1. Normally we cache Parquet table metadata for better performance, but
when
Filed https://issues.apache.org/jira/browse/SPARK-6708 to track this.
Cheng
On 4/4/15 10:21 PM, Cheng Lian wrote:
I think this is a Spark SQL bug that dates back to at least 1.1.0.
The json_tuple function is implemented as
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The
Without the rest of your code, it's hard to know what might be
unserializable.
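If pacific.dr.VendorRecord is a class you own, the usual first step is to make it serializable; a minimal sketch (the field names are hypothetical):

// every field must itself be serializable (or be marked @transient)
class VendorRecord(val id: String, val name: String) extends java.io.Serializable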
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
You need to refresh the external table manually after updating the data
source outside Spark SQL:
- via Scala API: sqlContext.refreshTable("table1")
- via SQL: REFRESH TABLE table1;
Cheng
On 4/4/15 5:24 PM, Rex Xiong wrote:
Hi Spark Users,
I'm testing 1.3 new feature of parquet partition
Hello,
I have a case class like this:
case class A(
m: Map[Long, Long],
...
)
and constructed a DataFrame from Seq[A].
I would like to perform a groupBy on A.m(SomeKey). I can implement a UDF,
create a new Column, then invoke a groupBy on the new Column. But is that the
idiomatic way of doing
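One approach that may avoid an explicit UDF is SQL's map-indexing syntax. A sketch for Spark 1.3 (A is reduced to just its map field, and the key 1 is illustrative):

case class A(m: Map[Long, Long])
val df = sqlContext.createDataFrame(Seq(A(Map(1L -> 10L)), A(Map(1L -> 10L))))
df.registerTempTable("a_table")
sqlContext.sql("SELECT m[1] AS key_val, COUNT(*) AS cnt FROM a_table GROUP BY m[1]").show()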
I have two questions:
1) In a Spark Streaming program, after the various DStream transformations have
been set up,
the ssc.start() method is called to start the computation.
Can the underlying DAG change (i.e. add another map or maybe a join) after
ssc.start() has been
called (and maybe
I am not sure whether this is possible, but I have tried something like
SELECT time, src, dst, sum(val1), sum(val2) from table group by
src,dst;
and it works. I think it will give the same answer as you are expecting.
Hi all,
More good news! I was able to use mergeStrategy to assemble my Kinesis
consumer into an uber jar.
Here's what I added to *build.sbt*:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) =>
Hi all,
I am running some benchmarks on a simple Spark application which consists of:
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.
The processing time per batch
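For reference, here is a minimal sketch of such a pipeline (the paths are made up, and the JSON parsing is replaced by a stand-in split):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints")  // updateStateByKey requires checkpointing
val records = ssc.textFileStream("hdfs:///incoming")
val pairs = records.map(line => (line.split(",")(0), 1L))  // stand-in for JSON parsing
val state = pairs.updateStateByKey[Long] { (batch: Seq[Long], prev: Option[Long]) =>
  Some(prev.getOrElse(0L) + batch.sum)
}
state.print()
ssc.start()
ssc.awaitTermination()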
Hi All,
I am trying to build Spark 1.3.0 on a standalone Ubuntu 14.04 machine. I am
using the sbt/sbt assembly command to build it. This command works
fine with Spark 1.1.0, but for Spark 1.3 it gives the following
error.
Any help or suggestions to resolve this problem will highly
Hi All,
I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the
sbt command, i.e. sbt/sbt assembly, to build it. This command works well
with Spark 1.1; however, it gives the following error with Spark
1.3.0. Any help or suggestions to resolve this would highly be
Use the Maven build instead. From the README in the git repo (
https://github.com/apache/spark)
mvn -DskipTests clean package
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler
Hi,
I am currently testing my application with Spark in local mode, and I
set the master to local[4]. One thing I notice is that when a
groupBy/reduceBy operation is involved, the CPU usage can sometimes be
around 600% to 800%. I am wondering if this is expected? (As only 4 worker threads