We are trying to add 6 spare servers to our existing cluster. Those
machines have more CPU cores and more memory. Unfortunately, 3 of them can
only take 2.5” hard drives, and the total disk capacity of each of those nodes
is about 7TB. The other 3 nodes can only take 3.5” hard drives, but have 48TB each. In
addition, th
Usually your executors are killed by YARN for exceeding memory limits. You
can check the NodeManager's log to see whether your application got killed, or
use the command "yarn logs -applicationId " to download the logs.
On Thu, Dec 1, 2016 at 10:30 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote
e_etv_map#3807,
> demovalue_old_map#3808,map_type#3809]
>
>
> Thanks
> Swapnil
>
> On Sat, Nov 26, 2016 at 2:32 PM, Benyi Wang wrote:
>
>> Could you post the result of explain `c.explain`? If it is broadcast
>> join, you will see it in explain.
>>
>> On
Could you post the result of explain `c.explain`? If it is broadcast join,
you will see it in explain.
On Sat, Nov 26, 2016 at 10:51 AM, Swapnil Shinde
wrote:
> Hello
> I am trying a broadcast join on DataFrames but it is still doing a
> SortMergeJoin. I even tried setting spark.sql.autoBroadcas
Below is my test code using Spark 2.0.1. DeserializeToObject doesn’t show up
in filter() but does in map(). Does that mean map() is not a Tungsten operation?
case class Event(id: Long)
val e1 = Seq(Event(1L), Event(2L)).toDS
val e2 = Seq(Event(2L), Event(3L)).toDS
e1.filter(e => e.id < 10 && e.id > 5).explain
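For comparison, a hedged addition (not from the original message): the corresponding typed map() call, whose plan can be printed the same way to see where DeserializeToObject appears.
// hedged comparison on the same Dataset; explain(true) prints the full plan
e1.map(e => e.id + 1).explain(true)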
I had a problem when I used "spark.executor.userClassPathFirst" before; I
don't remember what the problem was.
Alternatively, you can use --driver-class-path and "--conf
spark.executor.extraClassPath". Unfortunately, you may feel as frustrated as
I did when trying to make it work.
Depends on how you ru
I would say change
class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat
to
class RawDataInputFormat[LongWritable, RDRawDataRecord] extends FileInputFormat
On Thu, Mar 17, 2016 at 9:48 AM, Mich Talebzadeh
wrote:
> Hi Tony,
>
> Is
>
> com.kiisoo.aegis.b
bal
> shuffle partition parameter
> val eventwkRepartitioned = eventwk.repartition(2)
> eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
> and use this in your insert statement.
>
> registering temp table is cheap
>
> HTH
>
>
> On 29 January 2016 at 20:26, Benyi
I want to insert into a partitioned table using dynamic partitions, but I
don’t want 200 files per partition, because the files would be small in my
case.
sqlContext.sql( """
|insert overwrite table event
|partition(eventDate)
|select
| user,
| detail,
| eventDate
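Putting the repartition suggestion from the reply above together with this insert, a hedged sketch (eventwk and the temp table name are taken from the thread; the from clause and stripMargin are assumed):
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
sqlContext.sql("""
  |insert overwrite table event
  |partition(eventDate)
  |select user, detail, eventDate
  |from event_wk_repartitioned
""".stripMargin)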
Never mind.
GenericUDAFCollectList supports struct in 1.3.0. I modified it and it works
in a tricky way.
I also found an example HiveWindowFunction.
On Thu, Jan 28, 2016 at 12:49 PM, Benyi Wang wrote:
> I'm trying to implement a WindowFunction like collect_list, but I have to
>
I'm trying to implement a WindowFunction like collect_list, but I have to
collect a struct. collect_list works only for primitive types.
I think I might modify GenericUDAFCollectList, but haven't tried it yet.
I'm wondering if there is an example showing how to write a custom
WindowFunction in Spa
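As a hedged aside: in later Spark versions (2.x), collect_list handles complex types natively and can be used over a window, so no custom UDAF is needed. A sketch with illustrative names (df and the column names are assumptions):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, struct}

// collect a struct per window partition; column names are illustrative
val w = Window.partitionBy("userId").orderBy("eventTime")
val collected = df.withColumn("events", collect_list(struct("eventType", "eventTime")).over(w))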
>>>> for shuffle read, I believe it is controlled by the SQL shuffle parallelism setting.
>>>>
>>>> In the doc, it mentions:
>>>>
>>>>- Automatically determine the number of reducers for joins and
>>>>groupbys: Currently in Spark SQL, y
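The setting referred to there is, I believe, spark.sql.shuffle.partitions; a hedged example of changing it:
// number of reducers (shuffle partitions) used by Spark SQL joins and aggregations
sqlContext.setConf("spark.sql.shuffle.partitions", "400")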
- I assume your parquet files are compressed. Gzip or Snappy?
- What Spark version did you use? It seems at least 1.4. If you use
spark-sql and Tungsten, you might have better performance, but Spark 1.5.2
gave me a wrong result when the data was about 300~400GB, just for a simple
gro
I don't understand this: "I have the following method code which I call it
from a thread spawn from spark driver. So in this case 2000 threads ..."
Why do you call it from a thread?
Are you processing one partition in one thread?
On Thu, Dec 10, 2015 at 11:13 AM, Benyi Wang wrot
DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
DataFrame frameToProcess = sourceFrame.except(filterFrame1);
except is really expensive, because it has to shuffle and compare both DataFrames. Do you actually want this instead:
sourceFrame.filter(! col("col1").contains("xyz"))
On Thu, Dec 10, 2015 at 9:57 AM, unk1102 wrote:
> Hi
at 12:44 PM, Michael Armbrust
wrote:
> The user facing type mapping is documented here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
>
> On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang
> wrote:
>
>> If I have two columns
>>
>> Struct
If I have two columns
StructType(Seq(
StructField("id", LongType),
StructField("phones", ArrayType(StringType
I want to add an index to “phones” before I explode it.
Can this be implemented as GenericUDF?
I tried DataFrame.explode. It worked for simple types like string, but I
could not f
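As a hedged aside: in Spark 2.1+ this can be done without a GenericUDF using posexplode, which emits the element index along with the value. A sketch with illustrative names (df and its columns are assumptions):
import org.apache.spark.sql.functions.posexplode

// yields columns: id, pos (the index within phones), col (the phone value)
val withIndex = df.select(df("id"), posexplode(df("phones")))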
I'm using spark-1.4.1 compiled against CDH 5.3.2. When I use
ALS.trainImplicit to build a model, I got this error when rank=40 and
iterations=30.
It worked for (rank=10, iteration=10) and (rank=20, iteration=20).
What was wrong with (rank=40, iterations=30)?
15/08/13 01:16:40 INFO sched
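For reference, a minimal sketch of the call being described (ratings is an assumed RDD[Rating] built from the input data):
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// rank = 40, iterations = 30, as in the failing case above
val model = ALS.trainImplicit(ratings, 40, 30)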
2.2.0.
On Fri, Aug 7, 2015 at 8:45 PM, Benyi Wang wrote:
> I'm trying to build spark 1.4.1 against CDH 5.3.2. I created a profile
> called cdh5.3.2 in spark_parent.pom, made some changes for
> sql/hive/v0.13.1, and the build finished successfully.
>
> Here is my proble
I'm trying to build spark 1.4.1 against CDH 5.3.2. I created a profile
called cdh5.3.2 in spark_parent.pom, made some changes for
sql/hive/v0.13.1, and the build finished successfully.
Here is my problem:
- If I run `mvn -Pcdh5.3.2,yarn,hive install`, the artifacts are
installed into my loc
You may try changing the schedulingMode to FAIR; the default is FIFO. Take
a look at this page
https://spark.apache.org/docs/1.1.0/job-scheduling.html#scheduling-within-an-application
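A hedged example of setting this in code (spark.scheduler.mode is the property; FAIR instead of the default FIFO; the app name is illustrative):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app").set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)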
On Sat, Jan 10, 2015 at 10:24 AM, YaoPau wrote:
> I'm looking for ways to reduce the runtime of my Spark job.
I got this error when I clicked "Track URL: ApplicationMaster" while running a
Spark job in YARN cluster mode. I found this jira
https://issues.apache.org/jira/browse/YARN-800, but I could not get the
problem fixed. I'm running CDH 5.1.0 with both HDFS and RM HA enabled. Does
anybody have a similar is
When I have a multi-step process flow like this:
A -> B -> C -> D -> E -> F
I need to store B and D's results into parquet files
B.saveAsParquetFile
D.saveAsParquetFile
If I don't cache/persist any step, Spark might recompute from A, B, C, D, and E
if something goes wrong in F.
Of course, I'd better
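A hedged sketch of the flow as described (stepB, stepC, stepD are hypothetical transformations and the paths are illustrative):
import org.apache.spark.storage.StorageLevel

// persist the intermediates that are also written out, so a failure in F
// does not force a recompute all the way from A
val B = stepB(A).persist(StorageLevel.MEMORY_AND_DISK)          // stepB is hypothetical
B.saveAsParquetFile("/data/B.parquet")

val D = stepD(stepC(B)).persist(StorageLevel.MEMORY_AND_DISK)   // stepC/stepD are hypothetical
D.saveAsParquetFile("/data/D.parquet")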
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't
support Hash join in this version.
On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das
wrote:
> How about Using SparkSQL <https://spark.apache.org/sql/>?
>
> Thanks
> Best Regards
>
> On Wed,
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
// build (K,V) pairs from A and B to prepare the join
val ja = A.map(r => (K1, Va))
val jb = B.map(r => (K1, Vb))
// join A and B
val jab = ja.join(jb)
// build (K,V) pairs from the joined result of A and B to prepare joining with C
val jc = C.ma
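A self-contained, hedged version of the same pattern with concrete placeholder types (RecA/RecB/RecC and all values are illustrative; sc is the usual SparkContext):
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on older Spark versions)

case class RecA(id: Int, a: String)
case class RecB(id: Int, b: String)
case class RecC(id: Int, c: String)

val ja = sc.parallelize(Seq(RecA(1, "a1"))).map(r => (r.id, r.a))
val jb = sc.parallelize(Seq(RecB(1, "b1"))).map(r => (r.id, r.b))
val jc = sc.parallelize(Seq(RecC(1, "c1"))).map(r => (r.id, r.c))

val jab  = ja.join(jb)   // RDD[(Int, (String, String))]
val jabc = jab.join(jc)  // RDD[(Int, ((String, String), String))]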
I don't need to cache RDDs in my Spark application, but there is a big
shuffle in the data processing. I can always see Shuffle spill (memory)
and Shuffle spill (disk). I'm wondering if I can give more memory to the
shuffle to avoid spilling to disk.
export SPARK_JAVA_OPTS='-Dspark.shuffle.memoryFractio
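A hedged equivalent via SparkConf (spark.shuffle.memoryFraction is the Spark 1.x property controlling how much of the heap the shuffle may use before spilling; 0.4 below is just an example value):
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.memoryFraction", "0.4")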
I'm using CDH 5.1.0 with Spark-1.0.0. There is spark-sql-1.0.0 in Cloudera's
maven repository. After putting it on the classpath, I can use spark-sql in
my application.
One issue is that I couldn't make the join a hash join. It gives a
CartesianProduct when I join two SchemaRDDs as follows:
scal
scala> user
res19: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,name#1], (ParquetRelation
/user/hive/warehouse/user), None
scala> order
res20: org.apache.spark.sql.SchemaRDD =
SchemaRDD[72] at RDD at SchemaRDD.scala:98
== Query
I can do that in my application, but I really want to know how I can do it
in spark-shell because I usually prototype in spark-shell before I put the
code into an application.
On Wed, Aug 20, 2014 at 12:47 PM, Sameer Tilak wrote:
> Hi Wang,
> Have you tried doing this in your application?
>
>
I want to use opencsv's CSVParser to parse csv lines using a script like
below in spark-shell:
import au.com.bytecode.opencsv.CSVParser;
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.hadoop.fs.{Path, FileSystem}
class MyKryoRegistrator
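A hedged sketch of how such a registrator is typically completed (the class to register is an assumption; the original snippet above is truncated):
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // register the opencsv parser class so Kryo knows how to serialize its instances
    kryo.register(classOf[CSVParser])
  }
}
It would then be enabled by setting spark.serializer to org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator to MyKryoRegistrator.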