By the way, this happens when I stopped the Driver process ...
On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
You mean to say within Runtime.getRuntime().addShutdownHook I call
ssc.stop(stopSparkContext = true, stopGracefully = true) ?
This
Hi Justin,
Can you try with sbt? Maybe that will help.
- Install sbt for windows
http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html
- Create a lib directory in your project directory
- Place these jars in it:
- spark-streaming-twitter_2.10-1.3.1.jar
-
I don't think you should rely on a shutdown hook. Ideally you try to
stop it in the main exit path of your program, even in case of an
exception.
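For illustration, a minimal sketch of that pattern in Scala (assuming the app builds its own StreamingContext named ssc; the names here are placeholders, not the poster's code):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("streaming-app"), Seconds(1))
    // ... define DStreams and output operations here ...
    try {
      ssc.start()
      ssc.awaitTermination()
    } finally {
      // stop in the main exit path, even if something above threw,
      // rather than relying on a JVM shutdown hook
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }
}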
On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
You mean to say within
Thanks Sean, you are right. If the driver program is running then I can handle
shutdown in the main exit path. But if the driver machine crashes (or if you just
stop the application, for example by killing the driver process), then the
shutdown hook is the only option, isn't it? What I am trying to say is, just
doing
Hi Peer,
If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside it. Best way to identify the bottleneck for a
stage is to see if there's any time spent on GC, and how many tasks are
there per stage (it should be a number total # cores to achieve
Thanks, Marcelo!
Below is the full log,
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
If you wanted to stop it gracefully, then why are you not calling
ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't
matter whether the shutdown hook was called or not.
TD
On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
If you don't want the fileStream to start until a certain event has
happened, why not start the streamingContext after that event?
TD
On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote:
I want to use a file stream as input. And I looked at the Spark Streaming
documentation again; it's
Hi Imran,
If I understood you correctly, you are suggesting to simply call broadcast
again from the driver program. This is exactly what I am hoping will work
as I have the Broadcast data wrapped up and I am indeed (re)broadcasting
the wrapper over again when the underlying data changes. However,
You mean to say within Runtime.getRuntime().addShutdownHook I call
ssc.stop(stopSparkContext
= true, stopGracefully = true) ?
This won't work anymore in 1.4.
The SparkContext got stopped before the Receiver processed all received blocks,
and I see the below exception in the logs. But if I add the
There was some similar discussion on JIRA
https://issues.apache.org/jira/browse/SPARK-3633; maybe that will give you
some insights.
Thanks
Best Regards
On Mon, May 18, 2015 at 10:49 PM, zia_kayani zia.kay...@platalytics.com
wrote:
Hi, I'm getting this exception after shifting my code
I am running Spark over Cassandra to process a single table.
My task reads a single day's worth of data from the table and performs 50 group
by and distinct operations, counting distinct userIds by different grouping
keys.
My code looks like this:
JavaRDD<Row> rdd =
What happens if, in a streaming application, one job is not yet finished and
the stream interval is reached? Does it start the next job, or wait for the first
to finish while the remaining jobs keep accumulating in a queue?
Say I have a streaming application with a stream interval of 1 sec, but my
job takes 2 min to
That sounds good; maybe you can send me a pseudo structure, as this is my first
Maven project.
best regards,
paul
2015-05-18 14:05 GMT+02:00 Robert Metzger rmetz...@apache.org:
Hi,
I would really recommend putting your Flink and Spark dependencies into
different Maven modules.
Having them both
A single job will run at a time by default (you can also
configure spark.streaming.concurrentJobs to run jobs in parallel, which is
not recommended for production).
Now, with your batch duration being 1 sec and processing time being 2 minutes,
if you are using a receiver based
Hi all,
I have a job that, for every row, creates about 20 new objects (i.e. RDD of
100 rows in = RDD 2000 rows out). The reason for this is each row is tagged
with a list of the 'buckets' or 'windows' it belongs to.
The actual data is about 10 billion rows. Each executor has 60GB of memory.
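That one-row-in, many-rows-out expansion is typically expressed with flatMap; a rough sketch (the input RDD and the bucketsFor tagging function are placeholders, not the poster's code):
// each row is emitted once per bucket/window it belongs to (~20 per row here)
def bucketsFor(row: String): Seq[String] = Seq("bucket-1", "bucket-2") // placeholder logic
val events = sc.textFile("hdfs:///path/to/events") // placeholder input
val tagged = events.flatMap(row => bucketsFor(row).map(bucket => (bucket, row)))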
Thanks for the quick response and confirmation, Marcelo, I just opened
https://issues.apache.org/jira/browse/SPARK-7725.
On Mon, May 18, 2015 at 9:02 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Shay,
Yeah, that seems to be a bug; it doesn't seem to be related to the default
FS nor
spark.streaming.concurrentJobs takes an integer value, not a boolean. If you
set it to 2 then 2 jobs will run in parallel. The default value is 1, and the next
job will start once the current one completes.
Actually, in the current implementation of Spark Streaming and under
default configuration,
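For reference, a sketch of setting that property when building the conf (the value and app name are just examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.concurrentJobs", "2") // allow 2 streaming jobs to run at once
val ssc = new StreamingContext(conf, Seconds(1))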
On 19 May 2015, at 03:08, Justin Pihony justin.pih...@gmail.com wrote:
15/05/18 22:03:14 INFO Executor: Fetching
http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with
timestamp 1432000973058
15/05/18 22:03:14 INFO Utils: Fetching
I tried to insert a flag in the RDD, so I could set a counter in the last position;
when the counter reaches X, I could do something. But in each slide the
original RDD comes through, even though I modified it.
I wrote this code to check if this is possible, but it doesn't work.
val rdd1WithFlag = rdd1.map {
Hi all,
I might be missing something, but does the new Spark 1.3 sqlContext save
interface support using Avro as the schema structure when writing Parquet
files, in a similar way to AvroParquetWriter (which I've got working)?
I've seen how you can load an avro file and save it as parquet from
Hi, experts.
I ran the HBaseTest program which is an example from the Apache Spark source
code to learn how to use spark to access HBase. But I met the following
exception:
Exception in thread "main"
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=36, exceptions:
That's right. Also, Spark SQL can automatically infer schema from JSON
datasets. You don't need to specify an Avro schema:
sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path")
or with the new reader/writer API introduced in 1.4-SNAPSHOT:
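The quoted example is cut off here; a sketch of what the 1.4 reader/writer API call would look like (the paths are placeholders):
val df = sqlContext.read.json("json/path")
df.write.parquet("parquet/path")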
Thanks Cheng, that's brilliant, you've saved me a headache.
Ewan
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:58
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or
createDataFrame Interfaces?
That's right.
Hi,
I am using spark 1.3.1
Regards,
Madhukara Phatak
http://datamantra.io/
On Tue, May 19, 2015 at 4:34 PM, Wangfei (X) wangf...@huawei.com wrote:
And which version are you using
Sent from my iPhone
On May 19, 2015, at 18:29, ayan guha guha.a...@gmail.com wrote:
can you kindly share your code?
On
And which version are you using
Sent from my iPhone
On May 19, 2015, at 18:29, ayan guha guha.a...@gmail.com wrote:
can you kindly share your code?
On Tue, May 19, 2015 at 8:04 PM, madhu phatak
phatak@gmail.com wrote:
Hi,
I am trying to run spark sql
Is that a Spark or a Spark Streaming application?
Re the map transformation which is required, you can also try flatMap.
Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN –
giving 60GB of RAM to a single JVM will certainly result in “off the charts” GC. I
would
Hi Ewan,
Different from AvroParquetWriter, in Spark SQL we use StructType as the
intermediate schema format. So when converting Avro files to Parquet
files, we internally convert the Avro schema to a Spark SQL StructType first,
and then convert the StructType to a Parquet schema.
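To illustrate that intermediate format, a sketch of building a DataFrame from an explicit StructType and writing it to Parquet (the two fields and paths are made up for the example):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// hand-built schema standing in for one converted from an Avro record
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))
val rowRDD = sc.parallelize(Seq(Row(1L, "a"), Row(2L, "b")))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.saveAsParquetFile("parquet/path") // Spark 1.3.x API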
Cheng
On 5/19/15
Hi,
I am trying to run a Spark SQL aggregation on a file with 26k columns. The number
of rows is very small. I am running into an issue where Spark is taking a huge
amount of time to parse the SQL and create a logical plan. Even if I have
just one row, it's taking more than 1 hour just to get past the parsing.
Any
Thanks Cheng, that makes sense.
So for new dataframe creation (not conversion from Avro but from JSON or CSV
inputs) in Spark we shouldn't worry about using Avro at all, just use the Spark
SQL StructType when building new Dataframes? If so, that will be a lot simpler!
Thanks,
Ewan
From: Cheng
Hi,
I have fields from field_0 to field_26000. The query selects
max(cast($columnName as double)), min(cast($columnName as double)),
avg(cast($columnName as double)), count(*)
for all those 26000 fields in one query.
Regards,
Madhukara Phatak
http://datamantra.io/
On Tue, May 19,
Hi Team,
I am new to Spark and learning it.
I am trying to read image files into a Spark job. This is how I am doing it:
Step 1. Created sequence files with FileName as Key and Binary image as
value. i.e. Text and BytesWritable.
I am able to read these sequence files into Map Reduce programs.
Step 2.
I
can you kindly share your code?
On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am trying to run a Spark SQL aggregation on a file with 26k columns. The number
of rows is very small. I am running into an issue where Spark is taking a huge
amount of time to parse the SQL and
Hi,
An additional piece of information: the table is backed by a CSV file which is
read using spark-csv from Databricks.
Regards,
Madhukara Phatak
http://datamantra.io/
On Tue, May 19, 2015 at 4:05 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I have fields from field_0 to field_26000. The query
Which user did you run your program as?
Have you granted the proper permissions on the HBase side?
You should also check the master log to see if there is some clue.
Cheers
On May 19, 2015, at 2:41 AM, donhoff_h 165612...@qq.com wrote:
Hi, experts.
I ran the HBaseTest program which is an
What does your spark-env file say? Are you setting the number of executors in
the Spark context?
On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote:
Hi,
I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB
of RAM.
I have around 600,000+ Json files on HDFS. Each file
Check out Apache's trademark guidelines here:
http://www.apache.org/foundation/marks/
Matei
On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote:
What is the license on using the Spark logo? Is it free to be used for
displaying
Thanks!
On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Check out Apache's trademark guidelines here:
http://www.apache.org/foundation/marks/
Matei
On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com
wrote:
What is the license on using the
Hive on Spark vs. Spark SQL: which is better, and what are the key
characteristics and the advantages and disadvantages of each?
guoqing0...@yahoo.com.hk
I think this is the PR you could refer to:
https://github.com/apache/spark/pull/2994.
2015-05-20 13:41 GMT+08:00 twinkle sachdeva twinkle.sachd...@gmail.com:
Hi,
As Spark Streaming is nicely integrated with consuming messages from
Kafka, I thought of asking the forum: is there
Thanks. I will try and let you know. But what exactly is the issue? Any
pointers?
Regards
Tapan
On Tue, May 19, 2015 at 6:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try something like:
JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir,
Hi,
As Spark Streaming is nicely integrated with consuming messages from
Kafka, I thought of asking the forum: is there any implementation
available for pushing data to Kafka from Spark Streaming too?
Any link(s) will be helpful.
Thanks and Regards,
Twinkle
Hi all,
I am trying to set up an external metastore using Microsoft SQL on Azure. It
works OK initially, but after about 5 mins of inactivity it hangs, then times
out after 15 mins with this error:
15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing
off this connection and all
Hi
I'm learning Spark, focused on data and machine learning, migrating from SAS.
Is there a group for it? My questions are basic for now and I am getting very few
answers.
Tal
Rick.
Sent from my Samsung Galaxy smartphone.
This message and its attachments are
Sorry, this ref does not help me. I have set up the configuration in
hbase-site.xml. But it seems there are still some extra configurations to be
set, or APIs to be called, to make my Spark program able to pass
authentication with HBase.
Does anybody know how to set up authentication to
Hi,
Thanks for the response. I was looking for a java solution. I will check the
scala and python ones.
Regards,
Anand.C
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Tuesday, May 19, 2015 6:17 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: ayan guha; user
Subject: Re: Spark sql error while
Hi,
I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB
of RAM.
I have around 600,000+ Json files on HDFS. Each file is small around 1KB in
size. Total data is around 16GB. Hadoop block size is 256MB.
My application reads these files with sc.textFile() (or sc.jsonFile()
The problem is still there.
The exception is not coming at the time of reading.
Also, the count of the JavaPairRDD is as expected. It is when we call the
collect() or toArray() methods that the exception comes.
It is something to do with the Text class, even though I haven't used it in the
program.
Regards
Tapan
On
Hi Jim,
this is definitely strange. It sure sounds like a bug, but it also is a
very commonly used code path, so at the very least you must be hitting a
corner case. Could you share a little more info with us? What version of
Spark are you using? How big is the object you are trying to
Hi,
I'd like to confirm an observation I've just made. Specifically that spark
is only available in repo1.maven.org for one Hadoop variant.
The Spark source can be compiled against a number of different Hadoops
using profiles. Yay.
However, the spark jars in repo1.maven.org appear to be compiled
I think your observation is correct.
e.g.
http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.3.1
shows that it depends on hadoop-client
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client from
hadoop 2.2
Cheers
On Tue, May 19, 2015 at 6:17 PM, Edward Sargisson
What is the license on using the Spark logo? Is it free to be used for
displaying commercially?
The batch version of this is part of the rowSimilarities JIRA 4823... if your
query points can fit in memory there is a broadcast version which we are
experimenting with internally... we are using brute force KNN right now in
the PR... based on the FLANN paper, LSH did not work well, but before you go to
Hi,
Tested calculating values for 300 columns. The analyser takes around 4
minutes to generate the plan. Is this normal?
Regards,
Madhukara Phatak
http://datamantra.io/
On Tue, May 19, 2015 at 4:35 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am using spark 1.3.1
Regards,
I believe you're looking for df.na.fill in Scala; in the pySpark module it is
fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
From the docs:
df4.fillna({'age': 50, 'name': 'unknown'}).show()
age  height  name
10   80      Alice
5    null    Bob
50   null    Tom
50   null    unknown
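A sketch of the Scala equivalent (assuming a DataFrame named df with the same columns):
val filled = df.na.fill(Map("age" -> 50, "name" -> "unknown")) // per-column defaults for nulls
filled.show()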
On
hmm, I guess it depends on the way you look at it. In a way, I'm saying
that spark does *not* have any built in auto-re-broadcast if you try to
mutate a broadcast variable. Instead, you should create something new, and
just broadcast it separately. Then just have all the code you have
operating
The principal is sp...@bgdt.dev.hrb. It is the user that I used to run my spark
programs. I am sure I have run the kinit command to make it take effect. And I
also used the HBase Shell to verify that this user has the right to scan and
put the tables in HBase.
Now I still have no idea how to
Hi,
can you please share how you resolved the parsing issue? It would be of great
help...
Thanks.
Hi,
Another update: when run on more than 1000 columns I am getting
Could not write class
__wrapper$1$40255d281a0d4eacab06bcad6cf89b0d/__wrapper$1$40255d281a0d4eacab06bcad6cf89b0d$$anonfun$wrapper$1$$anon$1
because it exceeds JVM code size limits. Method apply's code too large!
Regards,
Cool. Thanks for the detailed response Cody.
Thanks
Best Regards
On Tue, May 19, 2015 at 6:43 PM, Cody Koeninger c...@koeninger.org wrote:
If those questions aren't answered by
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
please let me know so I can update it.
I guess it's a typo: eu.stratosphere should be replaced by
org.apache.flink
On Tue, May 19, 2015 at 1:13 PM, Alexander Alexandrov
alexander.s.alexand...@gmail.com wrote:
We managed to do this with the following config:
// properties
<!-- Hadoop -->
I was trying to implement this example:
http://spark.apache.org/docs/1.3.1/sql-programming-guide.html#hive-tables
It worked well when I built Spark in the terminal using the command specified here:
http://spark.apache.org/docs/1.3.1/building-spark.html#building-with-hive-and-jdbc-support
But when I try to
Hi,
Tested with HiveContext also. It also takes a similar amount of time.
To make things clear, the following is the select clause for a given column:
aggregateStats($columnName, max(cast($columnName as double)),
min(cast($columnName as double)), avg(cast($columnName as double)),
count(*))
If those questions aren't answered by
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
please let me know so I can update it.
If you set auto.offset.reset to largest, it will start at the largest
offset. Any messages before that will be skipped, so if prior runs of the
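For reference, a sketch of passing that setting to the direct stream (the broker, topic, and the surrounding StreamingContext ssc are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset" -> "largest") // start from the latest offset; earlier messages are skipped
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))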
So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API
and non-receiver-based streaming uses the low-level API.
1. In high-level receiver-based streaming, does it register consumers at
each job start (whenever a new job is launched by the streaming application, say
every second)?
2.No
Hello all,
I have an error in pyspark for which I have not the faintest idea of the
cause. All I can tell from the stack trace is that it can't find a pyspark file
on the path /mnt/spark-*/pyspark-*. Apart from that I need someone more
experienced than me with Spark to look into it and help
On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com
wrote:
So for Kafka + Spark Streaming, receiver-based streaming uses the high-level API
and non-receiver-based streaming uses the low-level API.
1. In high-level receiver-based streaming, does it register consumers at
each job
Dear Experts,
we have a spark cluster (standalone mode) in which master and workers
are started from the root account. Everything runs correctly up to the point
when we try doing operations such as
dataFrame.select("name", "age").save(ofile, "parquet")
or
rdd.saveAsPickleFile(ofile)
, where
Hey Tom,
Are you using the fine-grained or coarse-grained scheduler? For the
coarse-grained scheduler, there is a spark.cores.max config setting that will
limit the total # of cores it grabs. This was there in earlier versions too.
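For example, a sketch of the relevant settings (the cap of 32 is arbitrary):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.coarse", "true") // coarse-grained Mesos scheduler
  .set("spark.cores.max", "32")      // cap the total cores the app grabs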
Matei
On May 19, 2015, at 12:39 PM, Thomas Dudziak
Hi,
Can you please add us to the list of Spark Users
Org: PanTera
URL: http://pantera.io
Components we are using:
- PanTera uses a direct access to the Spark Scala API
- Spark Core SparkContext, JavaSparkContext, SparkConf, RDD, JavaRDD,
- Accumulable, AccumulableParam, Accumulator,
I'm using fine-grained for a multi-tenant environment which is why I would
welcome the limit of tasks per job :)
cheers,
Tom
On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hey Tom,
Are you using the fine-grained or coarse-grained scheduler? For the
I read the other day that there will be a fair number of improvements in
1.4 for Mesos. Could I ask for one more (if it isn't already in there): a
configurable limit for the number of tasks for jobs run on Mesos? This
would be a very simple yet effective way to prevent a job dominating the
Hi,
Can anybody see what's wrong in this piece of code:
./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("/user/p_loadbd/fraude5.csv").map(x =>
Just to add, there is a Receiver based Kafka consumer which uses Kafka Low
Level Consumer API.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Regards,
Dibyendu
On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
On Tue, May 19, 2015 at 8:10 PM,
Hi Keerthi
As Xiangrui mentioned in the reply, the categorical variables are assumed
to be encoded as integers between 0 and k - 1, if k is the parameter you
are passing as the category info map. So you will need to handle this
during parsing (your columns 3 and 6 need to be converted into ints
Thanks Imran. It does help clarify. I believe I had it right all along then
but was confused by documentation talking about never changing the
broadcasted variables.
I've tried it on a local mode process till now and does seem to work as
intended. When (and if !) we start running on a real
Try something like:
JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir,
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
IntWritable.class,
Text.class, new Job().getConfiguration());
With the type of input format that you require.
Thanks
Best
You may want to look at this tooling for helping identify performance
issues and bottlenecks:
https://github.com/kayousterhout/trace-analysis
I believe this is slated to become part of the web ui in the 1.4 release,
in fact based on the status of the JIRA,
Please take a look at:
http://hbase.apache.org/book.html#_client_side_configuration_for_secure_operation
Cheers
On Tue, May 19, 2015 at 5:23 AM, donhoff_h 165612...@qq.com wrote:
The principal is sp...@bgdt.dev.hrb. It is the user that I used to run my
spark programs. I am sure I have run
Yeah, this definitely seems useful there. There might also be some ways to cap
the application in Mesos, but I'm not sure.
Matei
On May 19, 2015, at 1:11 PM, Thomas Dudziak tom...@gmail.com wrote:
I'm using fine-grained for a multi-tenant environment which is why I would
welcome the limit
Thanks Akhil and Dibyendu.
In high-level receiver-based streaming, do executors run on the receiver nodes
themselves to have data locality? Or is the data always transferred to
executor nodes, with the executor nodes differing in each run of a job while the
receiver nodes remain the same (same machines) throughout the life of
Hi Justin Ram,
To clarify, PipelineModel.stages is not private[ml]; only the PipelineModel
constructor is private[ml]. So it's safe to use pipelineModel.stages as a
Spark user.
Ram's example looks good. Btw, in Spark 1.4 (and the current master
build), we've made a number of improvements to
The documentation needs to be updated to state that higher metric
values are better (https://issues.apache.org/jira/browse/SPARK-7740).
I don't know why, if you negate the return value of the Evaluator, you
still get the highest regularization parameter candidate. Maybe you
should check the log
Hi,
We would like to be added to the Powered by Spark list:
organization name: Localytics
URL: http://eng.localytics.com/
a list of which Spark components you are using: Spark, Spark Streaming,
MLLib
a short description of your use case: Batch, real-time, and predictive
analytics driving our
The way these files are accessed is inherently sequential-access. There
isn't a way, in general, to know where record N is in a file like this and
jump to it. So they must be read to be sampled.
On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Hi
Under certain circumstances that I haven't yet been able to isolate, I get
the following error when doing a HQL query using HiveContext (Spark 1.3.1
on Mesos, fine-grained mode). Is this a known problem or should I file a
JIRA for it ?
org.apache.spark.SparkException: Can only zip RDDs with same
customerDF.groupBy("state").agg(max($"discount").alias("newName"))
(or .as(...), both functions can take a String or a Symbol)
On Tue, May 19, 2015 at 2:11 PM, Cesar Flores ces...@gmail.com wrote:
I would like to ask if there is a way of specifying the column name of a
data frame aggregation. For
With a vocabulary size of 4M and a vector size of 400, you need 400 * 4M = 1.6B
floats to store the model. At 4 bytes per float, that is about 6.4GB. We store the
model on the driver node in the current implementation. So I don't think it would
work. You might try increasing the minCount to decrease the vocabulary
size and reduce the
Does Spark 1.3.1 support Hive 1.0? If not, which version of Spark will
start supporting Hive 1.0?
--
Kannan
Thanks for asking! We should improve the documentation. The sample
dataset is actually mimicking the MNIST digits dataset, where the
values are gray levels (0-255). So by dividing by 16, we want to map
it to 16 coarse bins for the gray levels. Actually, there is a bug in
the doc, we should convert
I would like to ask if there is a way of specifying the column name of a
data frame aggregation. For example, if I do:
customerDF.groupBy("state").agg(max($"discount"))
the name of my aggregated column will be: MAX('discount)
Is there a way of changing the name of that column to something else on
PySpark works with CPython by default, and you can specify which
version of Python to use with:
PYSPARK_PYTHON=path/to/python bin/spark-submit xxx.py
When you do the upgrade, you could install python 2.7 on every machine
in the cluster, test it with
PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py
You need to convert DataFrame to RDD, call sampleByKey, and then apply
the schema back to create DataFrame.
val df: DataFrame = ...
val schema = df.schema
val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
val sampled = sqlContext.createDataFrame(sampledRDD, schema)
If you checkpoint, the job will start from the successfully consumed
offsets. If you don't checkpoint, by default it will start from the
highest available offset, and you will potentially lose data.
Is the link I posted, or for that matter the scaladoc, really not clear on
that point?
The
Hi
I'm using Spark 1.3.1 and now I can't set the size of the generated part files
for Parquet.
The size is only 512KB, which is really too small; I need to make them bigger.
How can I set this?
Thanks
Thank you!
On Tue, May 19, 2015 at 21:08, Xiangrui Meng men...@gmail.com wrote:
In 1.4, we added RAND as a DataFrame expression, which can be used for
random split. Please check the example here:
https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214.
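A sketch of using that expression for a rough 80/20 split (assuming a DataFrame named df; the column name and seed are arbitrary):
import org.apache.spark.sql.functions.rand

val withRnd = df.withColumn("rnd", rand(42))                   // Spark 1.4+ rand() expression
val train = withRnd.filter(withRnd("rnd") < 0.8).drop("rnd")   // ~80% of rows
val test  = withRnd.filter(withRnd("rnd") >= 0.8).drop("rnd")  // remaining ~20%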
Hey Jaonary,
I saw this line in the error message:
org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163)
CaseClassStringParser is only used in older versions of Spark to parse
schema from JSON. So I suspect that the cluster was running on an old
version of Spark
We use a dense array to store the covariance matrix on the driver
node. So its length is limited by the integer range, which is 65536 *
65536 (actually half). -Xiangrui
On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers
sebastian.alf...@googlemail.com wrote:
Hello,
in order to compute a huge
I'm not sure whether k-means would converge with this customized
distance measure. You can list (weighted) time as a feature along with
coordinates, and then use Euclidean distance. For other supported
distance measures, you can check Derrick's package:
Just curious, what distance measure do you need? -Xiangrui
On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
take a look at this
https://github.com/derrickburns/generalized-kmeans-clustering
Best,
Jao
On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko